<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Alex Merced&apos;s Lakehouse Blog</title><description>A blog about Apache Iceberg &amp; the Agentic Data Lakehouse space.</description><link>https://iceberglakehouse.com/</link><item><title>Context Management Strategies for VS Code with LLM Plugins: A Complete Guide to Building Your Own AI-Powered IDE</title><link>https://iceberglakehouse.com/posts/2026-03-context-vscode-llm-plugins/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-vscode-llm-plugins/</guid><description>
Visual Studio Code is the most widely used code editor in the world, and its extensibility means you can integrate AI capabilities through a growing ...</description><pubDate>Sun, 08 Mar 2026 01:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Visual Studio Code is the most widely used code editor in the world, and its extensibility means you can integrate AI capabilities through a growing ecosystem of LLM plugins. Unlike purpose-built AI editors (Cursor, Windsurf, Zed), VS Code gives you the freedom to choose and combine AI extensions, configure them to your preferences, and even switch between providers without changing editors. The tradeoff is that context management is not as seamlessly integrated as in dedicated AI editors. It requires more deliberate configuration.&lt;/p&gt;
&lt;p&gt;This guide covers context management strategies for the most popular VS Code AI extensions: GitHub Copilot, Continue, Cline (formerly Claude Dev), Aider, and others. It explains what context management capabilities each offers and how to configure them for maximum effectiveness.&lt;/p&gt;
&lt;h2&gt;The VS Code AI Extension Landscape&lt;/h2&gt;
&lt;p&gt;VS Code&apos;s AI extension ecosystem falls into several categories:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Extensions&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inline completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GitHub Copilot, Codeium, Supermaven&lt;/td&gt;
&lt;td&gt;Suggest code as you type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chat panel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Copilot Chat, Continue, Cody&lt;/td&gt;
&lt;td&gt;Conversational AI in a sidebar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agentic coding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cline, Aider, Roo Code&lt;/td&gt;
&lt;td&gt;Autonomous agents that read/write files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Specialized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mintlify, Tabnine&lt;/td&gt;
&lt;td&gt;Documentation, enterprise-focused&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each category manages context differently. Inline completion plugins use the current file and nearby tabs. Chat panel plugins use conversation history and file references. Agentic plugins have the broadest context, reading the codebase, running commands, and making multi-file changes.&lt;/p&gt;
&lt;h2&gt;GitHub Copilot: Context Management&lt;/h2&gt;
&lt;p&gt;GitHub Copilot is the most widely used AI coding assistant. Its context management has evolved significantly with the introduction of Copilot Chat and Agent Mode.&lt;/p&gt;
&lt;h3&gt;Inline Completions&lt;/h3&gt;
&lt;p&gt;Copilot&apos;s inline suggestions use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The current file content (especially the lines around your cursor)&lt;/li&gt;
&lt;li&gt;Open tabs in the editor (nearby files provide additional context)&lt;/li&gt;
&lt;li&gt;File names and directory structure (for naming conventions)&lt;/li&gt;
&lt;li&gt;Comment and docstring context (comments above your cursor guide suggestions)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Keep related files open in tabs. Copilot considers open files as context, so having related source files, type definitions, and tests open improves suggestion quality.&lt;/p&gt;
&lt;h3&gt;Copilot Chat&lt;/h3&gt;
&lt;p&gt;Copilot Chat operates in the sidebar with conversation-based interaction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;#file&lt;/code&gt; to reference specific files&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;#editor&lt;/code&gt; to reference the active editor content&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;#selection&lt;/code&gt; to reference selected code&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;#codebase&lt;/code&gt; to search the workspace&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;@workspace&lt;/code&gt; to ask questions about the entire project&lt;/li&gt;
&lt;/ul&gt;
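&lt;p&gt;For example, a single prompt can combine these references (illustrative only; the file path here is hypothetical, and the exact variable syntax can vary by Copilot Chat version):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@workspace How is authentication handled in this project?
#file:src/auth/middleware.ts Explain what this middleware does.
#selection Refactor the selected function to use async/await.
&lt;/code&gt;&lt;/pre&gt;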
&lt;h3&gt;Copilot Agent Mode&lt;/h3&gt;
&lt;p&gt;Agent Mode (introduced in 2025) makes Copilot an autonomous agent that can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Plan multi-step changes&lt;/li&gt;
&lt;li&gt;Read and write files across the project&lt;/li&gt;
&lt;li&gt;Run terminal commands&lt;/li&gt;
&lt;li&gt;Make and verify changes iteratively&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agent Mode uses the broadest context of any Copilot feature: it can explore the codebase, read &lt;code&gt;package.json&lt;/code&gt;, check test results, and understand project structure before making changes.&lt;/p&gt;
&lt;h3&gt;Custom Instructions for Copilot&lt;/h3&gt;
&lt;p&gt;Create a &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; file in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Copilot Instructions

## Code Style
- Use TypeScript strict mode
- Prefer functional components with hooks
- Use named exports, not default exports
- Follow the Airbnb ESLint configuration

## Testing
- Write tests using Vitest
- Use React Testing Library for component tests
- Mock API calls with MSW

## Architecture
- Components go in src/components/
- API clients go in src/api/
- Shared types go in src/types/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These instructions are loaded by Copilot for every interaction within the project, functioning like .cursor/rules/ in Cursor.&lt;/p&gt;
&lt;h2&gt;Continue: Open-Source AI Extension&lt;/h2&gt;
&lt;p&gt;Continue is an open-source VS Code extension that supports multiple LLM providers and offers extensive context management features.&lt;/p&gt;
&lt;h3&gt;Provider Configuration&lt;/h3&gt;
&lt;p&gt;Continue supports:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OpenAI, Anthropic, Google models via API keys&lt;/li&gt;
&lt;li&gt;Ollama for local models&lt;/li&gt;
&lt;li&gt;Any OpenAI-compatible endpoint&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Context Providers&lt;/h3&gt;
&lt;p&gt;Continue&apos;s &amp;quot;@-mention&amp;quot; context system includes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context Provider&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include a specific file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include code blocks from the codebase&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@docs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search indexed documentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@codebase&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Semantic search across the project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@terminal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include recent terminal output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@diff&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include current Git diff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@repo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include repository metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@folder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include folder structure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;.continuerc.json Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;models&amp;quot;: [
    {
      &amp;quot;title&amp;quot;: &amp;quot;Claude Sonnet&amp;quot;,
      &amp;quot;provider&amp;quot;: &amp;quot;anthropic&amp;quot;,
      &amp;quot;model&amp;quot;: &amp;quot;claude-sonnet-4-20250514&amp;quot;,
      &amp;quot;apiKey&amp;quot;: &amp;quot;your-key&amp;quot;
    },
    {
      &amp;quot;title&amp;quot;: &amp;quot;Local Llama&amp;quot;,
      &amp;quot;provider&amp;quot;: &amp;quot;ollama&amp;quot;,
      &amp;quot;model&amp;quot;: &amp;quot;llama3.1:70b&amp;quot;
    }
  ],
  &amp;quot;customCommands&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: &amp;quot;review&amp;quot;,
      &amp;quot;prompt&amp;quot;: &amp;quot;Review this code for security issues, performance problems, and style violations.&amp;quot;
    }
  ],
  &amp;quot;docs&amp;quot;: [
    {
      &amp;quot;title&amp;quot;: &amp;quot;React Docs&amp;quot;,
      &amp;quot;startUrl&amp;quot;: &amp;quot;https://react.dev/reference&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Why Continue Stands Out for Context&lt;/h3&gt;
&lt;p&gt;Continue&apos;s open-source nature means you can inspect exactly how context is assembled. Its support for custom context providers extends beyond built-in options, allowing teams to create project-specific context sources.&lt;/p&gt;
&lt;h2&gt;Cline (formerly Claude Dev): Agentic Coding Agent&lt;/h2&gt;
&lt;p&gt;Cline is a VS Code extension that turns Claude into an autonomous coding agent within the editor.&lt;/p&gt;
&lt;h3&gt;Context Capabilities&lt;/h3&gt;
&lt;p&gt;Cline has one of the broadest context scopes among VS Code extensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reads and writes files across the entire project&lt;/li&gt;
&lt;li&gt;Runs terminal commands&lt;/li&gt;
&lt;li&gt;Browses the web (for documentation lookup)&lt;/li&gt;
&lt;li&gt;Takes screenshots of running applications&lt;/li&gt;
&lt;li&gt;Manages its own task history&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Project Instructions&lt;/h3&gt;
&lt;p&gt;Create a &lt;code&gt;.clinerules&lt;/code&gt; file in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project: SaaS Application

## Stack
- Python 3.12 with FastAPI
- PostgreSQL with SQLAlchemy
- Redis for caching
- React frontend with TypeScript

## Build Commands
- Backend: `uvicorn app.main:app --reload`
- Frontend: `npm run dev`
- Tests: `pytest -v`

## Conventions
- All API responses use the ResponseModel pattern
- Database sessions are managed by dependency injection
- Frontend state uses React Query for server state
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Custom MCP Servers&lt;/h3&gt;
&lt;p&gt;Cline supports MCP servers configured through its settings panel, enabling connections to databases, APIs, and other external tools directly within the VS Code environment.&lt;/p&gt;
&lt;h3&gt;Context Window Management&lt;/h3&gt;
&lt;p&gt;Cline tracks context window usage and can summarize previous conversation history when the window fills up. This automatic context management prevents the common problem of long sessions degrading quality.&lt;/p&gt;
&lt;h2&gt;Aider: Git-Aware AI Pair Programmer&lt;/h2&gt;
&lt;p&gt;Aider integrates with VS Code as a terminal-based tool that focuses on Git-aware code modifications.&lt;/p&gt;
&lt;h3&gt;Context Management in Aider&lt;/h3&gt;
&lt;p&gt;Aider uses a unique context model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chat files:&lt;/strong&gt; Files actively being discussed and modified&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read-only files:&lt;/strong&gt; Files included as reference context that the AI will not edit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repository map:&lt;/strong&gt; An overview of the entire repository structure that fits in context&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Commands for Context Control&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;/add src/auth/middleware.ts      # Add to chat context (can be edited)
/read-only docs/architecture.md  # Add as read-only context
/drop src/auth/middleware.ts     # Remove from context
/map                             # Show the repository map
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The Repository Map&lt;/h3&gt;
&lt;p&gt;Aider&apos;s repository map is a compressed representation of your entire codebase (file names, function signatures, class definitions) that fits within the context window. This gives the AI a bird&apos;s-eye view of the project without consuming the entire context budget.&lt;/p&gt;
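&lt;p&gt;An excerpt of a repository map might look like the following (an illustrative sketch with hypothetical file names; the exact output format depends on the Aider version and language):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;src/auth/middleware.ts:
  function requireAuth(req, res, next)
  function verifyToken(token)
src/api/client.ts:
  class ApiClient
    get(path)
    post(path, body)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because only signatures are included, even a large codebase can be summarized in a few thousand tokens.&lt;/p&gt;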
&lt;h2&gt;Thinking About Context Levels Across Extensions&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Quick Completions)&lt;/h3&gt;
&lt;p&gt;For inline code completions, Copilot and Supermaven work well with minimal setup. Keep related files open in tabs and let the extension use the editor context.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Feature Development)&lt;/h3&gt;
&lt;p&gt;Use a chat extension (Copilot Chat, Continue) with explicit file references. The @-mention system lets you include exactly the files relevant to the current task.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Major Refactoring)&lt;/h3&gt;
&lt;p&gt;Use an agentic extension (Cline, Copilot Agent Mode) that can explore the codebase, run tests, and make changes across multiple files. Configure project instructions (.clinerules, copilot-instructions.md, .continuerc.json) to ensure the agent follows your conventions.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown Is Universal&lt;/h3&gt;
&lt;p&gt;All VS Code AI extensions work natively with Markdown. Project instructions, coding standards, and architecture documents should be Markdown files in your repository.&lt;/p&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;Most VS Code extensions do not parse PDFs directly. If you have reference material in PDF form, extract relevant sections into Markdown files. Some extensions (like Cline with web browsing) can fetch online documentation, reducing the need for local PDF conversion.&lt;/p&gt;
&lt;h3&gt;Documentation Indexing&lt;/h3&gt;
&lt;p&gt;Continue supports documentation indexing through &lt;code&gt;@docs&lt;/code&gt;. Add your framework documentation URLs to the extension configuration so the AI can reference current documentation during conversations.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;MCP support varies by extension:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Extension&lt;/th&gt;
&lt;th&gt;MCP Support&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Settings panel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Continue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;config.json&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Copilot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Through GitHub integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aider&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Direct terminal commands instead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For extensions that support MCP, the configuration follows the standard pattern: specify the server command, arguments, and environment variables. MCP tools become available within the extension&apos;s chat or agent interface.&lt;/p&gt;
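&lt;p&gt;As a sketch, an MCP server entry typically looks like this (the server package and connection string are examples; the exact file location and key names depend on the extension):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-postgres&amp;quot;],
      &amp;quot;env&amp;quot;: { &amp;quot;DATABASE_URL&amp;quot;: &amp;quot;postgresql://localhost/mydb&amp;quot; }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;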
&lt;h2&gt;settings.json: Centralizing AI Configuration&lt;/h2&gt;
&lt;p&gt;VS Code&apos;s &lt;code&gt;settings.json&lt;/code&gt; is where many AI extensions read their configuration. Here are common settings patterns:&lt;/p&gt;
&lt;h3&gt;Per-Workspace Settings&lt;/h3&gt;
&lt;p&gt;Create a &lt;code&gt;.vscode/settings.json&lt;/code&gt; file in your project to configure AI extensions per-project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;github.copilot.enable&amp;quot;: {
    &amp;quot;markdown&amp;quot;: true,
    &amp;quot;plaintext&amp;quot;: false
  },
  &amp;quot;continue.enableTabAutocomplete&amp;quot;: false,
  &amp;quot;cline.customInstructions&amp;quot;: &amp;quot;Follow the conventions in INSTRUCTIONS.md&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Per-workspace settings override user-level settings, allowing you to tailor AI behavior to each project&apos;s needs.&lt;/p&gt;
&lt;h3&gt;Workspace Trust and Security&lt;/h3&gt;
&lt;p&gt;VS Code&apos;s Workspace Trust feature is important when using AI extensions. In untrusted workspaces, some extensions may limit their capabilities (for example, restricting file access or command execution). This is a security feature: it prevents untrusted code from being automatically processed by AI tools that have file system access.&lt;/p&gt;
&lt;p&gt;For your own projects, trust the workspace to enable full AI capabilities. For third-party codebases, consider the implications before trusting.&lt;/p&gt;
&lt;h2&gt;When to Use VS Code with Plugins vs. Dedicated AI Editors&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choose VS Code with plugins when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You already use VS Code and want to add AI incrementally&lt;/li&gt;
&lt;li&gt;You want to mix and match extensions from different providers&lt;/li&gt;
&lt;li&gt;You have existing VS Code extensions and workflows you cannot replicate elsewhere&lt;/li&gt;
&lt;li&gt;You need the specific capabilities of an extension that only exists for VS Code (like Cline)&lt;/li&gt;
&lt;li&gt;Your team uses different AI providers and needs a common editor&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Choose Cursor or Windsurf when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want the most seamlessly integrated AI experience&lt;/li&gt;
&lt;li&gt;You prefer automatic codebase indexing over manual context management&lt;/li&gt;
&lt;li&gt;You are starting fresh and do not have an existing VS Code extension stack&lt;/li&gt;
&lt;li&gt;You want features like .cursor/rules/ or Cascade flows that are deeply integrated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Choose a terminal agent (Claude Code, Gemini CLI) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your workflow is terminal-centric&lt;/li&gt;
&lt;li&gt;You need direct shell command execution as your primary interaction&lt;/li&gt;
&lt;li&gt;You prefer a focused, distraction-free coding experience&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Multi-Extension Stack&lt;/h3&gt;
&lt;p&gt;Use multiple extensions simultaneously for different purposes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Copilot&lt;/strong&gt; for inline completions (fast, low-friction)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continue&lt;/strong&gt; for chat with @codebase search (exploratory questions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cline&lt;/strong&gt; for agentic tasks (multi-file changes, complex features)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each extension handles a different level of context and interaction.&lt;/p&gt;
&lt;h3&gt;The Consistent Instructions Pattern&lt;/h3&gt;
&lt;p&gt;Maintain a single &lt;code&gt;INSTRUCTIONS.md&lt;/code&gt; file in your project root and reference it from each extension&apos;s configuration:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; imports or mirrors INSTRUCTIONS.md&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.continuerc.json&lt;/code&gt; references INSTRUCTIONS.md&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.clinerules&lt;/code&gt; mirrors the same conventions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This ensures consistent behavior regardless of which extension handles the task.&lt;/p&gt;
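&lt;p&gt;For instance, the Copilot instructions file can simply mirror or point to the shared document (a minimal sketch; whether a given extension follows file references automatically varies, so duplicating the key rules is the safer option):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Copilot Instructions

This project&apos;s conventions live in INSTRUCTIONS.md at the repository root.
Key rules, duplicated here for reliability:
- TypeScript strict mode, named exports only
- Vitest for tests, MSW for API mocks
&lt;/code&gt;&lt;/pre&gt;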
&lt;h3&gt;The Provider Rotation Pattern&lt;/h3&gt;
&lt;p&gt;Use different providers for different extensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Copilot: GitHub&apos;s infrastructure (fast, always available)&lt;/li&gt;
&lt;li&gt;Continue: Anthropic API (strong at code analysis)&lt;/li&gt;
&lt;li&gt;Cline: Local Ollama model (privacy for sensitive code)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This gives you the benefits of multiple providers within a single editor.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using too many AI extensions simultaneously.&lt;/strong&gt; Running five AI extensions creates conflicts, performance overhead, and conflicting suggestions. Pick a primary stack and disable the rest.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not configuring project instructions.&lt;/strong&gt; Every AI extension supports some form of project-level instructions. Without them, the AI relies on generic conventions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring @codebase search.&lt;/strong&gt; Both Copilot Chat and Continue offer codebase search. Using it produces more relevant responses than manually specifying files.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not keeping related tabs open.&lt;/strong&gt; Inline completion quality improves when related files are open in the editor. Keep type definitions, tests, and related source files in your tab bar.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Choosing the wrong extension for the task.&lt;/strong&gt; Inline completions for quick code, chat for questions, agent mode for complex changes. Match the tool to the task.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping documentation indexing.&lt;/strong&gt; If you are working with a framework, index its documentation so the AI references current, accurate information rather than potentially outdated training data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for T3 Chat: A Complete Guide to the Unified Multi-Model AI Interface</title><link>https://iceberglakehouse.com/posts/2026-03-context-t3-chat/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-t3-chat/</guid><description>
T3 Chat is a modern web-based AI chat interface that gives you access to multiple AI models through a single unified platform. Its primary value prop...</description><pubDate>Sun, 08 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;T3 Chat is a modern web-based AI chat interface that gives you access to multiple AI models through a single unified platform. Its primary value proposition is model flexibility: instead of being locked into one provider, you can switch between Claude, GPT, Gemini, Llama, and other models within the same interface. This makes T3 Chat unique from a context management perspective because the same context strategies must work across fundamentally different model families with different capabilities, context window sizes, and strengths.&lt;/p&gt;
&lt;p&gt;This guide covers how to manage context effectively in T3 Chat to get the most from its multi-model architecture, from conversation organization to system prompts and file handling.&lt;/p&gt;
&lt;h2&gt;How T3 Chat Manages Context&lt;/h2&gt;
&lt;p&gt;T3 Chat builds its context from several sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;System prompts&lt;/strong&gt; - persistent instructions that shape every response&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model selection&lt;/strong&gt; - the underlying model determines context window and capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation history&lt;/strong&gt; - the message thread within the current chat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File attachments&lt;/strong&gt; - documents and images uploaded to conversations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Personas&lt;/strong&gt; - saved configurations combining system prompts with preferred models&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Folders and organization&lt;/strong&gt; - conversation grouping for project-based workflows&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The context management challenge unique to T3 Chat is that different models interpret your context differently. A system prompt that works well with Claude may need adjustment for GPT or Gemini. Understanding these differences helps you write model-portable context.&lt;/p&gt;
&lt;h2&gt;System Prompts: The Foundation&lt;/h2&gt;
&lt;p&gt;T3 Chat supports custom system prompts that you set per-conversation or through Personas.&lt;/p&gt;
&lt;h3&gt;Writing Effective System Prompts&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;You are a senior software architect with expertise in distributed systems.

## Response Style
- Be technical and precise
- Include code examples when relevant
- Use bullet points for lists of recommendations
- Explain tradeoffs, do not just give the &amp;quot;right&amp;quot; answer

## Constraints
- Assume the reader has 5+ years of programming experience
- Do not explain basic concepts unless asked
- When discussing frameworks, focus on architectural implications, not syntax tutorials

## Output Format
- Use headers to organize long responses
- Include a &amp;quot;Key Takeaway&amp;quot; section at the end of detailed analyses
- Format code blocks with language annotations
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Model-Portable System Prompts&lt;/h3&gt;
&lt;p&gt;Because T3 Chat supports multiple models, write system prompts that work across model families:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Be explicit&lt;/strong&gt; about format expectations. Different models interpret vague formatting instructions differently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid model-specific references.&lt;/strong&gt; Do not write &amp;quot;As Claude, you should...&amp;quot; or &amp;quot;Using your GPT capabilities...&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Focus on behavior and output.&lt;/strong&gt; Describe what you want the model to do, not how you think it should reason internally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test across models.&lt;/strong&gt; Send the same prompt to Claude, GPT, and Gemini within T3 Chat to verify consistent behavior.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Personas: Reusable Context Configurations&lt;/h2&gt;
&lt;p&gt;Personas combine a system prompt with a preferred model selection into a reusable configuration. Think of them as &amp;quot;modes&amp;quot; you can switch between.&lt;/p&gt;
&lt;h3&gt;Creating Effective Personas&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Persona&lt;/th&gt;
&lt;th&gt;System Prompt Focus&lt;/th&gt;
&lt;th&gt;Model Choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Reviewer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Security, performance, style guide checks&lt;/td&gt;
&lt;td&gt;Claude Sonnet (strong at code analysis)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Technical Writer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Documentation standards, audience awareness&lt;/td&gt;
&lt;td&gt;GPT-4o (strong at prose)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Research Analyst&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Citation requirements, source evaluation&lt;/td&gt;
&lt;td&gt;Gemini Pro (strong at retrieval and synthesis)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Creative Brainstormer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Divergent thinking, idea generation&lt;/td&gt;
&lt;td&gt;Claude Opus or GPT-4o (creative capabilities)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;When to Create Personas&lt;/h3&gt;
&lt;p&gt;Create a Persona when you find yourself:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Repeating the same system prompt across conversations&lt;/li&gt;
&lt;li&gt;Switching to the same model for a specific type of task&lt;/li&gt;
&lt;li&gt;Wanting to standardize how the AI handles a particular workflow&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Personas save time and ensure consistency. Instead of re-configuring the system prompt and model for each new conversation, select the appropriate Persona and start working.&lt;/p&gt;
&lt;h2&gt;Model Selection as Context Management&lt;/h2&gt;
&lt;p&gt;Choosing the right model in T3 Chat is itself a context management decision because different models have different context window sizes and capabilities.&lt;/p&gt;
&lt;h3&gt;Context Window Comparison&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Approximate Context Window&lt;/th&gt;
&lt;th&gt;Strengths&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;Long context, code analysis, nuanced reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Opus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;Complex analysis, creative writing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;td&gt;Broad capabilities, strong at prose and instruction following&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI o3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;Deep reasoning, complex problem solving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini Pro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1M+ tokens&lt;/td&gt;
&lt;td&gt;Massive context, document analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Llama 3.1 (70B)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;td&gt;Open source, privacy-friendly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Model Selection Strategy&lt;/h3&gt;
&lt;p&gt;For T3 Chat users, the model selection strategy directly affects context management:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Long documents or many files:&lt;/strong&gt; Choose Gemini Pro (massive context window) or Claude (200K)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quick questions:&lt;/strong&gt; Choose a fast model (GPT-4o-mini, Claude Haiku) for responsiveness&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privacy-sensitive content:&lt;/strong&gt; Choose Llama through a local endpoint&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Complex analysis:&lt;/strong&gt; Choose Claude Opus or OpenAI o3 for deep reasoning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Being deliberate about model selection means your context is used more effectively by a model suited to the task.&lt;/p&gt;
&lt;h2&gt;Conversation Organization&lt;/h2&gt;
&lt;p&gt;T3 Chat provides tools for organizing your conversations into a structured workspace.&lt;/p&gt;
&lt;h3&gt;Folders&lt;/h3&gt;
&lt;p&gt;Group conversations by project, topic, or workflow. This is not just for tidiness; organized conversations make it easier to find and resume context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/projects/web-app/&lt;/code&gt; might contain conversations about frontend, backend, and deployment&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/research/market-analysis/&lt;/code&gt; might contain conversations about different market segments&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/writing/blog-series/&lt;/code&gt; might contain conversations for each blog post&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Pinned Conversations&lt;/h3&gt;
&lt;p&gt;Pin your most frequently referenced conversations for quick access so you can revisit them without searching.&lt;/p&gt;
&lt;h3&gt;Naming Conventions&lt;/h3&gt;
&lt;p&gt;Name conversations descriptively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;Auth module refactoring plan&amp;quot; is searchable and findable&lt;/li&gt;
&lt;li&gt;&amp;quot;New chat 47&amp;quot; is neither&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Good naming is a form of context management because it makes your accumulated knowledge retrievable.&lt;/p&gt;
&lt;h2&gt;File Attachments&lt;/h2&gt;
&lt;p&gt;T3 Chat supports file uploads for providing document-level context within conversations.&lt;/p&gt;
&lt;h3&gt;Supported File Types&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Documents:&lt;/strong&gt; PDFs, Markdown, plain text&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Images:&lt;/strong&gt; Screenshots, diagrams, mockups&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spreadsheets:&lt;/strong&gt; CSV, Excel files for data analysis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code files:&lt;/strong&gt; Source code in any language&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Best Practices for File Attachments&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Upload only the files relevant to the current question. Uploading your entire project creates noise.&lt;/li&gt;
&lt;li&gt;For long documents, tell the model which sections to focus on: &amp;quot;This is our API specification. Focus on the authentication endpoints in Section 3.&amp;quot;&lt;/li&gt;
&lt;li&gt;For images, provide a text description of what the model should look for: &amp;quot;This is a screenshot of our dashboard. The chart in the upper-right shows incorrect data.&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;T3 Chat can process PDFs uploaded as attachments. PDFs work well for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Formal documents (research papers, specifications, contracts)&lt;/li&gt;
&lt;li&gt;Published content with fixed formatting&lt;/li&gt;
&lt;li&gt;Multi-page documents with embedded images and tables&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Markdown&lt;/h3&gt;
&lt;p&gt;For context you author specifically for the AI (system prompts, reference documents, instructions), Markdown is cleaner:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Models parse Markdown more reliably than extracted PDF text&lt;/li&gt;
&lt;li&gt;Markdown is easier to version and update&lt;/li&gt;
&lt;li&gt;The structure (headings, lists, code blocks) is explicit, not inferred&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Practical Rule&lt;/h3&gt;
&lt;p&gt;If the document exists as a PDF and you cannot easily convert it, upload the PDF. If you are writing the document for the purpose of giving it to the AI, write it in Markdown.&lt;/p&gt;
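&lt;p&gt;For example, a reference document written specifically for the AI might look like this (the project details are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project Reference: Payments API

## Stack
- Node.js 20, Fastify, PostgreSQL 16

## Conventions
- All monetary amounts are integer cents, never floats
- Errors are returned as { &amp;quot;error&amp;quot;: { &amp;quot;code&amp;quot;, &amp;quot;message&amp;quot; } }

## Out of Scope
- Do not suggest changes to the legacy billing gateway
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the headings and lists are explicit, any model can parse the structure without the ambiguity of extracted PDF text.&lt;/p&gt;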
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;T3 Chat supports MCP (Model Context Protocol) server connections, allowing the platform to integrate with external data sources and tools. This extends T3 Chat&apos;s capabilities beyond conversation and file uploads by enabling connections to services like Google Drive, Slack, GitHub, databases, and custom APIs.&lt;/p&gt;
&lt;h3&gt;How MCP Works in T3 Chat&lt;/h3&gt;
&lt;p&gt;MCP servers provide T3 Chat with access to external resources and tools. When configured, the AI can query external data sources, retrieve real-time information, and perform actions through connected services. This makes T3 Chat more than just a chatbot: it becomes an interface for interacting with your broader tool ecosystem.&lt;/p&gt;
&lt;h3&gt;When MCP Adds Value&lt;/h3&gt;
&lt;p&gt;MCP is most useful in T3 Chat when your conversations need live data access:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Querying a database while discussing architecture decisions&lt;/li&gt;
&lt;li&gt;Accessing project management data during planning conversations&lt;/li&gt;
&lt;li&gt;Retrieving documentation from connected services&lt;/li&gt;
&lt;li&gt;Interacting with APIs through a conversational interface&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For conversations that rely purely on the AI&apos;s training data or uploaded files, MCP is unnecessary. It adds the most value when you need real-time connections to external systems during your conversations.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels in T3 Chat&lt;/h2&gt;
&lt;h3&gt;Quick Questions (Minimal Context)&lt;/h3&gt;
&lt;p&gt;For factual or conceptual questions, just ask. No special setup needed:&lt;/p&gt;
&lt;p&gt;&amp;quot;What is the difference between horizontal and vertical scaling in database architecture?&amp;quot;&lt;/p&gt;
&lt;p&gt;The model&apos;s training data is sufficient context, and no files or custom prompts are required.&lt;/p&gt;
&lt;h3&gt;Working Sessions (Moderate Context)&lt;/h3&gt;
&lt;p&gt;For sustained work on a topic, create a conversation with an appropriate Persona and provide reference files:&lt;/p&gt;
&lt;p&gt;&amp;quot;I am building a REST API for a healthcare application. Here is the data model [attach file]. Help me design the endpoints following HIPAA compliance patterns.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Complex Projects (Comprehensive Context)&lt;/h3&gt;
&lt;p&gt;For multi-day projects, create a folder of organized conversations, use Personas for different phases of work, and bridge context between conversations using explicit summaries.&lt;/p&gt;
&lt;h2&gt;Model-Specific Context Tuning&lt;/h2&gt;
&lt;p&gt;Each model family responds slightly differently to the same context. Here are practical tips for tuning:&lt;/p&gt;
&lt;h3&gt;Claude in T3 Chat&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Responds well to role-based system prompts (&amp;quot;You are a...&amp;quot;)&lt;/li&gt;
&lt;li&gt;Handles very long contexts gracefully&lt;/li&gt;
&lt;li&gt;Benefits from explicit format instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;GPT Models in T3 Chat&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Follows formatting instructions precisely&lt;/li&gt;
&lt;li&gt;Works well with example-based prompts (&amp;quot;Here is an example of what I want: ...&amp;quot;)&lt;/li&gt;
&lt;li&gt;Benefits from numbered constraints&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Gemini in T3 Chat&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Excels with document analysis tasks&lt;/li&gt;
&lt;li&gt;Handles massive context windows (1M+ tokens)&lt;/li&gt;
&lt;li&gt;Benefits from clear section headers in system prompts&lt;/li&gt;
&lt;/ul&gt;
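&lt;p&gt;For example, a Gemini-friendly system prompt with clear section headers might look like this (the role and task are placeholders to adapt):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Role
You are a technical analyst who reviews long specification documents.

# Task
Summarize each uploaded document and flag inconsistencies between them.

# Output Format
- One summary paragraph per document
- A bulleted list of inconsistencies, each with a severity (low/medium/high)
&lt;/code&gt;&lt;/pre&gt;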
&lt;h3&gt;When to Use T3 Chat vs. Other Tools&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use T3 Chat when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want to compare responses across different models&lt;/li&gt;
&lt;li&gt;You need flexible model selection without multiple subscriptions&lt;/li&gt;
&lt;li&gt;Your task is conversational (research, analysis, writing, brainstorming)&lt;/li&gt;
&lt;li&gt;You want Personas for reusable configurations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use a coding IDE (Cursor, Windsurf, Zed) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your task involves editing code files directly&lt;/li&gt;
&lt;li&gt;You need workspace indexing and @codebase search&lt;/li&gt;
&lt;li&gt;You want agent mode to make cross-file changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use a terminal agent (Claude Code, Gemini CLI) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need direct terminal access and command execution&lt;/li&gt;
&lt;li&gt;Your task involves running tests, builds, or deployments&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Model Comparison Pattern&lt;/h3&gt;
&lt;p&gt;Use T3 Chat&apos;s multi-model support to compare responses:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Ask the same question to Claude, GPT, and Gemini&lt;/li&gt;
&lt;li&gt;Compare the responses for accuracy, depth, and style&lt;/li&gt;
&lt;li&gt;Use the best response as a starting point and refine it&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is especially useful for high-stakes content where you want multiple perspectives before finalizing.&lt;/p&gt;
&lt;h3&gt;The Persona Pipeline Pattern&lt;/h3&gt;
&lt;p&gt;Chain Personas for multi-step work:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Research Persona&lt;/strong&gt; (Gemini): Gather information and sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Analysis Persona&lt;/strong&gt; (Claude): Analyze the research and identify key themes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing Persona&lt;/strong&gt; (GPT): Draft the final output based on the analysis&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step uses a model optimized for that type of work, with context transferred manually between conversations.&lt;/p&gt;
&lt;h3&gt;The Context Bridging Pattern&lt;/h3&gt;
&lt;p&gt;When switching between models in the same conversation, bridge the context explicitly:&lt;/p&gt;
&lt;p&gt;&amp;quot;Here is a summary of what we discussed so far: [summary]. I am switching to a different model. Please continue from this point.&amp;quot;&lt;/p&gt;
&lt;p&gt;This helps the new model pick up the thread without losing continuity.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Personas for repeatable work.&lt;/strong&gt; If you are configuring the same system prompt and model combination repeatedly, create a Persona.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring model differences.&lt;/strong&gt; Claude, GPT, and Gemini respond differently to the same prompt. If results are not meeting expectations, try a different model before rewriting the prompt.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Uploading too many files.&lt;/strong&gt; Each file consumes context window space. Be selective and upload only what is relevant to the current question.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not organizing conversations.&lt;/strong&gt; Without folders and descriptive names, your accumulated research and context becomes unfindable as the number of conversations grows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using the same model for everything.&lt;/strong&gt; T3 Chat&apos;s strength is model flexibility. Use Gemini for massive documents, Claude for code analysis, and GPT for prose generation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Writing model-specific system prompts.&lt;/strong&gt; If your system prompt only works with one model, it is too model-specific. Write instructions that describe behavior and output, not internal reasoning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI interfaces and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Zed: A Complete Guide to the High-Performance AI Code Editor</title><link>https://iceberglakehouse.com/posts/2026-03-context-zed/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-zed/</guid><description>
Zed is a high-performance code editor built in Rust that prioritizes speed, simplicity, and real-time collaboration. Its AI integration is designed t...</description><pubDate>Sat, 07 Mar 2026 23:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Zed is a high-performance code editor built in Rust that prioritizes speed, simplicity, and real-time collaboration. Its AI integration is designed to be fast and unobtrusive, with context management built around an assistant panel, inline transformations, slash commands, and a flexible provider system that supports multiple AI services. What sets Zed apart from other AI editors is its focus on performance (everything runs natively, not in Electron) and its built-in multiplayer editing that extends to AI interactions.&lt;/p&gt;
&lt;p&gt;This guide covers how to manage context effectively in Zed&apos;s AI features to get the most from its lightweight but capable AI integration.&lt;/p&gt;
&lt;h2&gt;How Zed Manages Context&lt;/h2&gt;
&lt;p&gt;Zed builds AI context from several sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Assistant panel&lt;/strong&gt; - a dedicated panel for multi-turn conversations with persistent context threads&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inline transformations&lt;/strong&gt; - context-aware edits triggered in the editor&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slash commands&lt;/strong&gt; - special commands that inject structured context into prompts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Active buffers&lt;/strong&gt; - files currently open in the editor&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project structure&lt;/strong&gt; - the workspace file tree&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom prompts library&lt;/strong&gt; - saved, reusable prompt templates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Language server data&lt;/strong&gt; - type information and diagnostics from LSPs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP servers&lt;/strong&gt; - external tool connections (supported in recent versions)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Zed takes a minimalist approach to context management: rather than automatically indexing your entire codebase (like Cursor or Windsurf), it gives you explicit control over what goes into context through slash commands and file references.&lt;/p&gt;
&lt;h2&gt;The Assistant Panel: Structured Conversations&lt;/h2&gt;
&lt;p&gt;Zed&apos;s Assistant Panel is the primary interface for AI interactions that require context beyond the current file. It operates as a structured conversation where you build context explicitly.&lt;/p&gt;
&lt;h3&gt;How the Panel Works&lt;/h3&gt;
&lt;p&gt;The panel displays a conversation thread where each message can include code blocks, file references, and slash command outputs. You compose messages, include context, and receive AI responses in a single, reviewable flow.&lt;/p&gt;
&lt;h3&gt;Persistent Context Threads&lt;/h3&gt;
&lt;p&gt;Each conversation in the panel is a persistent thread. You can name threads, save them, and return to them later. This means you can maintain ongoing conversations about specific features or architectural decisions without losing context between sessions.&lt;/p&gt;
&lt;h3&gt;Including Code from Open Buffers&lt;/h3&gt;
&lt;p&gt;You can drag files or code selections into the assistant panel to include them as context. This explicit inclusion model means you always know exactly what context the AI is working with, unlike tools that silently assemble context behind the scenes.&lt;/p&gt;
&lt;h2&gt;Zed&apos;s Explicit Context Philosophy&lt;/h2&gt;
&lt;p&gt;Zed&apos;s approach to context management is fundamentally different from editors like Cursor or Windsurf that automatically index and retrieve context. In Zed, you explicitly choose what context to provide through slash commands and file inclusions. This has important tradeoffs:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Advantages of explicit context:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You always know what the AI is working with&lt;/li&gt;
&lt;li&gt;No surprises from irrelevant code being included&lt;/li&gt;
&lt;li&gt;Works well with smaller model context windows (no wasted tokens)&lt;/li&gt;
&lt;li&gt;Context is reproducible: the same slash commands always produce the same context&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoffs:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires more manual effort to set up context&lt;/li&gt;
&lt;li&gt;You need to know which files are relevant before asking&lt;/li&gt;
&lt;li&gt;The AI cannot discover related code on its own (unlike @codebase in other editors)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Understanding this philosophy helps you use Zed&apos;s AI features effectively: invest time in selecting the right context rather than expecting the editor to figure it out for you.&lt;/p&gt;
&lt;h2&gt;Real-Time Collaboration and AI&lt;/h2&gt;
&lt;p&gt;Zed&apos;s built-in multiplayer editing extends to AI interactions. When collaborating in a shared workspace:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple developers can contribute to the same assistant panel conversation&lt;/li&gt;
&lt;li&gt;One developer can set up the context while another frames the question&lt;/li&gt;
&lt;li&gt;AI suggestions can be reviewed and discussed collaboratively in real time&lt;/li&gt;
&lt;li&gt;The AI&apos;s output is visible to all participants simultaneously&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This makes Zed uniquely suited for pair programming and team code review workflows that incorporate AI assistance.&lt;/p&gt;
&lt;h2&gt;Slash Commands: Explicit Context Injection&lt;/h2&gt;
&lt;p&gt;Slash commands are Zed&apos;s primary mechanism for injecting specific types of context into AI conversations.&lt;/p&gt;
&lt;h3&gt;Available Slash Commands&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/file [path]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include a specific file&apos;s content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/tab&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include all currently open tabs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/diagnostics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include current LSP errors and warnings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/search [query]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search the project and include results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/prompt [name]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load a saved prompt template&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/now&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include the current date and time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/fetch [url]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fetch and include content from a URL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Using Slash Commands Effectively&lt;/h3&gt;
&lt;p&gt;The power of slash commands is precision. Instead of sending your entire codebase as context, you choose exactly which files and information are relevant:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/file src/auth/middleware.ts
/file src/auth/types.ts
/diagnostics

I need to fix the TypeScript errors in the auth middleware.
The types file defines the expected interfaces.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This focused approach produces better results than sending the AI a vague question against a massive context window. Each piece of context is intentional and relevant.&lt;/p&gt;
&lt;h3&gt;/diagnostics for Error-Driven Context&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;/diagnostics&lt;/code&gt; command is particularly powerful because it pulls language server errors and warnings directly into the AI conversation. Instead of manually copying error messages, one command gives the AI structured diagnostic information.&lt;/p&gt;
&lt;h3&gt;/fetch for External Documentation&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;/fetch&lt;/code&gt; command retrieves content from URLs, making it easy to include external documentation, API specifications, or reference material without manual copying:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/fetch https://docs.myframework.com/api/routing

How do I implement nested routing using this framework&apos;s API?
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Custom Prompts Library&lt;/h2&gt;
&lt;p&gt;Zed maintains a library of saved prompts that you can reuse across conversations and projects.&lt;/p&gt;
&lt;h3&gt;Creating Custom Prompts&lt;/h3&gt;
&lt;p&gt;Navigate to the prompts library and create templates for common tasks:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Code Review Template

Review the provided code for:
1. Security vulnerabilities (injection, XSS, CSRF)
2. Performance issues (N+1 queries, unnecessary allocations)
3. Error handling completeness
4. Type safety issues
5. Missing edge cases

For each issue found:
- Describe the problem
- Explain the risk
- Provide a fix
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Prompts&lt;/h3&gt;
&lt;p&gt;Load a saved prompt with the &lt;code&gt;/prompt&lt;/code&gt; slash command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/prompt code-review
/file src/api/users.ts
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This combines your predefined review criteria with the specific file, creating a structured, repeatable workflow.&lt;/p&gt;
&lt;h3&gt;When to Create Prompts&lt;/h3&gt;
&lt;p&gt;Create prompts for tasks you perform regularly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Code reviews with consistent criteria&lt;/li&gt;
&lt;li&gt;Documentation generation in a specific format&lt;/li&gt;
&lt;li&gt;Refactoring with specific patterns (extract function, apply interface)&lt;/li&gt;
&lt;li&gt;Test generation following your testing conventions&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;AI Provider Configuration&lt;/h2&gt;
&lt;p&gt;Zed supports multiple AI providers, giving you flexibility in model selection:&lt;/p&gt;
&lt;h3&gt;Supported Providers&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key in settings&lt;/td&gt;
&lt;td&gt;Claude models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key in settings&lt;/td&gt;
&lt;td&gt;GPT models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Local endpoint&lt;/td&gt;
&lt;td&gt;Private, local models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key in settings&lt;/td&gt;
&lt;td&gt;Gemini models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API key in settings&lt;/td&gt;
&lt;td&gt;Multi-provider routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any OpenAI-compatible endpoint&lt;/td&gt;
&lt;td&gt;Self-hosted models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Context Window Implications&lt;/h3&gt;
&lt;p&gt;Different providers offer different context window sizes. With Zed&apos;s explicit context management (where you choose what to include via slash commands), you have good visibility into how much context you are using. If you are working with a smaller model through Ollama, be more selective with your slash commands. With a large cloud model, you can include more files.&lt;/p&gt;
&lt;h3&gt;Configuring in settings.json&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;language_model&amp;quot;: {
    &amp;quot;provider&amp;quot;: &amp;quot;anthropic&amp;quot;,
    &amp;quot;model&amp;quot;: &amp;quot;claude-sonnet-4-20250514&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
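&lt;p&gt;For a local model through Ollama, the same pattern applies. A sketch (the model tag is an example, and the exact schema may vary between Zed versions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;language_model&amp;quot;: {
    &amp;quot;provider&amp;quot;: &amp;quot;ollama&amp;quot;,
    &amp;quot;model&amp;quot;: &amp;quot;llama3.1:70b&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;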
&lt;h2&gt;Inline Transformations&lt;/h2&gt;
&lt;p&gt;For quick edits that do not require a full conversation, Zed&apos;s inline transformation feature lets you select code and apply AI-powered changes directly in the editor.&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Select code in the editor&lt;/li&gt;
&lt;li&gt;Trigger the inline transformation (keyboard shortcut)&lt;/li&gt;
&lt;li&gt;Type your instruction (&amp;quot;Add error handling&amp;quot; or &amp;quot;Convert to async/await&amp;quot;)&lt;/li&gt;
&lt;li&gt;Zed applies the change inline&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Context for Inline Transformations&lt;/h3&gt;
&lt;p&gt;Inline transformations use a focused context: the current file, the selection, and your instruction. They do not load your custom prompts or conversation history. This makes them fast and appropriate for small, self-contained changes.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Recent versions of Zed support MCP for connecting to external tools. The implementation follows the standard MCP pattern: configure servers in settings, and their tools become available within the assistant panel.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;context_servers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-postgres&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP in Zed&lt;/h3&gt;
&lt;p&gt;MCP is most useful when the assistant needs live data (database schemas, API responses, running service status) that cannot be obtained from static files. For code-only tasks, the slash commands and file references are sufficient.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown for Everything You Control&lt;/h3&gt;
&lt;p&gt;Prompts, reference documents, and coding standards should be Markdown. Zed&apos;s prompt library and slash commands work natively with text-based formats.&lt;/p&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;Zed does not have built-in PDF parsing. For reference material in PDF form, extract relevant sections into Markdown files in your project and reference them with &lt;code&gt;/file&lt;/code&gt;. Alternatively, use &lt;code&gt;/fetch&lt;/code&gt; if the content is available online.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels in Zed&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Inline Edits)&lt;/h3&gt;
&lt;p&gt;Select code, trigger inline transformation, describe the change. The current file and selection provide sufficient context for small changes.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Feature Work)&lt;/h3&gt;
&lt;p&gt;Use the assistant panel with targeted slash commands: &lt;code&gt;/file&lt;/code&gt; for relevant files, &lt;code&gt;/diagnostics&lt;/code&gt; for current errors, &lt;code&gt;/prompt&lt;/code&gt; for your coding standards.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Architecture)&lt;/h3&gt;
&lt;p&gt;Include multiple files via &lt;code&gt;/file&lt;/code&gt; or &lt;code&gt;/tab&lt;/code&gt;, load architecture documentation via &lt;code&gt;/fetch&lt;/code&gt;, and load your team&apos;s conventions via &lt;code&gt;/prompt&lt;/code&gt;. Build the context explicitly and review it before asking complex questions.&lt;/p&gt;
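&lt;p&gt;A comprehensive-context session might be assembled like this (the file paths, prompt name, and URL are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/prompt team-conventions
/file src/services/orders.ts
/file src/services/inventory.ts
/fetch https://docs.example.com/architecture/event-bus

Should order fulfillment publish events directly, or route them through
the inventory service? Explain the tradeoffs given our event bus design.
&lt;/code&gt;&lt;/pre&gt;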
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Multi-File Context Pattern&lt;/h3&gt;
&lt;p&gt;For changes that span multiple files:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/file src/models/user.ts
/file src/services/userService.ts  
/file src/routes/users.ts
/file tests/services/userService.test.ts

Add a &amp;quot;preferences&amp;quot; field to the User model and propagate it through the service layer, API routes, and tests.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The Diagnostic-Driven Fix Pattern&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Run your build or test suite&lt;/li&gt;
&lt;li&gt;Open the assistant panel&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/diagnostics&lt;/code&gt; to load all current errors&lt;/li&gt;
&lt;li&gt;Ask the AI to fix the errors systematically&lt;/li&gt;
&lt;/ol&gt;
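&lt;p&gt;A typical panel message for this pattern (the file path is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/diagnostics
/file src/auth/middleware.ts

Fix these errors one at a time, starting with the type errors.
Explain each fix before showing the updated code.
&lt;/code&gt;&lt;/pre&gt;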
&lt;h3&gt;The Collaborative AI Pattern&lt;/h3&gt;
&lt;p&gt;Zed&apos;s multiplayer features mean multiple developers can collaborate in real time while using AI. One developer can set up the context (load files, configure the prompt) while another reviews the AI&apos;s output. This collaborative workflow is unique to Zed and makes it particularly effective for pair programming with AI assistance.&lt;/p&gt;
&lt;h3&gt;The Speed-Focused Workflow&lt;/h3&gt;
&lt;p&gt;For developers who prioritize responsiveness:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use Ollama with a fast local model for inline transformations&lt;/li&gt;
&lt;li&gt;Use a cloud model for assistant panel conversations that need more capability&lt;/li&gt;
&lt;li&gt;Keep assistant conversations focused and short&lt;/li&gt;
&lt;li&gt;Use inline transformations for most edits, reserving the panel for complex tasks&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-including context.&lt;/strong&gt; Zed gives you explicit control over context. Use it wisely. Including every file in your project via &lt;code&gt;/tab&lt;/code&gt; when only two files are relevant dilutes the AI&apos;s focus.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using saved prompts.&lt;/strong&gt; If you repeat the same instructions across conversations, save them as prompts. One &lt;code&gt;/prompt code-review&lt;/code&gt; is better than retyping your review criteria every time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring /diagnostics.&lt;/strong&gt; This command provides structured error context that is faster and more accurate than manually pasting error messages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using the assistant panel for simple edits.&lt;/strong&gt; Inline transformations are faster and require less context setup. Use the panel for complex, multi-file work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not exploring provider options.&lt;/strong&gt; If response quality is not meeting expectations, try a different model. Zed&apos;s multi-provider support makes switching easy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Forgetting /fetch for documentation.&lt;/strong&gt; External docs can be pulled directly into context without leaving the editor. This is faster and more reliable than manually copying content.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Windsurf: A Complete Guide to the AI Flow IDE</title><link>https://iceberglakehouse.com/posts/2026-03-context-windsurf/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-windsurf/</guid><description>
Windsurf is an AI-powered IDE built on the VS Code foundation that introduces the concept of &quot;Flows,&quot; a paradigm where the AI maintains deep awarenes...</description><pubDate>Sat, 07 Mar 2026 22:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Windsurf is an AI-powered IDE built on the VS Code foundation that introduces the concept of &amp;quot;Flows,&amp;quot; a paradigm where the AI maintains deep awareness of your actions, codebase, and development patterns over time. What differentiates its context management from other editors is Cascade (its agentic coding assistant), persistent Rules files, Memories, and a sophisticated context engine that tracks not just which files you are editing, but how you work.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism in Windsurf and explains how to configure them for the most productive development experience.&lt;/p&gt;
&lt;h2&gt;How Windsurf Manages Context&lt;/h2&gt;
&lt;p&gt;Windsurf assembles context through multiple layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Cascade context engine&lt;/strong&gt; - tracks your edits, terminal commands, and navigation patterns in real time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rules files&lt;/strong&gt; - project and global instructions that shape AI behavior&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memories&lt;/strong&gt; - persistent facts that carry across sessions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workspace index&lt;/strong&gt; - semantic index of your codebase&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation context&lt;/strong&gt; - the current chat session in Cascade&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Active editor state&lt;/strong&gt; - the file you are editing, your cursor position, selected text&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server connections&lt;/strong&gt; - external tools and data sources&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &amp;quot;Flows&amp;quot; concept means Windsurf&apos;s AI is not just responding to individual prompts. It maintains a continuous understanding of what you are doing, which enables more relevant suggestions and fewer context-setting instructions from you.&lt;/p&gt;
&lt;h2&gt;Rules Files: Persistent Project Instructions&lt;/h2&gt;
&lt;p&gt;Windsurf uses Rules files to define project-level and global instructions for the AI.&lt;/p&gt;
&lt;h3&gt;Global Rules&lt;/h3&gt;
&lt;p&gt;Set in Windsurf Settings under AI &amp;gt; Rules, global rules apply across all projects:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Global Rules

## My Preferences
- Always use TypeScript over JavaScript
- Prefer functional programming patterns
- Use descriptive variable names (no single-letter variables except in loops)
- Add JSDoc comments to all exported functions

## Communication Style
- Be direct and concise
- Show code changes as diffs when possible
- Explain non-obvious design decisions
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Project Rules (Workspace)&lt;/h3&gt;
&lt;p&gt;Create project-level rules in your workspace:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.windsurfrules&lt;/code&gt; file in the project root&lt;/li&gt;
&lt;li&gt;Or &lt;code&gt;.windsurf/rules/&lt;/code&gt; directory with multiple rule files&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project: E-Commerce Platform

## Stack
- Next.js 15 with App Router
- TypeScript 5.6
- PostgreSQL with Prisma ORM
- Tailwind CSS 4
- Vitest for testing

## Architecture
- app/ contains page routes and layouts
- lib/ contains shared utilities and API clients
- components/ contains UI components (Atomic Design: atoms, molecules, organisms)
- prisma/ contains schema and migrations

## Conventions
- Server Components by default, Client Components only when necessary
- Use Zod for all input validation
- API routes use the route handler pattern with error boundaries
- All database queries go through Prisma transactions for writes

## Testing
- Every new component needs a unit test
- API routes need integration tests with a test database
- Use MSW for mocking external API calls
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rules Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Put rules files in version control so the entire team follows the same conventions&lt;/li&gt;
&lt;li&gt;Keep rules actionable and specific, not aspirational&lt;/li&gt;
&lt;li&gt;Include negative constraints (&amp;quot;Do not use inline styles&amp;quot;)&lt;/li&gt;
&lt;li&gt;Update rules when you change frameworks, libraries, or conventions&lt;/li&gt;
&lt;li&gt;Separate global preferences from project-specific rules&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Memories: Persistent Knowledge&lt;/h2&gt;
&lt;p&gt;Windsurf&apos;s Memory system stores facts that persist across conversations and sessions. Memories can be created automatically (when the AI identifies important information during a conversation) or manually.&lt;/p&gt;
&lt;h3&gt;How Memories Work&lt;/h3&gt;
&lt;p&gt;When you share something important in a conversation (&amp;quot;We decided to switch from REST to GraphQL for the new API&amp;quot;), Windsurf can save this as a Memory. In future sessions, the AI loads relevant Memories to maintain continuity.&lt;/p&gt;
&lt;h3&gt;Managing Memories&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;View all Memories in Windsurf Settings&lt;/li&gt;
&lt;li&gt;Delete outdated Memories that no longer apply&lt;/li&gt;
&lt;li&gt;Manually add Memories for important decisions the AI should always remember&lt;/li&gt;
&lt;li&gt;Review periodically to keep the memory store accurate&lt;/li&gt;
&lt;/ul&gt;
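&lt;p&gt;A manually added Memory works best as a single, dated, self-contained fact. A hypothetical example, building on the GraphQL decision above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;Decision (2026-03): New API endpoints use GraphQL. The existing REST
endpoints under /api/v1 are frozen; do not add new REST routes.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Phrasing Memories this way makes them easy to audit later: the date tells you when the fact was recorded, and the second sentence gives the AI an actionable constraint rather than just trivia.&lt;/p&gt;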
&lt;h3&gt;Memories vs. Rules&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Rules&lt;/th&gt;
&lt;th&gt;Memories&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Creation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You write them explicitly&lt;/td&gt;
&lt;td&gt;Created during conversations or manually&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global or project-level&lt;/td&gt;
&lt;td&gt;Cross-project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Define conventions and constraints&lt;/td&gt;
&lt;td&gt;Store facts and decisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Update frequency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When conventions change&lt;/td&gt;
&lt;td&gt;As new decisions are made&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Use Rules for standards and conventions. Use Memories for facts and decisions.&lt;/p&gt;
&lt;h2&gt;Cascade: The Agentic AI Assistant&lt;/h2&gt;
&lt;p&gt;Cascade is Windsurf&apos;s agentic coding assistant. It operates in two modes with different context management implications:&lt;/p&gt;
&lt;h3&gt;Chat Mode&lt;/h3&gt;
&lt;p&gt;Standard conversational interaction where you ask questions and receive answers. Context includes the active file, conversation history, and any files you reference.&lt;/p&gt;
&lt;h3&gt;Agent Mode&lt;/h3&gt;
&lt;p&gt;Autonomous mode where Cascade plans and executes multi-step tasks. In Agent Mode, Cascade:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reads and writes files across your project&lt;/li&gt;
&lt;li&gt;Runs terminal commands&lt;/li&gt;
&lt;li&gt;Navigates and explores the codebase&lt;/li&gt;
&lt;li&gt;Creates and executes multi-file changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agent Mode benefits from more comprehensive context (Rules, Memories, workspace index) because it operates autonomously without constant guidance.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Windsurf supports MCP for connecting to external tools and data sources.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;Configure MCP servers through Windsurf Settings or in a configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;database&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;],
      &amp;quot;env&amp;quot;: {
        &amp;quot;DATABASE_URL&amp;quot;: &amp;quot;postgresql://dev@localhost:5432/mydb&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;Use MCP in Windsurf for the same scenarios as other IDE-based tools: database queries, GitHub integration, API testing, and browser automation. The integration is seamless because MCP tools become available within Cascade&apos;s Agent Mode.&lt;/p&gt;
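&lt;p&gt;Multiple servers can be registered side by side in the same configuration file. As a sketch (the GitHub server package name and token variable are illustrative; check each server&apos;s documentation for its actual name and required environment variables):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;database&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@anthropic/mcp-server-postgres&amp;quot;],
      &amp;quot;env&amp;quot;: { &amp;quot;DATABASE_URL&amp;quot;: &amp;quot;postgresql://dev@localhost:5432/mydb&amp;quot; }
    },
    &amp;quot;github&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-github&amp;quot;],
      &amp;quot;env&amp;quot;: { &amp;quot;GITHUB_TOKEN&amp;quot;: &amp;quot;ghp_your_token_here&amp;quot; }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With both servers registered, Cascade can, for example, inspect a database schema and open a related GitHub issue in the same agentic task.&lt;/p&gt;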
&lt;h2&gt;Model Selection and Context Configuration&lt;/h2&gt;
&lt;p&gt;Windsurf supports multiple AI providers and models. Your model choice affects context management because different models handle different context window sizes and reasoning capabilities.&lt;/p&gt;
&lt;h3&gt;Configuring the AI Provider&lt;/h3&gt;
&lt;p&gt;In Windsurf Settings, you can select from multiple providers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Windsurf&apos;s own models&lt;/strong&gt; (optimized for the Windsurf context system)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anthropic&lt;/strong&gt; (Claude Sonnet, Opus)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenAI&lt;/strong&gt; (GPT-4o, o3)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom endpoints&lt;/strong&gt; (any OpenAI-compatible API)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For complex refactoring that touches many files, choose a model with a larger context window. For quick completions and small edits, a faster model with a smaller window is more responsive.&lt;/p&gt;
&lt;h3&gt;Tab Completion Context&lt;/h3&gt;
&lt;p&gt;Windsurf&apos;s Tab completion (inline autocomplete) uses a separate context pipeline from Cascade. The completion context includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The current file content&lt;/li&gt;
&lt;li&gt;Recently edited files&lt;/li&gt;
&lt;li&gt;Import statements and type definitions&lt;/li&gt;
&lt;li&gt;Patterns from your codebase&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Understanding this separation matters because Tab completions are optimized for speed (low latency) while Cascade chat is optimized for depth (comprehensive reasoning). The context for each is assembled differently to match their respective use cases.&lt;/p&gt;
&lt;h2&gt;How Windsurf Assembles Context&lt;/h2&gt;
&lt;p&gt;When you interact with Cascade, Windsurf assembles context through this pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Load Rules&lt;/strong&gt;: Global rules first, then project rules from .windsurfrules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Load Memories&lt;/strong&gt;: Retrieve relevant persistent facts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Include active editor state&lt;/strong&gt;: Current file, cursor position, selection&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Process @-commands&lt;/strong&gt;: Add referenced files, codebase search results, web results&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add flow context&lt;/strong&gt;: Recent edits, terminal output, navigation patterns&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apply model constraints&lt;/strong&gt;: Trim to fit within the model&apos;s context window&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This pipeline runs automatically for every interaction. The more you invest in Rules and Memories, the more relevant the automatically assembled context becomes.&lt;/p&gt;
&lt;h2&gt;Onboarding a New Project to Windsurf&lt;/h2&gt;
&lt;p&gt;Here is a step-by-step process for setting up effective context management on a new project:&lt;/p&gt;
&lt;h3&gt;Day 1: Foundation&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Open the project in Windsurf and let the workspace indexing complete&lt;/li&gt;
&lt;li&gt;Create a &lt;code&gt;.windsurfrules&lt;/code&gt; file with your stack, architecture, and conventions&lt;/li&gt;
&lt;li&gt;Make a few small changes to verify Windsurf follows your conventions&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Day 2: Refinement&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Review what Memories Windsurf created from Day 1&lt;/li&gt;
&lt;li&gt;Add any important project facts as manual Memories&lt;/li&gt;
&lt;li&gt;Adjust Rules based on how Cascade behaved on Day 1&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Week 2: Advanced Setup&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Connect relevant MCP servers (database, GitHub)&lt;/li&gt;
&lt;li&gt;Index external documentation for @docs references&lt;/li&gt;
&lt;li&gt;Start using Agent Mode for multi-file changes&lt;/li&gt;
&lt;li&gt;Create directory-specific rules if different modules have different conventions&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Thinking About Context Levels&lt;/h2&gt;
&lt;h3&gt;Quick Edits (Minimal Context)&lt;/h3&gt;
&lt;p&gt;Use inline editing (Cmd+K / Ctrl+K) for small changes. Windsurf uses the current file and selection, plus applicable Rules, to generate edits. No additional context needed.&lt;/p&gt;
&lt;h3&gt;Feature Development (Moderate Context)&lt;/h3&gt;
&lt;p&gt;Use Cascade chat with explicit file references. The workspace index, Rules, and Memories combine to give Cascade project-aware responses.&lt;/p&gt;
&lt;h3&gt;Complex Architecture Work (Comprehensive Context)&lt;/h3&gt;
&lt;p&gt;Use Agent Mode with well-configured Rules, active Memories, and MCP connections. Let Cascade explore the codebase, run commands, and make changes across multiple files.&lt;/p&gt;
&lt;h2&gt;@ Commands for Context Injection&lt;/h2&gt;
&lt;p&gt;Windsurf supports @-commands similar to Cursor for injecting specific context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;@file&lt;/strong&gt; - Reference a specific file&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@codebase&lt;/strong&gt; - Search the indexed codebase&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@web&lt;/strong&gt; - Search the web for current information&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@docs&lt;/strong&gt; - Reference indexed documentation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@terminal&lt;/strong&gt; - Include terminal output context&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These commands give you fine-grained control over what context Cascade receives for each prompt.&lt;/p&gt;
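&lt;p&gt;The commands compose naturally within a single prompt. A hypothetical example (the file path is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;Using @file lib/api-client.ts and @codebase error handling,
refactor the client to retry failed requests with backoff.
Check @terminal for the failing test output before you start.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here @file pins the target, @codebase pulls in existing error-handling patterns so the refactor stays consistent, and @terminal grounds the work in the actual failure.&lt;/p&gt;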
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown for Rules&lt;/h3&gt;
&lt;p&gt;All Rules files and project documentation should be Markdown. It is the native format for Windsurf&apos;s context system.&lt;/p&gt;
&lt;h3&gt;For Reference Material&lt;/h3&gt;
&lt;p&gt;For external specifications in PDF form, convert key sections to Markdown and include them in your project as reference documents. This makes them discoverable through @codebase searches.&lt;/p&gt;
&lt;h3&gt;Documentation Indexing&lt;/h3&gt;
&lt;p&gt;Like Cursor, Windsurf can index external documentation. Add framework and library docs to the indexed sources so @docs references return relevant, up-to-date information.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Flow-Aware Development Pattern&lt;/h3&gt;
&lt;p&gt;Leverage Windsurf&apos;s flow tracking by working naturally:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Make edits in the editor (Windsurf tracks your changes)&lt;/li&gt;
&lt;li&gt;Run tests in the terminal (Windsurf observes the results)&lt;/li&gt;
&lt;li&gt;Ask Cascade a question (it already knows what you changed and what failed)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This removes the need to manually explain what you just did. Windsurf already knows.&lt;/p&gt;
&lt;h3&gt;The Rules-Layered Workflow&lt;/h3&gt;
&lt;p&gt;Combine global and project rules for comprehensive coverage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Global rules: Your personal coding style and preferences&lt;/li&gt;
&lt;li&gt;Project rules: Team conventions and architecture decisions&lt;/li&gt;
&lt;li&gt;Directory-specific rules: Module-specific patterns&lt;/li&gt;
&lt;/ul&gt;
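&lt;p&gt;A directory-specific rules file, assuming the &lt;code&gt;.windsurf/rules/&lt;/code&gt; layout described earlier, might look like this (the filename and rules are a sketch based on the example project above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Rules: components/

- Follow Atomic Design: atoms contain no business logic
- Client Components must begin with the &amp;quot;use client&amp;quot; directive
- Every component exports a typed Props interface
- Co-locate each component&apos;s Vitest file next to it
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Scoping rules this way keeps the project-level file short while still giving Cascade precise guidance when it works inside a specific module.&lt;/p&gt;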
&lt;h3&gt;The Agent-Then-Review Pattern&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Describe the feature to Cascade in Agent Mode&lt;/li&gt;
&lt;li&gt;Let it plan and implement the changes&lt;/li&gt;
&lt;li&gt;Review each file change in the diff view&lt;/li&gt;
&lt;li&gt;Accept, reject, or modify individual changes&lt;/li&gt;
&lt;li&gt;Ask Cascade to adjust based on your feedback&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This uses Agent Mode for speed while maintaining human oversight through the review step.&lt;/p&gt;
&lt;h3&gt;The Memory-Driven Continuity Pattern&lt;/h3&gt;
&lt;p&gt;At the end of each working session, review what Windsurf has stored as Memories. Add any important decisions or discoveries that were not automatically captured. At the start of the next session, Cascade starts with a richer understanding of your project.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not setting up Rules files.&lt;/strong&gt; Without them, Cascade applies generic conventions. Project-specific Rules are the highest-impact configuration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring Memories.&lt;/strong&gt; Stale Memories mislead the AI. Review and clean them periodically.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Underusing Agent Mode.&lt;/strong&gt; For multi-file changes, Agent Mode is dramatically faster than chat-based interactions. Trust it for structural changes and review the results.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-specifying context in prompts.&lt;/strong&gt; If your Rules and Memories are well-configured, you do not need to re-explain your conventions in every prompt.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not leveraging flow awareness.&lt;/strong&gt; Windsurf tracks your actions. Instead of explaining what you just did, ask questions that build on your recent work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping @codebase for exploration.&lt;/strong&gt; When you are unsure which files are relevant, @codebase search is more efficient than manually navigating the project tree.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Perplexity AI: A Complete Guide to Research-First AI Conversations</title><link>https://iceberglakehouse.com/posts/2026-03-context-perplexity/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-perplexity/</guid><description>
Perplexity AI occupies a unique position in the AI landscape: it is a research-first tool that combines conversational AI with real-time web search t...</description><pubDate>Sat, 07 Mar 2026 21:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Perplexity AI occupies a unique position in the AI landscape: it is a research-first tool that combines conversational AI with real-time web search to produce answers grounded in current sources. Unlike coding-focused tools or general chatbots, Perplexity is built for information retrieval, analysis, and synthesis. Its context management is designed around Spaces (persistent research workspaces), Focus Modes (search scope control), and an elastic context window that adapts to the complexity of your query.&lt;/p&gt;
&lt;p&gt;This guide covers how to manage context effectively in Perplexity for everything from quick fact-checking to sustained research projects.&lt;/p&gt;
&lt;h2&gt;How Perplexity Manages Context&lt;/h2&gt;
&lt;p&gt;Perplexity builds context from several sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Web search results&lt;/strong&gt; - real-time retrieval of current information&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spaces&lt;/strong&gt; - persistent workspaces with uploaded files and custom instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Focus Modes&lt;/strong&gt; - filters that control which sources are searched&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation history&lt;/strong&gt; - the thread of questions and answers in the current session&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uploaded files&lt;/strong&gt; - documents you provide for analysis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt; - persistent facts the system remembers about you (enterprise plans)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The key difference from other AI tools is that Perplexity actively searches the web for every query by default. This means its context combines your instructions and uploaded files with fresh, real-time information from the internet, producing answers with citations that you can verify.&lt;/p&gt;
&lt;h2&gt;Spaces: Persistent Research Workspaces&lt;/h2&gt;
&lt;p&gt;Spaces are Perplexity&apos;s equivalent of Projects in other tools. A Space groups related conversations, files, and instructions into a persistent workspace.&lt;/p&gt;
&lt;h3&gt;Creating a Space&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Spaces&lt;/strong&gt; in the sidebar&lt;/li&gt;
&lt;li&gt;Create a new Space with a descriptive name&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;Custom Instructions&lt;/strong&gt;: Guidelines that shape every response in this Space&lt;/li&gt;
&lt;li&gt;Upload &lt;strong&gt;files&lt;/strong&gt;: PDFs, documents, spreadsheets, and other reference material&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;default Focus Mode&lt;/strong&gt; for the Space&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Space Instructions&lt;/h3&gt;
&lt;p&gt;Instructions in a Space function like a system prompt for every conversation within it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Research Space: Renewable Energy Markets

## Role
You are a market research assistant focused on renewable energy.

## Requirements
- Cite all claims with sources less than 6 months old
- Include market size and growth rate data when available
- Compare data across geographic regions when relevant
- Flag any statistics from sources over 1 year old

## Format
- Use structured sections with clear headers
- Include a &amp;quot;Sources&amp;quot; section at the end of every response
- Present data in tables when comparing multiple items
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;File Uploads in Spaces&lt;/h3&gt;
&lt;p&gt;Spaces support various file types for persistent reference:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Type&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Research papers, reports, whitepapers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Analysis templates, style guides&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spreadsheets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data for analysis and comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text/Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Notes, outlines, custom context documents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Files in a Space are available across all conversations in that Space. This means you upload a report once and can reference it in every subsequent conversation.&lt;/p&gt;
&lt;h3&gt;When to Create a Space&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You are researching a topic across multiple sessions&lt;/li&gt;
&lt;li&gt;You have reference documents you want the AI to consult alongside web results&lt;/li&gt;
&lt;li&gt;You need consistent response formatting and focus&lt;/li&gt;
&lt;li&gt;You are working on a project that requires accumulating research over time&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Focus Modes: Controlling Search Scope&lt;/h2&gt;
&lt;p&gt;Focus Modes let you control where Perplexity searches for information:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Focus Mode&lt;/th&gt;
&lt;th&gt;Sources&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;All&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Entire web&lt;/td&gt;
&lt;td&gt;General research, broad questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Academic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Scholar, research databases&lt;/td&gt;
&lt;td&gt;Scientific research, literature reviews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Writing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No web search (uses training data)&lt;/td&gt;
&lt;td&gt;Content creation, drafting, brainstorming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Math&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Computation-focused, mathematical sources&lt;/td&gt;
&lt;td&gt;Calculations, proofs, statistical analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Video&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;YouTube and video platforms&lt;/td&gt;
&lt;td&gt;Tutorial discovery, visual explanations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Social&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reddit, forums, social platforms&lt;/td&gt;
&lt;td&gt;Community opinions, user experiences, discussions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Using Focus Modes as Context Filters&lt;/h3&gt;
&lt;p&gt;Focus Modes are a form of context management because they determine what kind of information reaches the model. Choosing the right Focus Mode prevents irrelevant results from diluting the response:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Researching a technical specification?&lt;/strong&gt; Use &amp;quot;All&amp;quot; for comprehensive coverage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing a literature review?&lt;/strong&gt; Use &amp;quot;Academic&amp;quot; to prioritize peer-reviewed sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looking for real-world experiences?&lt;/strong&gt; Use &amp;quot;Social&amp;quot; to surface personal accounts and community discussions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Drafting text without needing web data?&lt;/strong&gt; Use &amp;quot;Writing&amp;quot; to focus on generation rather than retrieval&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Switching Focus Modes Mid-Research&lt;/h3&gt;
&lt;p&gt;You can switch Focus Modes within a Space. A common pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with &amp;quot;Academic&amp;quot; to find foundational research&lt;/li&gt;
&lt;li&gt;Switch to &amp;quot;All&amp;quot; for industry reports and market data&lt;/li&gt;
&lt;li&gt;Use &amp;quot;Social&amp;quot; to gauge public perception and user experiences&lt;/li&gt;
&lt;li&gt;Switch to &amp;quot;Writing&amp;quot; to draft your synthesis&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each mode shapes the context differently, giving you control over the type of information the model works with.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels&lt;/h2&gt;
&lt;h3&gt;Quick Questions (Minimal Context)&lt;/h3&gt;
&lt;p&gt;For factual questions with clear answers, just ask. Perplexity will search the web and return a sourced response:&lt;/p&gt;
&lt;p&gt;&amp;quot;What is the current market size of the global data analytics industry?&amp;quot;&lt;/p&gt;
&lt;p&gt;No Space, no file uploads, no special Focus Mode needed. Perplexity&apos;s default behavior handles this well.&lt;/p&gt;
&lt;h3&gt;Focused Research (Moderate Context)&lt;/h3&gt;
&lt;p&gt;For deeper exploration, create a Space with instructions and upload relevant reference material:&lt;/p&gt;
&lt;p&gt;&amp;quot;Based on the market report I uploaded and current web data, compare the growth trajectories of the three largest cloud providers in the data analytics space.&amp;quot;&lt;/p&gt;
&lt;p&gt;The combination of uploaded files (for baseline data) and web search (for current information) produces comprehensive analysis.&lt;/p&gt;
&lt;h3&gt;Extended Research Projects (Comprehensive Context)&lt;/h3&gt;
&lt;p&gt;For multi-week research projects, use a fully configured Space with detailed instructions, multiple uploaded documents, and strategic Focus Mode switching. Build on previous conversations by referencing insights from earlier threads.&lt;/p&gt;
&lt;h2&gt;Deep Research 2.0&lt;/h2&gt;
&lt;p&gt;Perplexity&apos;s Deep Research feature performs multi-step research autonomously. When you invoke Deep Research, the system:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Analyzes your question and creates a research plan&lt;/li&gt;
&lt;li&gt;Executes multiple web searches across diverse sources&lt;/li&gt;
&lt;li&gt;Reads and analyzes full articles (not just snippets)&lt;/li&gt;
&lt;li&gt;Synthesizes findings into a comprehensive report&lt;/li&gt;
&lt;li&gt;Provides structured output with citations for every claim&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Deep Research is available on Pro plans and uses significantly more compute than standard queries. The tradeoff is worth it for complex questions that require multi-source synthesis.&lt;/p&gt;
&lt;h3&gt;Context Management for Deep Research&lt;/h3&gt;
&lt;p&gt;Deep Research benefits from clear, specific prompts. Because the system executes autonomously, your initial prompt is the primary context it works from:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Less effective:&lt;/strong&gt; &amp;quot;Tell me about AI in healthcare&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;More effective:&lt;/strong&gt; &amp;quot;Research the current state of AI-powered diagnostic tools in radiology. Focus on: (1) FDA-approved systems as of 2026, (2) clinical accuracy compared to human radiologists, (3) adoption rates across US hospitals, and (4) barriers to wider adoption. Prioritize peer-reviewed sources and official regulatory data.&amp;quot;&lt;/p&gt;
&lt;p&gt;The specific prompt gives Deep Research a structured plan to follow, producing a more focused and useful report.&lt;/p&gt;
&lt;h2&gt;Structuring Prompts for Effective Context&lt;/h2&gt;
&lt;h3&gt;The Research Question Framework&lt;/h3&gt;
&lt;p&gt;Structure your prompts using this framework for best results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Topic:&lt;/strong&gt; What are you researching?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scope:&lt;/strong&gt; What specific aspects matter?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sources:&lt;/strong&gt; What type of sources do you want?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recency:&lt;/strong&gt; How current must the information be?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Format:&lt;/strong&gt; How should the response be structured?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Example: &amp;quot;Research [topic]. Focus on [scope]. Prioritize [source type] from [time period]. Present findings as [format].&amp;quot;&lt;/p&gt;
&lt;h3&gt;Follow-Up Strategies&lt;/h3&gt;
&lt;p&gt;Perplexity maintains conversation context within a thread. Use follow-ups strategically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Drilling down:&lt;/strong&gt; &amp;quot;Tell me more about point 3 from your previous response&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pivoting:&lt;/strong&gt; &amp;quot;How does this compare to the European market?&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Validating:&lt;/strong&gt; &amp;quot;Find additional sources that support or contradict the statistics you cited&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Updating:&lt;/strong&gt; &amp;quot;What has changed on this topic in the last 3 months?&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each follow-up builds on the accumulated context of the conversation, producing progressively deeper analysis.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Perplexity supports local MCP (Model Context Protocol) servers on its macOS desktop application. This allows the AI to connect to external tools and data sources running on your local machine, extending its capabilities beyond web search.&lt;/p&gt;
&lt;h3&gt;How MCP Works in Perplexity&lt;/h3&gt;
&lt;p&gt;On the macOS app, you can configure local MCP servers that provide Perplexity with access to your file system, local databases, applications, and other services. This is configured through the app&apos;s settings. Remote MCP servers (cloud-based services) are planned for paid subscribers.&lt;/p&gt;
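&lt;p&gt;The configuration typically follows the standard MCP JSON convention used by other MCP clients. The server name, package, and path below are illustrative, and the exact settings surface in the macOS app may differ:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;filesystem&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-filesystem&amp;quot;, &amp;quot;/Users/you/Documents/research&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;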
&lt;h3&gt;When MCP Adds Value&lt;/h3&gt;
&lt;p&gt;For most Perplexity use cases, web search is the primary context extension mechanism. MCP adds value when you need Perplexity to combine its web research capabilities with local data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Researching a topic while cross-referencing your local documents&lt;/li&gt;
&lt;li&gt;Analyzing data from a local database alongside web-sourced information&lt;/li&gt;
&lt;li&gt;Integrating with local development tools or APIs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Web Search vs. MCP&lt;/h3&gt;
&lt;p&gt;Where other tools use MCP to reach databases or APIs, Perplexity&apos;s distinguishing feature is its web search capability. MCP complements this by adding local data access, but for most research workflows, web search remains the primary context extension. If you need extensive MCP functionality (writing code, managing databases, interacting with multiple external services), pair Perplexity with an agent-oriented tool such as Claude Desktop or Cursor.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;PDFs in Perplexity&lt;/h3&gt;
&lt;p&gt;Perplexity handles PDFs well, especially for research papers and reports. Upload them to a Space for persistent reference. Perplexity can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extract text and answer questions about the content&lt;/li&gt;
&lt;li&gt;Compare information across multiple uploaded PDFs&lt;/li&gt;
&lt;li&gt;Combine uploaded PDF data with web search results&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Markdown&lt;/h3&gt;
&lt;p&gt;For context documents you author (instructions, outlines, research frameworks), Markdown is cleaner and more precisely parsed. Use Markdown for structure-dependent content where formatting matters.&lt;/p&gt;
&lt;h3&gt;The Hybrid Approach&lt;/h3&gt;
&lt;p&gt;Use PDFs for received documents (research papers, reports, specifications). Use Markdown for documents you create (Space instructions, research frameworks, output templates).&lt;/p&gt;
&lt;h2&gt;Memory (Enterprise)&lt;/h2&gt;
&lt;p&gt;On enterprise plans, Perplexity supports persistent Memory that remembers facts about you across conversations. This is similar to ChatGPT&apos;s Memory feature and stores preferences, role information, and recurring context that you should not have to re-state every time.&lt;/p&gt;
&lt;p&gt;For individual users, Spaces serve a similar purpose by maintaining per-workspace instructions and files, even though the memory mechanism is different.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Research Pipeline Pattern&lt;/h3&gt;
&lt;p&gt;Use Perplexity as the front end of a research pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Discovery:&lt;/strong&gt; Use Deep Research to survey a topic comprehensively&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Validation:&lt;/strong&gt; Switch to Academic Focus to verify key claims with peer-reviewed sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Community insight:&lt;/strong&gt; Switch to Social Focus to understand real-world adoption and reception&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Synthesis:&lt;/strong&gt; Switch to Writing Focus to draft your analysis based on the accumulated context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Export:&lt;/strong&gt; Copy the synthesized research into your writing tool of choice&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Comparative Analysis Pattern&lt;/h3&gt;
&lt;p&gt;Use Spaces to compare multiple topics or options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upload comparison criteria as a Markdown file&lt;/li&gt;
&lt;li&gt;Research each option in a separate conversation within the Space&lt;/li&gt;
&lt;li&gt;Use a final conversation to synthesize the findings into a comparison table&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The Space maintains the criteria and accumulated research across all conversations.&lt;/p&gt;
&lt;h3&gt;The Source Quality Verification Pattern&lt;/h3&gt;
&lt;p&gt;Use Focus Mode switching to verify claims across different source types:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Find a claim in &amp;quot;All&amp;quot; mode&lt;/li&gt;
&lt;li&gt;Verify it in &amp;quot;Academic&amp;quot; mode (peer-reviewed backing)&lt;/li&gt;
&lt;li&gt;Check reception in &amp;quot;Social&amp;quot; mode (how practitioners view the claim)&lt;/li&gt;
&lt;li&gt;Check for retractions or updates in &amp;quot;All&amp;quot; mode with a date filter&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This multi-angle verification produces higher-confidence research than relying on a single source type.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Spaces for project research.&lt;/strong&gt; Individual conversations lose context when you close them. Spaces maintain your instructions, files, and conversation history persistently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring Focus Modes.&lt;/strong&gt; Using &amp;quot;All&amp;quot; for everything misses the specialized results that Academic, Social, and other modes provide. Match the mode to the question.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vague Deep Research prompts.&lt;/strong&gt; Deep Research executes autonomously, so a vague prompt produces a vague report. Be specific about what you want investigated and how you want it structured.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Uploading too many unrelated files to one Space.&lt;/strong&gt; Keep Spaces focused on specific topics. A Space with 30 unrelated documents dilutes the context for any specific query.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not verifying citations.&lt;/strong&gt; Perplexity provides source citations for a reason. Click through and verify key claims, especially for high-stakes research.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using Perplexity for tasks that need code execution or local tool access.&lt;/strong&gt; Perplexity is a research tool, not a coding agent. For tasks requiring code execution, terminal access, or database interaction, use a coding-focused tool instead.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted research, context management, and agentic workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Cursor: A Complete Guide to the AI-Native Code Editor</title><link>https://iceberglakehouse.com/posts/2026-03-context-cursor/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-cursor/</guid><description>
Cursor is an AI-native code editor built on the VS Code foundation that integrates AI deeply into every aspect of the development workflow. Its conte...</description><pubDate>Sat, 07 Mar 2026 20:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Cursor is an AI-native code editor built on the VS Code foundation that integrates AI deeply into every aspect of the development workflow. Its context management system is one of the most sophisticated among coding tools, combining workspace-level indexing, granular rules files, documentation integration, MCP server support, and intelligent context assembly that automatically determines which files and symbols are relevant to your current task.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism Cursor provides and explains how to configure them for productive, reliable AI-assisted development.&lt;/p&gt;
&lt;h2&gt;How Cursor Manages Context&lt;/h2&gt;
&lt;p&gt;Cursor assembles context from multiple sources, with intelligent prioritization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Workspace index&lt;/strong&gt; - a semantic index of your entire codebase built on first open&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;.cursor/rules/ files&lt;/strong&gt; - project-specific instructions in MDC format&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;@-mentions&lt;/strong&gt; - explicit context you inject into prompts (@file, @codebase, @Docs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server connections&lt;/strong&gt; - external tools and data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Active file and selection&lt;/strong&gt; - the code you are currently looking at&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation history&lt;/strong&gt; - recent messages in the current chat session&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Debug context&lt;/strong&gt; - error messages, stack traces, and terminal output&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The workspace index is what makes Cursor&apos;s context management stand out. Instead of relying on you to specify which files are relevant, Cursor semantically indexes your entire project and retrieves the most relevant code based on your query.&lt;/p&gt;
&lt;h2&gt;.cursor/rules/: Project-Level Instructions&lt;/h2&gt;
&lt;p&gt;Cursor uses &lt;code&gt;.cursor/rules/&lt;/code&gt; files in MDC (Markdown Configuration) format to provide project-level instructions. These files tell Cursor how to behave within your project.&lt;/p&gt;
&lt;h3&gt;Rule Types&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Always&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Loaded for every interaction&lt;/td&gt;
&lt;td&gt;Core conventions, style preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Auto&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Loaded when matched files are active&lt;/td&gt;
&lt;td&gt;File-type specific rules (e.g., Python vs. TypeScript)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Available to the agent for self-selection&lt;/td&gt;
&lt;td&gt;Specialized knowledge the agent invokes when needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Only loaded when explicitly referenced&lt;/td&gt;
&lt;td&gt;Rarely used instructions you invoke for specific tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Creating Rules&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;.mdc&lt;/code&gt; files in &lt;code&gt;.cursor/rules/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Python coding standards for this project
globs: [&amp;quot;**/*.py&amp;quot;]
alwaysApply: false
---

# Python Rules

## Style
- Use type hints for all function parameters and return values
- Use dataclasses or Pydantic models instead of plain dicts
- Prefer f-strings over .format() or %-formatting
- Maximum line length is 88 characters (Black default)

## Testing
- Use pytest, not unittest
- Test files mirror the source tree: src/services/auth.py -&amp;gt; tests/services/test_auth.py
- Use factories for test data, not fixtures
- Mock external services at the client boundary

## Architecture
- Business logic lives in src/services/
- Database access goes through src/repositories/
- API routes are thin: validate input, call service, return response
- Never import from internal modules; use the package&apos;s public API
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Rules Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use globs to target rules.&lt;/strong&gt; Auto rules with specific glob patterns (like &lt;code&gt;**/*.py&lt;/code&gt;) keep Python conventions separate from JavaScript conventions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keep rules actionable.&lt;/strong&gt; Every rule should describe a specific behavior the agent should follow. Vague guidance like &amp;quot;write clean code&amp;quot; wastes tokens.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document your architecture.&lt;/strong&gt; Tell Cursor where things live. Understanding your project structure prevents the agent from putting code in the wrong place.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Include negative constraints.&lt;/strong&gt; &amp;quot;Do NOT use class-based views&amp;quot; is often more effective than a long description of what to use instead.&lt;/li&gt;
&lt;/ul&gt;
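&lt;p&gt;An Always rule that applies these practices might look like this. The frontmatter fields mirror the Auto rule example above, and the specific conventions are placeholders for your own:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Core team conventions, loaded for every interaction
alwaysApply: true
---

# Core Rules

- Every new module needs unit tests before merge
- Do NOT use class-based views; use function-based views
- Do NOT add new dependencies without an explicit request
&lt;/code&gt;&lt;/pre&gt;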
&lt;h2&gt;@-Mentions: Explicit Context Injection&lt;/h2&gt;
&lt;p&gt;Cursor&apos;s @-mention system lets you add specific context to any prompt.&lt;/p&gt;
&lt;h3&gt;Available @-Mentions&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mention&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reference a specific file by name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@codebase&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search the entire indexed codebase for relevant context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@Docs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search indexed documentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@web&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Search the web for current information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@git&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reference Git history (diffs, commits, branches)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@definitions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include symbol definitions referenced in your selection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;@folders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Include directory structure context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Using @codebase Effectively&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;@codebase&lt;/code&gt; is the most powerful @-mention because it triggers semantic search across your entire project. When you type:&lt;/p&gt;
&lt;p&gt;&amp;quot;@codebase How is authentication implemented in this project?&amp;quot;&lt;/p&gt;
&lt;p&gt;Cursor searches its semantic index, retrieves the most relevant files and symbols, and includes them in the context. This is far more efficient than manually specifying each file.&lt;/p&gt;
&lt;h3&gt;@Docs: Documentation-Aware Context&lt;/h3&gt;
&lt;p&gt;You can index external documentation sources so Cursor can reference them:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Cursor Settings &amp;gt; Features &amp;gt; Docs&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Add documentation URLs (framework docs, API references, internal wikis)&lt;/li&gt;
&lt;li&gt;Cursor crawls and indexes the documentation&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;@Docs&lt;/code&gt; in prompts to reference the indexed content&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Example: &amp;quot;Using @Docs for React 19, refactor this component to use the new use() hook.&amp;quot;&lt;/p&gt;
&lt;p&gt;This is particularly valuable for newer libraries where the AI&apos;s training data may be outdated.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Cursor supports MCP for connecting to external tools and services.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;MCP servers are configured in Cursor&apos;s settings:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-postgres&amp;quot;],
      &amp;quot;env&amp;quot;: {
        &amp;quot;DATABASE_URL&amp;quot;: &amp;quot;postgresql://dev@localhost:5432/mydb&amp;quot;
      }
    },
    &amp;quot;github&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-github&amp;quot;]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP in Cursor&lt;/h3&gt;
&lt;p&gt;MCP is most valuable when the task requires live data from outside the codebase:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Querying a development database to understand schema or verify data&lt;/li&gt;
&lt;li&gt;Interacting with GitHub for PR reviews or CI status&lt;/li&gt;
&lt;li&gt;Accessing internal APIs to verify integration behavior&lt;/li&gt;
&lt;li&gt;Running browser automation to test frontend changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For code-only tasks (refactoring, writing tests, fixing bugs), Cursor&apos;s built-in codebase index is sufficient.&lt;/p&gt;
&lt;h2&gt;Debug Mode and Error Context&lt;/h2&gt;
&lt;p&gt;Cursor offers a Debug Mode that automatically provides error context to the AI:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When you encounter an error in the terminal or running application&lt;/li&gt;
&lt;li&gt;Cursor captures the error message, stack trace, and relevant file context&lt;/li&gt;
&lt;li&gt;You can ask the AI to diagnose and fix the issue with full context&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This automatic error context gathering is a significant context management feature because it eliminates the manual process of copying error messages and stack traces into prompts.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Quick Fixes)&lt;/h3&gt;
&lt;p&gt;For small edits, select code in the editor and use inline editing (Cmd+K / Ctrl+K). Cursor uses the current file and selection as context. No additional setup needed.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Feature Development)&lt;/h3&gt;
&lt;p&gt;Use the chat panel with @-mentions. Reference the relevant files with @file, use @codebase for broader understanding, and include @Docs for framework-specific guidance.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Architecture Work)&lt;/h3&gt;
&lt;p&gt;Combine .cursor/rules/ with @codebase and MCP servers. The rules provide your conventions, @codebase provides structural understanding, and MCP provides live system context.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown for Rules&lt;/h3&gt;
&lt;p&gt;All .cursor/rules/ files use MDC (Markdown-based) format. Your coding standards, style guides, and architectural documentation should be in this format.&lt;/p&gt;
&lt;h3&gt;Documentation Indexing&lt;/h3&gt;
&lt;p&gt;For external documentation, use the @Docs system to index web-based docs directly. This is more effective than converting PDFs to Markdown because Cursor handles the indexing and retrieval automatically.&lt;/p&gt;
&lt;h3&gt;For Reference Material&lt;/h3&gt;
&lt;p&gt;If you have specifications or design documents in PDF form, the most practical approach is to extract key sections into .mdc rule files or Markdown documents in your repository. This makes them searchable through @codebase.&lt;/p&gt;
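&lt;p&gt;One way to script that extraction, assuming you have already dumped the PDF to plain text with whatever tool you prefer (the heading heuristic below is illustrative, not a general-purpose parser):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import re
from pathlib import Path

def text_to_markdown_files(text, out_dir):
    # Treats short numbered lines like &amp;quot;2. Data Model&amp;quot; as section headings
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    heading, body = &amp;quot;overview&amp;quot;, []

    def flush():
        # Write the current section as its own Markdown file
        if body:
            slug = re.sub(r&amp;quot;[^a-z0-9]+&amp;quot;, &amp;quot;-&amp;quot;, heading.lower()).strip(&amp;quot;-&amp;quot;)
            (out / (slug + &amp;quot;.md&amp;quot;)).write_text(&amp;quot;# &amp;quot; + heading + &amp;quot;\n\n&amp;quot; + &amp;quot;\n&amp;quot;.join(body))

    for line in text.splitlines():
        stripped = line.strip()
        if re.match(r&amp;quot;^\d+\.\s+\S&amp;quot;, stripped) and len(stripped) &amp;lt; 60:
            flush()
            heading, body = stripped, []
        else:
            body.append(line)
    flush()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each section lands in the repository as its own Markdown file, which makes it searchable through @codebase.&lt;/p&gt;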
&lt;h2&gt;Model Selection and Context Windows&lt;/h2&gt;
&lt;p&gt;Cursor supports multiple AI providers and models. Your model choice affects context management because different models have different context window sizes and capabilities.&lt;/p&gt;
&lt;h3&gt;Context Window Considerations&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;200K tokens&lt;/td&gt;
&lt;td&gt;Large codebase analysis, complex refactoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128K tokens&lt;/td&gt;
&lt;td&gt;Feature development, code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cursor Small&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Quick edits, inline completions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For large projects, choose a model with a bigger context window so Cursor can include more codebase context without hitting limits. For simple edits, a smaller, faster model is more responsive.&lt;/p&gt;
&lt;h3&gt;How Cursor Assembles Context&lt;/h3&gt;
&lt;p&gt;When you send a message in Cursor&apos;s chat, the editor automatically assembles context by:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Including the active file&lt;/strong&gt; and your cursor position&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Including any @-mentioned files&lt;/strong&gt; or resources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Searching the workspace index&lt;/strong&gt; if @codebase is used&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loading applicable rules&lt;/strong&gt; from .cursor/rules/ based on the active file type&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Including recent conversation history&lt;/strong&gt; for continuity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adding any MCP server tool descriptions&lt;/strong&gt; for agent mode&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This automatic assembly is why Cursor often produces better results than manually pasting code into a generic chatbot. The context is structured and relevant, not random.&lt;/p&gt;
&lt;h3&gt;Context Budget Management&lt;/h3&gt;
&lt;p&gt;Each prompt has a context budget limited by the model&apos;s context window. When the budget is tight:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Be selective with @file mentions (reference only files directly relevant to the task)&lt;/li&gt;
&lt;li&gt;Use @codebase instead of @file for exploratory questions (it retrieves only relevant snippets)&lt;/li&gt;
&lt;li&gt;Keep rules files concise and targeted&lt;/li&gt;
&lt;li&gt;Start new chat sessions when switching topics&lt;/li&gt;
&lt;/ul&gt;
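&lt;p&gt;The trimming logic can be sketched in a few lines of Python. This is an illustrative heuristic, not Cursor&apos;s actual implementation: it estimates tokens at roughly four characters each and includes sources in priority order until the budget is spent.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English prose and code
    return len(text) // 4

def fit_to_budget(sources, budget_tokens):
    # sources: list of (name, text) pairs in priority order,
    # e.g. rules first, then the active file, then retrieved snippets
    included, used = [], 0
    for name, text in sources:
        cost = estimate_tokens(text)
        if used + cost &amp;gt; budget_tokens:
            continue  # skip items that do not fit, keep trying smaller ones
        included.append(name)
        used += cost
    return included
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is also why low-priority context can silently drop out of long sessions: once the budget is spent, later sources simply do not make the cut.&lt;/p&gt;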
&lt;h2&gt;Workspace Indexing Deep Dive&lt;/h2&gt;
&lt;p&gt;The workspace index is Cursor&apos;s most powerful context feature. It creates a semantic understanding of your entire codebase that powers @codebase searches and the agent&apos;s ability to navigate your project.&lt;/p&gt;
&lt;h3&gt;How Indexing Works&lt;/h3&gt;
&lt;p&gt;When you open a project in Cursor:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Cursor scans all files (respecting .gitignore)&lt;/li&gt;
&lt;li&gt;It creates embeddings (semantic representations) of code symbols, functions, and classes&lt;/li&gt;
&lt;li&gt;These embeddings are stored in a local index&lt;/li&gt;
&lt;li&gt;When you ask questions, Cursor searches this index for the most relevant code&lt;/li&gt;
&lt;/ol&gt;
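&lt;p&gt;Cursor&apos;s production index uses neural embeddings, but the retrieval idea can be illustrated with a toy index that scores code chunks against a query using bag-of-words cosine similarity (the function names and scoring here are illustrative, not Cursor&apos;s internals):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import math
from collections import Counter

def vectorize(text):
    # Toy stand-in for an embedding: a word-count vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(index, query, top_k=3):
    # index maps file path -&amp;gt; chunk text; returns the best-matching paths
    q = vectorize(query)
    ranked = sorted(index, key=lambda path: cosine(vectorize(index[path]), q), reverse=True)
    return ranked[:top_k]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A real index replaces the word-count vectors with learned embeddings, which is what lets @codebase match &amp;quot;authentication&amp;quot; to a file that only mentions &amp;quot;login&amp;quot; and &amp;quot;sessions&amp;quot;.&lt;/p&gt;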
&lt;h3&gt;Indexing Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Let the index complete before starting work.&lt;/strong&gt; Look for the indexing indicator in the status bar.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-index after major changes.&lt;/strong&gt; If you merge a large branch or restructure directories, trigger a re-index.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trust the index.&lt;/strong&gt; @codebase search often finds more relevant code than you would think to include manually.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Practical Workflow Recommendations&lt;/h2&gt;
&lt;h3&gt;For New Projects&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Open the project in Cursor and let it index&lt;/li&gt;
&lt;li&gt;Create .cursor/rules/ with your core coding standards&lt;/li&gt;
&lt;li&gt;Add @Docs entries for the frameworks you are using&lt;/li&gt;
&lt;li&gt;Start with small tasks to verify Cursor understands your conventions&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;For Team Adoption&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Check .cursor/rules/ into version control&lt;/li&gt;
&lt;li&gt;Agree on shared rule categories: Always rules for team-wide standards, Auto rules for language-specific patterns&lt;/li&gt;
&lt;li&gt;Add team documentation to @Docs&lt;/li&gt;
&lt;li&gt;Create Agent rules for specialized knowledge (deployment, database conventions)&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;For Complex Features&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Start with @codebase to understand the existing implementation&lt;/li&gt;
&lt;li&gt;Use Composer for multi-file changes&lt;/li&gt;
&lt;li&gt;Reference @Docs for framework-specific guidance&lt;/li&gt;
&lt;li&gt;Use Debug Mode to quickly resolve implementation issues&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Notepads Pattern&lt;/h3&gt;
&lt;p&gt;Cursor&apos;s Notepads feature lets you create persistent context documents within the editor. Unlike .cursor/rules/ (which are loaded automatically), Notepads are reference documents you can @-mention when needed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Architecture decision records&lt;/li&gt;
&lt;li&gt;API specifications&lt;/li&gt;
&lt;li&gt;Design system documentation&lt;/li&gt;
&lt;li&gt;Onboarding guides for new team members&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Composer Pattern&lt;/h3&gt;
&lt;p&gt;Use Cursor&apos;s Composer (multi-file agent mode) for changes that span multiple files:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Describe the feature or change you want&lt;/li&gt;
&lt;li&gt;Composer plans modifications across relevant files&lt;/li&gt;
&lt;li&gt;Review the proposed changes&lt;/li&gt;
&lt;li&gt;Apply or reject each file modification individually&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Composer automatically assembles context from the workspace index, making it effective for cross-cutting changes.&lt;/p&gt;
&lt;h3&gt;The Rules Layering Strategy&lt;/h3&gt;
&lt;p&gt;Combine different rule types for comprehensive coverage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Always rules:&lt;/strong&gt; Universal team conventions (style, testing, documentation)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auto rules:&lt;/strong&gt; Language-specific standards (Python patterns, TypeScript patterns)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent rules:&lt;/strong&gt; Specialized knowledge (deployment procedures, database conventions)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This layering ensures the right context is active for the right task without overloading every interaction.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not creating .cursor/rules/.&lt;/strong&gt; Without rules, Cursor applies generic conventions that may not match your project. The rules are the single highest-impact configuration you can make.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring @codebase.&lt;/strong&gt; Many users manually specify files when @codebase would find the relevant code automatically. Trust the semantic search.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not indexing documentation.&lt;/strong&gt; If you are using a newer framework, @Docs with indexed documentation prevents the AI from relying on outdated training data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-specifying context.&lt;/strong&gt; If you include 20 files via @file when only 3 are relevant, you dilute the AI&apos;s attention. Use @codebase to let Cursor find the right files, or be selective with @file mentions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping the workspace indexing.&lt;/strong&gt; Let Cursor finish indexing your workspace on first open. The index powers @codebase and context assembly. Without it, context quality degrades significantly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Debug Mode.&lt;/strong&gt; When errors occur, Debug Mode provides structured error context that significantly improves the AI&apos;s diagnostic accuracy compared to manually pasting error messages.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for OpenWork: A Complete Guide to the Desktop AI Agent Framework</title><link>https://iceberglakehouse.com/posts/2026-03-context-openwork/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-openwork/</guid><description>
OpenWork is a desktop-native AI agent framework designed for local, multi-step task execution on your computer. Unlike browser-based AI tools or term...</description><pubDate>Sat, 07 Mar 2026 19:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenWork is a desktop-native AI agent framework designed for local, multi-step task execution on your computer. Unlike browser-based AI tools or terminal agents, OpenWork operates as a desktop application that can interact with your file system, manage long-running sessions, and execute complex workflows autonomously. Its context management centers on Skills, session persistence, direct file system access, and a plugin architecture that extends its capabilities.&lt;/p&gt;
&lt;p&gt;This guide explains how to manage context effectively in OpenWork to delegate complex tasks, maintain continuity across sessions, and build reusable automation workflows.&lt;/p&gt;
&lt;h2&gt;How OpenWork Manages Context&lt;/h2&gt;
&lt;p&gt;OpenWork builds its context from several layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Skills&lt;/strong&gt; - predefined capability packages that define what the agent can do&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session state&lt;/strong&gt; - persistent history and progress tracking across interactions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File system access&lt;/strong&gt; - direct read/write access to your local files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Plugin extensions&lt;/strong&gt; - additional capabilities including MCP server connections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task definitions&lt;/strong&gt; - structured descriptions of multi-step workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Your instructions&lt;/strong&gt; - natural language guidance provided at task creation&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The key difference between OpenWork and other tools is its desktop-native design. It is built to interact with your operating system, not just with text in a terminal or browser. This means it can manage files, organize folders, process documents, and perform tasks that span multiple applications.&lt;/p&gt;
&lt;h2&gt;Skills: The Foundation of OpenWork&apos;s Capabilities&lt;/h2&gt;
&lt;p&gt;Skills in OpenWork define focused areas of expertise. Each Skill packages instructions, tools, and workflows into a reusable unit.&lt;/p&gt;
&lt;h3&gt;Built-In Skills&lt;/h3&gt;
&lt;p&gt;OpenWork ships with core Skills for common tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;File Management:&lt;/strong&gt; Organizing, renaming, moving, and transforming files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document Processing:&lt;/strong&gt; Reading, summarizing, and creating documents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Analysis:&lt;/strong&gt; Processing spreadsheets, CSVs, and structured data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web Research:&lt;/strong&gt; Gathering information from web sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code Assistance:&lt;/strong&gt; Writing, reviewing, and refactoring code&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Creating Custom Skills&lt;/h3&gt;
&lt;p&gt;Define custom Skills that match your specific workflows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Skill: Monthly Report Generator

## Purpose
Generate monthly departmental reports by combining data from multiple sources.

## Inputs Required
- Sales data CSV from /data/sales/
- Customer feedback file from /data/feedback/
- Team metrics from /data/team/

## Process
1. Read and validate all input files
2. Calculate key metrics (revenue, growth, satisfaction scores)
3. Generate narrative summary for each section
4. Format the report using the template in /templates/monthly-report.md
5. Save to /reports/YYYY-MM-monthly-report.md

## Quality Checks
- All numerical values must be sourced from the input data
- The report must include year-over-year comparisons
- Format all currency values with two decimal places
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Skill Selection and Context&lt;/h3&gt;
&lt;p&gt;When you assign a task, OpenWork selects the relevant Skills based on the task description. The selected Skills become part of the active context, giving the agent the specific instructions it needs for that type of work. This means well-defined Skills reduce the amount of context you need to provide in each task description.&lt;/p&gt;
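&lt;p&gt;For example, once the Monthly Report Generator Skill above is defined, the task description itself can stay terse because the Skill carries the detail. The wording below is illustrative, not a fixed OpenWork syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;Generate the monthly report for February 2026 using the
Monthly Report Generator skill. The February data has already
been placed in /data/sales/, /data/feedback/, and /data/team/.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Everything else (metrics, template, quality checks) comes from the Skill definition rather than the prompt.&lt;/p&gt;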
&lt;h2&gt;Session Management and Persistence&lt;/h2&gt;
&lt;p&gt;OpenWork maintains persistent sessions that carry context across interactions. This is critical for multi-step tasks that span hours or days. Unlike web-based AI tools where closing the browser tab loses your conversation state, OpenWork sessions are durably stored on your local machine and survive application restarts.&lt;/p&gt;
&lt;h3&gt;Session State&lt;/h3&gt;
&lt;p&gt;Each session tracks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Conversation history:&lt;/strong&gt; Every instruction and response&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File operations:&lt;/strong&gt; What files were read, created, or modified&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task progress:&lt;/strong&gt; Current step in multi-step workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent decisions:&lt;/strong&gt; Why specific actions were taken (for auditability)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Resuming Sessions&lt;/h3&gt;
&lt;p&gt;When you return to OpenWork after closing it, your sessions are preserved. You can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Continue where you left off on an interrupted task&lt;/li&gt;
&lt;li&gt;Review what the agent did while you were away (for scheduled tasks)&lt;/li&gt;
&lt;li&gt;Provide additional instructions based on completed work&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Starting Fresh&lt;/h3&gt;
&lt;p&gt;For unrelated work, start a new session. Carrying over context from a previous project creates noise that degrades the agent&apos;s focus.&lt;/p&gt;
&lt;h2&gt;File System Access: Direct Local Interaction&lt;/h2&gt;
&lt;p&gt;OpenWork&apos;s direct file system access is one of its primary context advantages. The agent reads files in real time (not from uploaded snapshots) and writes output directly to your file system.&lt;/p&gt;
&lt;h3&gt;Context from Your File System&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Project structures:&lt;/strong&gt; The agent can browse directories to understand organization&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document contents:&lt;/strong&gt; Read any text-based file without manual copying&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data files:&lt;/strong&gt; Process CSVs, JSON files, and other structured data in place&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configuration:&lt;/strong&gt; Read settings files to understand tool configurations&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Organize files before delegating.&lt;/strong&gt; A well-structured file system gives OpenWork better context than a messy one.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use descriptive file names.&lt;/strong&gt; &lt;code&gt;q3-revenue-analysis.csv&lt;/code&gt; gives the agent more context than &lt;code&gt;data2.csv&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create a dedicated working directory&lt;/strong&gt; for each project or task category.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Store templates&lt;/strong&gt; in a consistent location so Skills can reference them.&lt;/li&gt;
&lt;/ul&gt;
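&lt;p&gt;A working directory organized along these lines might look like the following. This layout is illustrative, not a structure OpenWork requires:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;/projects/monthly-reporting/
  data/
    sales/            # input CSVs, one file per month
    feedback/         # customer feedback exports
    team/             # team metrics JSON
  templates/
    monthly-report.md # referenced by the Skill definition
  reports/            # agent output lands here
&lt;/code&gt;&lt;/pre&gt;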
&lt;h2&gt;MCP Support Through Plugins&lt;/h2&gt;
&lt;p&gt;OpenWork supports MCP servers through its plugin architecture, enabling connections to external data sources and tools.&lt;/p&gt;
&lt;h3&gt;When MCP Adds Value&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Database integration:&lt;/strong&gt; Let OpenWork query databases for report generation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud storage:&lt;/strong&gt; Access files in Google Drive, OneDrive, or S3&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API integration:&lt;/strong&gt; Connect to internal services for data retrieval&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Communication tools:&lt;/strong&gt; Draft messages or pull context from Slack, email, or other platforms&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;MCP servers are configured through OpenWork&apos;s settings panel. Each server connection becomes available as a tool that Skills can use.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels&lt;/h2&gt;
&lt;h3&gt;Simple Tasks (Minimal Context)&lt;/h3&gt;
&lt;p&gt;For straightforward file operations (&amp;quot;Rename all files in /downloads/ to include today&apos;s date&amp;quot;), the task description and file system access provide sufficient context.&lt;/p&gt;
&lt;h3&gt;Moderate Tasks&lt;/h3&gt;
&lt;p&gt;For tasks requiring judgment (&amp;quot;Review the documents in /contracts/ and flag any that expire within 30 days&amp;quot;), provide the criteria and desired output format. OpenWork will use its Skills and file access to execute.&lt;/p&gt;
&lt;h3&gt;Complex Tasks&lt;/h3&gt;
&lt;p&gt;For multi-step workflows (&amp;quot;Create a quarterly business review presentation from data in three different folders, following the template in /templates/&amp;quot;), invest in a detailed task definition and ensure the relevant Skills are configured.&lt;/p&gt;
&lt;h2&gt;Structuring Context for Effective Delegation&lt;/h2&gt;
&lt;h3&gt;The Briefing Document Approach&lt;/h3&gt;
&lt;p&gt;For complex tasks, create a briefing document that OpenWork reads before starting:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Task Briefing: Q3 Performance Analysis

## Objective
Create a comprehensive performance analysis comparing Q3 results 
against Q2 and the same quarter last year.

## Data Sources
- /data/revenue/q3-2026.csv (primary revenue data)
- /data/revenue/q2-2026.csv (previous quarter)
- /data/revenue/q3-2025.csv (year-over-year comparison)
- /data/kpis/team-metrics.json (operational metrics)

## Required Sections
1. Executive Summary (250 words max)
2. Revenue Analysis with charts
3. Year-over-Year Comparison
4. Team Performance Metrics
5. Recommendations

## Formatting
- Use the template at /templates/quarterly-analysis.md
- All percentages to one decimal place
- Currency in USD with comma separators
- Charts as ASCII/text-based tables

## Quality Standards
- Every claim must reference a specific data point
- Include both absolute and percentage change figures
- Flag any anomalies or data gaps
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This structured briefing gives OpenWork comprehensive context without relying on interactive conversation.&lt;/p&gt;
&lt;h3&gt;The Progressive Detail Pattern&lt;/h3&gt;
&lt;p&gt;Provide context in layers, starting broad and getting specific:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;High-level goal:&lt;/strong&gt; &amp;quot;Create a monthly financial report&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Specific requirements:&lt;/strong&gt; &amp;quot;Include revenue, costs, and margin analysis&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data locations:&lt;/strong&gt; &amp;quot;Source data is in /finance/monthly/&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quality criteria:&lt;/strong&gt; &amp;quot;All numbers must reconcile with the source data&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Output format:&lt;/strong&gt; &amp;quot;Follow the template in /templates/&amp;quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each layer adds specificity without contradicting previous layers.&lt;/p&gt;
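&lt;p&gt;Put together, the five layers read as a single delegation (a hypothetical example):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;Create a monthly financial report covering revenue, costs,
and margin analysis. Source data is in /finance/monthly/.
All numbers must reconcile with the source data. Follow the
template in /templates/.
&lt;/code&gt;&lt;/pre&gt;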
&lt;h2&gt;Multi-Agent Coordination&lt;/h2&gt;
&lt;p&gt;OpenWork can coordinate multiple agents working on related but independent tasks:&lt;/p&gt;
&lt;h3&gt;Parallel Execution&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Agent 1:&lt;/strong&gt; Processes financial data and creates charts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent 2:&lt;/strong&gt; Summarizes customer feedback from text files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent 3:&lt;/strong&gt; Compiles operational metrics from log files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each agent works with its own focused context, and the results are combined into a final deliverable.&lt;/p&gt;
&lt;h3&gt;Sequential Handoffs&lt;/h3&gt;
&lt;p&gt;For workflows where each step depends on the previous one:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Agent A produces raw analysis&lt;/li&gt;
&lt;li&gt;Agent B reviews and refines the analysis&lt;/li&gt;
&lt;li&gt;Agent C formats the final output&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The context from each step flows to the next, creating a pipeline of increasingly refined output.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;OpenWork can read PDFs directly from your file system. Use PDFs for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Existing reports and documents that are already in PDF format&lt;/li&gt;
&lt;li&gt;External specifications or contracts received from others&lt;/li&gt;
&lt;li&gt;Formatted documents where layout matters&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Markdown&lt;/h3&gt;
&lt;p&gt;For documents you create specifically for OpenWork (templates, instructions, style guides), use Markdown. It parses more reliably and is easier for the agent to reference precisely.&lt;/p&gt;
&lt;h3&gt;The File-Based Advantage&lt;/h3&gt;
&lt;p&gt;Because OpenWork accesses files directly (not through uploads), the format matters less than it does for web-based tools. Both PDFs and Markdown are readable from the file system. Choose based on the source: use the original format for received documents, and Markdown for documents you author.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Scheduled Workflow Pattern&lt;/h3&gt;
&lt;p&gt;Set up recurring tasks that OpenWork executes on a schedule:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Define the task with clear inputs, processes, and outputs&lt;/li&gt;
&lt;li&gt;Schedule it to run at a specific time (daily, weekly, monthly)&lt;/li&gt;
&lt;li&gt;OpenWork executes the task autonomously and saves the results&lt;/li&gt;
&lt;li&gt;Review the output when convenient&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is ideal for report generation, data processing, file organization, and routine maintenance tasks.&lt;/p&gt;
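&lt;p&gt;A scheduled task definition can follow the same briefing-document format shown earlier. The example below is a sketch; the exact way you express the schedule depends on your OpenWork version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Scheduled Task: Weekly Downloads Cleanup

## Schedule
Every Friday at 17:00

## Process
1. Move files older than 30 days from /downloads/ to /archive/
2. Group remaining files into subfolders by file type
3. Write a summary of actions taken to /reports/cleanup-log.md
&lt;/code&gt;&lt;/pre&gt;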
&lt;h3&gt;The Multi-Step Pipeline Pattern&lt;/h3&gt;
&lt;p&gt;Chain multiple Skills into a pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Step 1 (Data Collection):&lt;/strong&gt; Gather data from multiple sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step 2 (Processing):&lt;/strong&gt; Clean, transform, and analyze the data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step 3 (Generation):&lt;/strong&gt; Create the output document or presentation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step 4 (Verification):&lt;/strong&gt; Check the output against quality criteria&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each step builds on the context from previous steps, creating a coherent end-to-end workflow.&lt;/p&gt;
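&lt;p&gt;Expressed as a briefing, a four-step pipeline might look like this (paths and steps are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Pipeline: Quarterly Review Deck

1. Collect: read the CSVs in /data/q3/ and notes in /notes/q3/
2. Process: compute quarter-over-quarter deltas per region
3. Generate: draft the deck using /templates/qbr-deck.md
4. Verify: confirm every figure traces back to a file from step 1
&lt;/code&gt;&lt;/pre&gt;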
&lt;h3&gt;The Delegation Escalation Pattern&lt;/h3&gt;
&lt;p&gt;Start with simple delegations and gradually increase complexity:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; File organization and simple document creation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 2:&lt;/strong&gt; Data processing and report generation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 3:&lt;/strong&gt; Multi-source research and synthesis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 4:&lt;/strong&gt; Fully automated recurring workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This builds your confidence in OpenWork&apos;s handling of context while gradually training the agent (through Skills and session history) on your specific needs.&lt;/p&gt;
&lt;h2&gt;When to Use OpenWork vs. Other Tools&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use OpenWork when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your tasks involve desktop-level file management&lt;/li&gt;
&lt;li&gt;You need multi-step autonomous execution&lt;/li&gt;
&lt;li&gt;You want scheduled, recurring task automation&lt;/li&gt;
&lt;li&gt;Your work is document-centric (reports, presentations, data processing)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use a terminal agent (Claude Code, Gemini CLI, OpenCode) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your work is code-centric&lt;/li&gt;
&lt;li&gt;You need direct terminal command execution&lt;/li&gt;
&lt;li&gt;You want inline access to compilers, test runners, and build tools&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use a web-based tool (ChatGPT, Claude Web) when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need interactive conversation and brainstorming&lt;/li&gt;
&lt;li&gt;The task is primarily knowledge-based&lt;/li&gt;
&lt;li&gt;You do not need local file system access&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vague task descriptions.&lt;/strong&gt; &amp;quot;Work on my files&amp;quot; gives OpenWork nothing to execute. Specify what files, what action, and what output you expect.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping Skills for repeatable work.&lt;/strong&gt; If you delegate the same type of task more than twice, create a Skill for it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not reviewing autonomous output.&lt;/strong&gt; Scheduled tasks run without supervision. Always review the results, especially during the first few runs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Disorganized file systems.&lt;/strong&gt; OpenWork&apos;s effectiveness depends on finding and understanding your files. Messy directories produce messy results.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-scoping single tasks.&lt;/strong&gt; Break large projects into multiple tasks with clear handoff points. OpenWork handles focused, well-defined tasks better than vague, sweeping ones.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not leveraging session persistence.&lt;/strong&gt; If a task is partially complete, resume the session rather than starting over. The carried context improves continuity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI agents and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for OpenCode: A Complete Guide to the Open-Source Terminal AI Agent</title><link>https://iceberglakehouse.com/posts/2026-03-context-opencode/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-opencode/</guid><description>
OpenCode is an open-source terminal-based AI coding agent that prioritizes privacy, local-first operation, and broad model provider support. Built as...</description><pubDate>Sat, 07 Mar 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenCode is an open-source terminal-based AI coding agent that prioritizes privacy, local-first operation, and broad model provider support. Built as a TUI (terminal user interface) application, it runs entirely in your terminal and supports dozens of LLM providers from OpenAI and Anthropic to local models through Ollama. Its context management system is built around configuration files, session persistence, MCP integration, and a dual-agent architecture that separates planning from code generation.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism OpenCode offers and explains how to configure them for effective development workflows, regardless of which model provider you choose.&lt;/p&gt;
&lt;h2&gt;The TUI Advantage for Context Management&lt;/h2&gt;
&lt;p&gt;OpenCode&apos;s TUI provides a structured visual interface within your terminal. Unlike bare CLI tools where you interact through plain text, the TUI offers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A conversation panel showing the full history with syntax-highlighted code blocks&lt;/li&gt;
&lt;li&gt;A file browser for navigating your project structure&lt;/li&gt;
&lt;li&gt;A status bar showing the active model, session state, and token usage&lt;/li&gt;
&lt;li&gt;Visual indicators for agent mode (Plan vs. Build)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The TUI makes context management more tangible because you can see what the agent is working with. Token usage indicators help you understand when you are approaching context limits, and the session panel lets you manage conversation history visually.&lt;/p&gt;
&lt;h2&gt;How OpenCode Manages Context&lt;/h2&gt;
&lt;p&gt;OpenCode assembles its context from several sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;opencode.json&lt;/strong&gt; - project-level configuration and instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Session history&lt;/strong&gt; - SQLite-backed persistent sessions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server connections&lt;/strong&gt; - external tools and data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LSP (Language Server Protocol)&lt;/strong&gt; integration - real-time code intelligence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The codebase&lt;/strong&gt; - files, directories, and project structure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom commands&lt;/strong&gt; - user-defined reusable operations&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What distinguishes OpenCode from other terminal agents is its architectural separation between &amp;quot;Build&amp;quot; and &amp;quot;Plan&amp;quot; agents. The Build agent writes code and makes changes. The Plan agent reasons about architecture and strategy without modifying files. This separation affects how you structure context: planning tasks need architectural context, while building tasks need implementation detail.&lt;/p&gt;
&lt;h2&gt;opencode.json: Project Configuration&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;opencode.json&lt;/code&gt; file in your project root is the primary configuration mechanism. It defines provider settings, model selection, and project-specific context.&lt;/p&gt;
&lt;h3&gt;Basic Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;$schema&amp;quot;: &amp;quot;https://opencode.ai/config.schema.json&amp;quot;,
  &amp;quot;provider&amp;quot;: {
    &amp;quot;name&amp;quot;: &amp;quot;anthropic&amp;quot;,
    &amp;quot;model&amp;quot;: &amp;quot;claude-sonnet-4.5&amp;quot;
  },
  &amp;quot;context&amp;quot;: {
    &amp;quot;instructions&amp;quot;: &amp;quot;This is a Python FastAPI application with PostgreSQL. Use Ruff for linting and pytest for testing. Follow PEP 8 strictly.&amp;quot;,
    &amp;quot;include&amp;quot;: [&amp;quot;src/&amp;quot;, &amp;quot;tests/&amp;quot;, &amp;quot;docs/&amp;quot;],
    &amp;quot;exclude&amp;quot;: [&amp;quot;*.pyc&amp;quot;, &amp;quot;__pycache__/&amp;quot;, &amp;quot;.venv/&amp;quot;]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Context Instructions&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;context.instructions&lt;/code&gt; field functions like CLAUDE.md or GEMINI.md for other tools. Include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your technology stack and versions&lt;/li&gt;
&lt;li&gt;Coding conventions and style preferences&lt;/li&gt;
&lt;li&gt;Testing strategy and framework&lt;/li&gt;
&lt;li&gt;Architecture decisions and patterns&lt;/li&gt;
&lt;li&gt;Build and deployment commands&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Include and Exclude Patterns&lt;/h3&gt;
&lt;p&gt;Control what OpenCode sees by specifying include and exclude patterns. This focuses the agent&apos;s attention on relevant code and prevents it from wasting context on generated files, dependencies, or build artifacts.&lt;/p&gt;
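&lt;p&gt;For a small local model, a tightly scoped configuration might look like the sketch below. The &lt;code&gt;context&lt;/code&gt; keys follow the earlier example, and the model name is a placeholder; verify both against the current &lt;code&gt;opencode.json&lt;/code&gt; schema:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;provider&amp;quot;: {
    &amp;quot;name&amp;quot;: &amp;quot;ollama&amp;quot;,
    &amp;quot;model&amp;quot;: &amp;quot;llama3.1:8b&amp;quot;
  },
  &amp;quot;context&amp;quot;: {
    &amp;quot;include&amp;quot;: [&amp;quot;src/auth/&amp;quot;, &amp;quot;tests/auth/&amp;quot;],
    &amp;quot;exclude&amp;quot;: [&amp;quot;node_modules/&amp;quot;, &amp;quot;dist/&amp;quot;, &amp;quot;*.lock&amp;quot;]
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Narrowing &lt;code&gt;include&lt;/code&gt; to the one module you are working on keeps an 8K-context model from drowning in unrelated files.&lt;/p&gt;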
&lt;h3&gt;Provider Flexibility&lt;/h3&gt;
&lt;p&gt;OpenCode supports a wide range of providers:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Models&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-4o, o3, etc.&lt;/td&gt;
&lt;td&gt;Cloud-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Sonnet, Opus&lt;/td&gt;
&lt;td&gt;Cloud-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini Pro, Flash&lt;/td&gt;
&lt;td&gt;Cloud-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Llama, Mistral, etc.&lt;/td&gt;
&lt;td&gt;Local, private&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Many models&lt;/td&gt;
&lt;td&gt;Multi-provider routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom endpoints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any OpenAI-compatible API&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This flexibility means you can choose the right model for your context needs. Local models through Ollama keep all context on your machine. Cloud models provide more capability but send your context to external servers.&lt;/p&gt;
&lt;h2&gt;The Dual-Agent Architecture: Build vs. Plan&lt;/h2&gt;
&lt;p&gt;OpenCode&apos;s most distinctive context management feature is its separation of planning and execution into two independent agents.&lt;/p&gt;
&lt;h3&gt;The Plan Agent&lt;/h3&gt;
&lt;p&gt;The Plan agent reasons about architecture, strategy, and design without making any file changes. Use it for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Analyzing a codebase before making changes&lt;/li&gt;
&lt;li&gt;Designing an implementation approach&lt;/li&gt;
&lt;li&gt;Evaluating tradeoffs between different solutions&lt;/li&gt;
&lt;li&gt;Understanding unfamiliar code&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Plan agent receives the same project context (opencode.json, codebase, MCP) but operates read-only, so you can explore and discuss ideas without risking unintended changes.&lt;/p&gt;
&lt;h3&gt;The Build Agent&lt;/h3&gt;
&lt;p&gt;The Build agent writes code, creates files, runs commands, and makes changes to your project. It uses the planning context plus implementation-specific details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The specific files that need modification&lt;/li&gt;
&lt;li&gt;Test commands to verify changes&lt;/li&gt;
&lt;li&gt;Style and formatting requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Switching Between Agents&lt;/h3&gt;
&lt;p&gt;Switch between Plan and Build during a session to match the current need:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Start with Plan:&lt;/strong&gt; &amp;quot;Analyze the authentication module and suggest how to add OAuth support&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review the plan:&lt;/strong&gt; Evaluate the agent&apos;s architectural proposal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Switch to Build:&lt;/strong&gt; &amp;quot;Implement the OAuth integration following the approach you described&amp;quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This two-phase approach prevents the common problem of AI agents diving into implementation before understanding the architecture.&lt;/p&gt;
&lt;h2&gt;Session Persistence&lt;/h2&gt;
&lt;p&gt;OpenCode uses SQLite to persist session data across terminal sessions. This means you can close your terminal, come back later, and pick up where you left off.&lt;/p&gt;
&lt;h3&gt;What Gets Persisted&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Conversation history (messages and responses)&lt;/li&gt;
&lt;li&gt;File changes made during the session&lt;/li&gt;
&lt;li&gt;Agent state (Plan vs. Build mode)&lt;/li&gt;
&lt;li&gt;Active context (which files were being discussed)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Session Management&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Start a new session for unrelated work&lt;/li&gt;
&lt;li&gt;Continue an existing session when resuming previous work&lt;/li&gt;
&lt;li&gt;Clear session history when accumulated context becomes counterproductive&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Context Compaction&lt;/h3&gt;
&lt;p&gt;For long sessions, OpenCode supports context compaction. This summarizes older conversation history to free up context window space while retaining the essential information. Compaction is automatic and configurable: you can control how aggressively it summarizes based on your model&apos;s context window size.&lt;/p&gt;
&lt;p&gt;This is particularly important when using models with smaller context windows (like local Ollama models with 8K or 32K contexts) where every token counts. Cloud models with 128K or 200K windows have much more room, but even they benefit from compaction during extended sessions.&lt;/p&gt;
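&lt;p&gt;If your OpenCode build exposes compaction settings in &lt;code&gt;opencode.json&lt;/code&gt;, they might take a shape like the sketch below. The key names here are hypothetical, chosen only to show the kind of knobs involved; check your version&apos;s schema for the real ones:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;compaction&amp;quot;: {
    &amp;quot;enabled&amp;quot;: true,
    &amp;quot;triggerAtPercentOfWindow&amp;quot;: 80,
    &amp;quot;preserveRecentMessages&amp;quot;: 20
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The idea: start summarizing once the session fills most of the window, while keeping the most recent exchanges verbatim.&lt;/p&gt;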
&lt;h3&gt;Context Window Management Across Providers&lt;/h3&gt;
&lt;p&gt;Different providers offer different context window sizes, and your strategy should adapt:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context Size&lt;/th&gt;
&lt;th&gt;Example Providers&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Small (8K-32K)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ollama local models&lt;/td&gt;
&lt;td&gt;Aggressive compaction, focused sessions, minimal background context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Medium (64K-128K)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GPT-4o, Claude Sonnet&lt;/td&gt;
&lt;td&gt;Standard compaction, moderate session length, room for codebase context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Large (200K+)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Opus, Gemini Pro&lt;/td&gt;
&lt;td&gt;Minimal compaction needed, can handle long sessions with extensive context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Understanding your working model&apos;s context limit helps you decide how much context to load via &lt;code&gt;opencode.json&lt;/code&gt; versus providing interactively. With a small local model, lean heavily on precise &lt;code&gt;include&lt;/code&gt; patterns to keep only the most relevant files in context. With a large cloud model, you can afford broader context.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;OpenCode supports MCP through the &lt;code&gt;opencode mcp&lt;/code&gt; command, providing integration with external tools and data.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Add an MCP server
opencode mcp add my-db-server -- npx @my-org/db-mcp-server

# List configured servers
opencode mcp list

# Remove a server
opencode mcp remove my-db-server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;MCP servers can also be configured in &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;servers&amp;quot;: {
      &amp;quot;filesystem&amp;quot;: {
        &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
        &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-filesystem&amp;quot;, &amp;quot;./&amp;quot;]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP with OpenCode&lt;/h3&gt;
&lt;p&gt;The same principles apply as with other terminal agents: use MCP when the task requires data from outside the codebase (databases, APIs, external services). For code-only work, OpenCode&apos;s built-in file access is sufficient.&lt;/p&gt;
&lt;p&gt;One consideration specific to OpenCode: if you are using a local model through Ollama, the MCP servers also run on your machine. As long as those servers only touch local resources, your context never leaves your machine; an MCP server that calls an external API, however, still sends data off-device.&lt;/p&gt;
&lt;h2&gt;LSP Integration: Real-Time Code Intelligence&lt;/h2&gt;
&lt;p&gt;OpenCode integrates with Language Server Protocol services to provide richer code context. LSP gives the agent:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type information and function signatures&lt;/li&gt;
&lt;li&gt;Import resolution and dependency tracking&lt;/li&gt;
&lt;li&gt;Error and warning diagnostics from your language&apos;s toolchain&lt;/li&gt;
&lt;li&gt;Symbol navigation and reference finding&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means OpenCode understands your code at a deeper level than simple text analysis. When you ask about a function, the agent knows its type signature, where it is called from, and what it depends on.&lt;/p&gt;
&lt;h3&gt;Why LSP Matters for Context&lt;/h3&gt;
&lt;p&gt;LSP provides structured context that would otherwise require the agent to infer from raw code. Knowing that a variable is of type &lt;code&gt;List[UserModel]&lt;/code&gt; is more precise than the agent guessing from how the variable is used. This structured understanding reduces errors and produces more accurate code generation.&lt;/p&gt;
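&lt;p&gt;As an illustrative fragment (generic Python, not OpenCode-specific), consider a function like the one below. An LSP server reports the exact parameter and return types, plus every call site, rather than leaving the agent to infer them from how the code is used:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def active_users(users: list[UserModel]) -&amp;gt; list[UserModel]:
    # The language server reports the signature (users: list[UserModel])
    # -&amp;gt; list[UserModel] and can enumerate all references to active_users,
    # so the agent never has to guess the element type from call sites.
    return [u for u in users if u.is_active]
&lt;/code&gt;&lt;/pre&gt;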
&lt;h2&gt;Custom Commands&lt;/h2&gt;
&lt;p&gt;OpenCode supports user-defined custom commands that encapsulate common operations with predefined context:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;commands&amp;quot;: {
    &amp;quot;review&amp;quot;: {
      &amp;quot;description&amp;quot;: &amp;quot;Review the current branch for issues&amp;quot;,
      &amp;quot;prompt&amp;quot;: &amp;quot;Review all changes in the current branch compared to main. Check for: security issues, performance problems, missing error handling, and test coverage gaps.&amp;quot;
    },
    &amp;quot;test-all&amp;quot;: {
      &amp;quot;description&amp;quot;: &amp;quot;Run and analyze the full test suite&amp;quot;,
      &amp;quot;prompt&amp;quot;: &amp;quot;Run the complete test suite. Report any failures, flaky tests, or tests that take unusually long. Suggest fixes for any failures.&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Custom commands combine a descriptive name with a predefined prompt, creating reusable context bundles for common workflows.&lt;/p&gt;
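&lt;p&gt;Once defined, a command is typically invoked by name from the session prompt. The slash syntax below follows the common convention for custom commands, but verify it against your version&apos;s documentation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/review
/test-all
&lt;/code&gt;&lt;/pre&gt;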
&lt;h2&gt;Thinking About Context Levels in OpenCode&lt;/h2&gt;
&lt;h3&gt;Minimal Context&lt;/h3&gt;
&lt;p&gt;For quick questions about the codebase, just ask. OpenCode will explore files as needed.&lt;/p&gt;
&lt;h3&gt;Moderate Context&lt;/h3&gt;
&lt;p&gt;For feature work, set up your &lt;code&gt;opencode.json&lt;/code&gt; with clear instructions and use the Plan agent first to establish understanding before switching to Build.&lt;/p&gt;
&lt;h3&gt;Heavy Context&lt;/h3&gt;
&lt;p&gt;For complex refactoring or architectural changes, combine: detailed &lt;code&gt;opencode.json&lt;/code&gt; instructions, the Plan agent for architecture analysis, MCP servers for database or service context, and custom commands for verification steps.&lt;/p&gt;
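&lt;p&gt;Combining these pieces, a heavy-context &lt;code&gt;opencode.json&lt;/code&gt; might look like the sketch below. The &lt;code&gt;mcp&lt;/code&gt; and &lt;code&gt;commands&lt;/code&gt; sections follow the formats shown earlier; the &lt;code&gt;instructions&lt;/code&gt; key is illustrative, so confirm the exact field names against your version&apos;s schema:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;instructions&amp;quot;: [&amp;quot;docs/architecture.md&amp;quot;, &amp;quot;docs/coding-standards.md&amp;quot;],
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;servers&amp;quot;: {
      &amp;quot;postgres&amp;quot;: {
        &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
        &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-postgres&amp;quot;]
      }
    }
  },
  &amp;quot;commands&amp;quot;: {
    &amp;quot;verify&amp;quot;: {
      &amp;quot;description&amp;quot;: &amp;quot;Verify the refactor&amp;quot;,
      &amp;quot;prompt&amp;quot;: &amp;quot;Run the test suite and type checker. Report any regressions compared to main.&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;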
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown Is Preferred&lt;/h3&gt;
&lt;p&gt;OpenCode works with text-based formats. Project context documents, architecture decision records, and coding standards should be Markdown files in your repository.&lt;/p&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;If you have reference material in PDF format, convert the relevant sections to Markdown. OpenCode does not have built-in PDF parsing, so text-based formats are more reliable.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Privacy-First Development Pattern&lt;/h3&gt;
&lt;p&gt;Use Ollama with a local model for sensitive codebases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install Ollama and download a capable model (Llama 3.1, Mistral Large, etc.)&lt;/li&gt;
&lt;li&gt;Configure &lt;code&gt;opencode.json&lt;/code&gt; to use the local Ollama endpoint&lt;/li&gt;
&lt;li&gt;All context stays on your machine with zero network calls&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is particularly valuable for proprietary code, pre-launch features, or security-sensitive applications.&lt;/p&gt;
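&lt;p&gt;Step 2 might look like the following sketch. The provider field names vary between OpenCode versions, so treat them as illustrative and confirm against the current schema; Ollama&apos;s OpenAI-compatible endpoint at &lt;code&gt;http://localhost:11434/v1&lt;/code&gt; is standard:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;provider&amp;quot;: {
    &amp;quot;ollama&amp;quot;: {
      &amp;quot;baseURL&amp;quot;: &amp;quot;http://localhost:11434/v1&amp;quot;,
      &amp;quot;models&amp;quot;: { &amp;quot;llama3.1&amp;quot;: {} }
    }
  },
  &amp;quot;model&amp;quot;: &amp;quot;ollama/llama3.1&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;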
&lt;h3&gt;The Plan-Then-Build Pattern&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Start with the Plan agent to analyze the codebase&lt;/li&gt;
&lt;li&gt;Discuss the architecture and design approach&lt;/li&gt;
&lt;li&gt;Switch to Build once you agree on the plan&lt;/li&gt;
&lt;li&gt;Use custom commands to verify the implementation&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Multi-Provider Context Strategy&lt;/h3&gt;
&lt;p&gt;Use different providers for different context needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A large cloud model (GPT-4o, Claude Opus) for complex architectural planning&lt;/li&gt;
&lt;li&gt;A fast, small model for quick edits and simple tasks&lt;/li&gt;
&lt;li&gt;A local model for sensitive code that should not leave your machine&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Switch providers in &lt;code&gt;opencode.json&lt;/code&gt; based on the current task.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not configuring &lt;code&gt;opencode.json&lt;/code&gt;.&lt;/strong&gt; Without it, OpenCode has no project context beyond what it can infer from file exploration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using Build when you should Plan.&lt;/strong&gt; Jumping to code changes without planning leads to rework. Use the Plan agent first for anything non-trivial.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring context compaction.&lt;/strong&gt; With smaller model context windows, long sessions degrade quality. Let compaction do its job, or start fresh sessions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not leveraging LSP.&lt;/strong&gt; Ensure your language&apos;s LSP server is installed and running. The structured code intelligence significantly improves agent accuracy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping custom commands for repeated tasks.&lt;/strong&gt; If you run the same kind of review or test analysis frequently, create a custom command.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using cloud models for sensitive code without consideration.&lt;/strong&gt; If code privacy matters, use Ollama with local models. The trade-off is sometimes reduced capability, but the privacy guarantee is absolute.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about AI-assisted development and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Google Antigravity: A Complete Guide to the Agent-First IDE</title><link>https://iceberglakehouse.com/posts/2026-03-context-google-antigravity/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-google-antigravity/</guid><description>
Google Antigravity is an agent-first IDE built by Google DeepMind&apos;s Advanced Agentic Coding team. It approaches context management differently from o...</description><pubDate>Sat, 07 Mar 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Google Antigravity is an agent-first IDE built by Google DeepMind&apos;s Advanced Agentic Coding team. It approaches context management differently from other AI coding tools because it is designed from the ground up around agentic workflows, where the AI is not just an assistant responding to prompts, but an autonomous agent that plans, executes, tracks progress, and retains knowledge across sessions. Its context management system centers on three pillars: Skills for reusable capability, Knowledge Items for persistent memory, and Artifacts for transparent documentation of its work.&lt;/p&gt;
&lt;p&gt;This guide covers how to structure and manage context in Antigravity to get the most from its agentic capabilities.&lt;/p&gt;
&lt;h2&gt;How Antigravity Manages Context&lt;/h2&gt;
&lt;p&gt;Antigravity assembles its working context from multiple sources, layered by persistence:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Knowledge Items (KIs)&lt;/strong&gt; - persistent, distilled knowledge from past conversations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skills&lt;/strong&gt; (SKILL.md files) - reusable instruction sets for specific capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workflows&lt;/strong&gt; - step-by-step guides in the &lt;code&gt;.agents/workflows/&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conversation history&lt;/strong&gt; - the current and past interactions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The codebase&lt;/strong&gt; - files, directories, and project structure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP servers&lt;/strong&gt; - external tools and data sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task artifacts&lt;/strong&gt; - implementation plans, walkthroughs, and checklists the AI creates&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What makes Antigravity distinctive is that it actively generates and maintains its own context artifacts. The AI creates task checklists, implementation plans, and walkthroughs as it works, and these become part of the persistent context for future sessions.&lt;/p&gt;
&lt;h2&gt;Skills: Reusable Capability Packages&lt;/h2&gt;
&lt;p&gt;Skills are Antigravity&apos;s primary mechanism for defining reusable capabilities. Each Skill is a folder containing a &lt;code&gt;SKILL.md&lt;/code&gt; file with YAML frontmatter and detailed Markdown instructions.&lt;/p&gt;
&lt;h3&gt;Skill Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;.agents/skills/
  my-skill/
    SKILL.md          # Required: instructions with YAML frontmatter
    scripts/          # Optional: helper scripts
    examples/         # Optional: reference implementations
    resources/        # Optional: templates or assets
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;SKILL.md Format&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
name: deploy-to-staging
description: Deploy the application to the staging environment
---

## Prerequisites
- Docker must be installed and running
- AWS CLI must be configured with staging credentials
- The current branch must have passing CI

## Steps
1. Build the Docker image with the staging configuration
2. Push the image to ECR
3. Update the ECS task definition
4. Trigger the deployment
5. Verify the health check endpoint responds

## Verification
- Check that the /health endpoint returns 200
- Verify the deployed version matches the expected Git SHA
- Run the smoke test suite against staging
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Create Skills&lt;/h3&gt;
&lt;p&gt;Create a Skill when you have a workflow that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is performed more than once&lt;/li&gt;
&lt;li&gt;Requires specific steps in a specific order&lt;/li&gt;
&lt;li&gt;Benefits from consistent execution across team members&lt;/li&gt;
&lt;li&gt;Involves domain knowledge that is not obvious from the codebase alone&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Skills vs. Other Context Mechanisms&lt;/h3&gt;
&lt;p&gt;Skills are for procedural knowledge (&amp;quot;how to do X&amp;quot;). They differ from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Knowledge Items&lt;/strong&gt;, which store factual knowledge (&amp;quot;what is X&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GEMINI.md- or CLAUDE.md-style files&lt;/strong&gt;, which provide ambient project context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Artifacts&lt;/strong&gt;, which document specific work done in a specific session&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Knowledge Items: Persistent Memory Across Conversations&lt;/h2&gt;
&lt;p&gt;Knowledge Items (KIs) are Antigravity&apos;s mechanism for retaining knowledge across conversations. Unlike conversation history (which is session-bound), KIs are distilled, curated facts that persist indefinitely.&lt;/p&gt;
&lt;h3&gt;How KIs Work&lt;/h3&gt;
&lt;p&gt;At the end of each conversation, a separate Knowledge Subagent analyzes the conversation and extracts key information into KIs. Each KI has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;metadata.json&lt;/strong&gt;: summary, timestamps, references to original conversations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;artifacts/&lt;/strong&gt;: related files, documentation, and analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;KIs are stored in the Knowledge directory and are automatically loaded when starting new conversations. Antigravity checks KI summaries at the beginning of every session to avoid redundant work.&lt;/p&gt;
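&lt;p&gt;As an illustration, a KI&apos;s &lt;code&gt;metadata.json&lt;/code&gt; might resemble the following. The field names here are assumptions for illustration; inspect a real KI in your Knowledge directory for the exact schema:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;summary&amp;quot;: &amp;quot;Auth service uses JWT with 15-minute access tokens and rotating refresh tokens&amp;quot;,
  &amp;quot;created&amp;quot;: &amp;quot;2026-02-14T09:30:00Z&amp;quot;,
  &amp;quot;updated&amp;quot;: &amp;quot;2026-03-01T16:45:00Z&amp;quot;,
  &amp;quot;conversations&amp;quot;: [&amp;quot;conv_2026-02-14_auth-design&amp;quot;]
}
&lt;/code&gt;&lt;/pre&gt;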
&lt;h3&gt;What Gets Stored as KIs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Architecture decisions and their rationale&lt;/li&gt;
&lt;li&gt;Troubleshooting discoveries and resolutions&lt;/li&gt;
&lt;li&gt;Implementation patterns specific to your project&lt;/li&gt;
&lt;li&gt;Configuration details and their implications&lt;/li&gt;
&lt;li&gt;Integration specifics for external services&lt;/li&gt;
&lt;li&gt;Performance characteristics and optimization strategies&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Using KIs Effectively&lt;/h3&gt;
&lt;p&gt;The most important rule for KIs is: &lt;strong&gt;always check them before starting research.&lt;/strong&gt; If you are about to analyze a codebase module, check whether a KI already covers that analysis. This prevents redundant work and ensures continuity across sessions.&lt;/p&gt;
&lt;p&gt;You can also reference specific KIs in conversations by pointing Antigravity at the KI&apos;s artifact files. This is especially useful when building on previous work or when onboarding new team members who can review the accumulated KIs.&lt;/p&gt;
&lt;h2&gt;Artifacts: The Transparency System&lt;/h2&gt;
&lt;p&gt;Antigravity creates artifacts as structured Markdown documents that make the agent&apos;s work transparent and reviewable. Key artifact types include:&lt;/p&gt;
&lt;h3&gt;task.md&lt;/h3&gt;
&lt;p&gt;A checklist that tracks progress on the current task. Antigravity creates this at the start of complex work and updates it as it progresses:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Feature: User Authentication

- [x] Research existing auth patterns
- [x] Create implementation plan
- [/] Implement JWT token generation
- [ ] Add refresh token support
- [ ] Write integration tests
- [ ] Update API documentation
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;implementation_plan.md&lt;/h3&gt;
&lt;p&gt;Created during the PLANNING phase, this documents the proposed changes, file modifications, and verification strategy before any code is written. You review and approve (or modify) this plan before Antigravity proceeds to execution.&lt;/p&gt;
&lt;h3&gt;walkthrough.md&lt;/h3&gt;
&lt;p&gt;Created after completing work, this documents what was accomplished, what was tested, and the results. It serves as a record of the work and can be reviewed by team members.&lt;/p&gt;
&lt;h3&gt;Why Artifacts Matter for Context&lt;/h3&gt;
&lt;p&gt;Artifacts create a structured record that Antigravity can reference in future sessions. When you return to a project, the agent can read the previous implementation plan and walkthrough to understand what was done and why. This is far more efficient than re-analyzing the codebase from scratch.&lt;/p&gt;
&lt;h2&gt;Thinking About Context Levels in Antigravity&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Quick Tasks)&lt;/h3&gt;
&lt;p&gt;For simple questions or small fixes, just ask. Antigravity can explore the codebase, read relevant files, and provide answers without additional setup. Its file exploration tools are fast and respect &lt;code&gt;.gitignore&lt;/code&gt; patterns.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Feature Work)&lt;/h3&gt;
&lt;p&gt;For typical feature development, let Antigravity&apos;s Planning phase do the heavy lifting. It will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Analyze the codebase to understand the current architecture&lt;/li&gt;
&lt;li&gt;Create an implementation plan for your review&lt;/li&gt;
&lt;li&gt;Execute the plan once approved&lt;/li&gt;
&lt;li&gt;Verify the changes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The PLANNING &amp;gt; EXECUTION &amp;gt; VERIFICATION workflow is built into Antigravity&apos;s DNA, and each phase generates artifacts that carry context forward.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Ongoing Projects)&lt;/h3&gt;
&lt;p&gt;For sustained work across multiple sessions, invest in Skills and ensure KIs are accumulating properly. Over time, Antigravity builds a rich knowledge base about your project that makes each subsequent session more productive.&lt;/p&gt;
&lt;h2&gt;Multi-Model Support and Context Routing&lt;/h2&gt;
&lt;p&gt;Antigravity supports multiple AI models and can use different models for different subtasks. This means context management extends to model selection: some tasks benefit from larger context windows, while others benefit from faster inference.&lt;/p&gt;
&lt;p&gt;The agent handles this transparently, but being aware of it helps you understand why some responses might take longer (larger model processing more context) while others are faster (smaller model handling a focused subtask).&lt;/p&gt;
&lt;h2&gt;Browser Recording and Visual Context&lt;/h2&gt;
&lt;p&gt;Antigravity includes a built-in browser interaction system that records all browser actions as WebP videos. This creates a unique form of context: visual proof of work that can be reviewed later.&lt;/p&gt;
&lt;p&gt;For frontend development, this means Antigravity can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Navigate to web applications and interact with UI elements&lt;/li&gt;
&lt;li&gt;Take screenshots to verify visual changes&lt;/li&gt;
&lt;li&gt;Record step-by-step interactions for documentation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These recordings become part of the walkthrough artifact, providing visual evidence that changes work as intended.&lt;/p&gt;
&lt;h2&gt;Conversation History and Context Summaries&lt;/h2&gt;
&lt;p&gt;Antigravity maintains conversation logs and summaries that persist across sessions. When you start a new conversation, the system provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Summaries of recent conversations&lt;/li&gt;
&lt;li&gt;KI summaries with artifact paths&lt;/li&gt;
&lt;li&gt;Information about previously edited and viewed files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means Antigravity starts each session with awareness of what happened in recent sessions, reducing the need to re-explain context that was covered before.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Antigravity supports MCP servers for connecting to external tools and data sources. Configuration follows the standard MCP pattern familiar from other tools.&lt;/p&gt;
&lt;h3&gt;Practical Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Database access:&lt;/strong&gt; Let Antigravity query your development database to understand schema and data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Browser automation:&lt;/strong&gt; Verify frontend changes visually&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Git hosting:&lt;/strong&gt; Interact with GitHub or GitLab for PR management&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation systems:&lt;/strong&gt; Access internal wikis or knowledge bases&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;Use MCP when the task requires information from outside the codebase. For code-only work, Antigravity&apos;s built-in file system tools are sufficient. MCP adds the most value for tasks that span multiple systems (for example, updating both code and documentation, or verifying a code change against a running application).&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown Is the Native Format&lt;/h3&gt;
&lt;p&gt;Skills, KIs, and artifacts are all Markdown. If you are creating context documents for Antigravity, use Markdown.&lt;/p&gt;
&lt;h3&gt;For External References&lt;/h3&gt;
&lt;p&gt;PDF documents can be provided as context through conversation uploads. However, for persistent reference material, converting to Markdown and placing it in a project directory (or as a Skill resource) provides better integration with Antigravity&apos;s context system.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Skill-Driven Development Pattern&lt;/h3&gt;
&lt;p&gt;Create Skills for every major workflow in your development process:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;deploy-staging&lt;/code&gt; for deployment&lt;/li&gt;
&lt;li&gt;&lt;code&gt;create-api-endpoint&lt;/code&gt; for new endpoints&lt;/li&gt;
&lt;li&gt;&lt;code&gt;database-migration&lt;/code&gt; for schema changes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;security-audit&lt;/code&gt; for security reviews&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When you need to perform one of these tasks, point Antigravity at the relevant Skill. This ensures consistent execution regardless of which team member is working.&lt;/p&gt;
&lt;h3&gt;The Knowledge Accumulation Pattern&lt;/h3&gt;
&lt;p&gt;Treat KIs as a growing knowledge base about your project:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;First session: Antigravity learns the basic architecture&lt;/li&gt;
&lt;li&gt;Subsequent sessions: KIs accumulate details about specific modules, patterns, and decisions&lt;/li&gt;
&lt;li&gt;Over time: Antigravity starts with a deep understanding of your project every session&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This compounds over weeks and months, making the AI increasingly effective.&lt;/p&gt;
&lt;h3&gt;The Paired Review Pattern&lt;/h3&gt;
&lt;p&gt;Use Antigravity&apos;s PLANNING phase as a design review:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Describe the feature or change you want&lt;/li&gt;
&lt;li&gt;Review the implementation plan Antigravity creates&lt;/li&gt;
&lt;li&gt;Provide feedback and iterate on the plan&lt;/li&gt;
&lt;li&gt;Only approve execution once the plan meets your standards&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This catches design issues before code is written, saving significant time.&lt;/p&gt;
&lt;h3&gt;The Task Decomposition Pattern&lt;/h3&gt;
&lt;p&gt;For large features, let Antigravity break the work into clearly bounded subtasks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Tell Antigravity the overall goal&lt;/li&gt;
&lt;li&gt;It creates a task.md with subtasks&lt;/li&gt;
&lt;li&gt;Each subtask gets its own PLANNING &amp;gt; EXECUTION &amp;gt; VERIFICATION cycle&lt;/li&gt;
&lt;li&gt;The walkthrough artifact captures the full story for future reference&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring KI summaries.&lt;/strong&gt; Antigravity provides KI summaries at the start of each conversation. Skipping them leads to redundant work and missed context.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not creating Skills for repeatable work.&lt;/strong&gt; If you find yourself explaining the same workflow multiple times, it should be a Skill.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping the PLANNING phase.&lt;/strong&gt; Jumping straight to execution means no implementation plan to review. The PLANNING phase is where Antigravity aligns with your intent.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not reviewing artifacts.&lt;/strong&gt; Implementation plans and walkthroughs are designed for human review. Skipping them defeats the purpose of Antigravity&apos;s transparency system.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-relying on conversation context.&lt;/strong&gt; Conversation history is ephemeral. For information that should persist, ensure it gets captured in Skills or KIs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not building Workflows for common tasks.&lt;/strong&gt; The &lt;code&gt;.agents/workflows/&lt;/code&gt; directory supports step-by-step guides that Antigravity follows precisely. These are particularly useful for onboarding, deployment, and maintenance tasks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI coding agents and managing context across development workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Gemini CLI: A Complete Guide to Terminal-Native AI Development</title><link>https://iceberglakehouse.com/posts/2026-03-context-gemini-cli/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-gemini-cli/</guid><description>
Gemini CLI is an open-source terminal agent powered by Gemini models that operates directly in your command line. It brings Google&apos;s AI capabilities ...</description><pubDate>Sat, 07 Mar 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Gemini CLI is an open-source terminal agent powered by Gemini models that operates directly in your command line. It brings Google&apos;s AI capabilities into the environment where many developers already live, with a context management system built around hierarchical configuration files, persistent memory, MCP server integration, and direct codebase interaction. Unlike web-based tools where context is managed through uploads and conversation, Gemini CLI assembles its context from your project structure, your instruction files, and the tools you connect to it.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism in Gemini CLI and explains how to configure them for productive development workflows.&lt;/p&gt;
&lt;h2&gt;How Gemini CLI Assembles Context&lt;/h2&gt;
&lt;p&gt;Gemini CLI builds its working context from multiple sources, loaded in a specific hierarchy:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Global GEMINI.md&lt;/strong&gt; (&lt;code&gt;~/.gemini/GEMINI.md&lt;/code&gt;) - personal preferences that apply everywhere&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project GEMINI.md&lt;/strong&gt; (in your project directory, walking up to the root) - project conventions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Subdirectory GEMINI.md files&lt;/strong&gt; - component-specific instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory entries&lt;/strong&gt; - facts you have told the CLI to remember&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server tools&lt;/strong&gt; - external data sources and capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The current codebase&lt;/strong&gt; - files, dependencies, project structure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The conversation&lt;/strong&gt; - your prompts and responses in the current session&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;More specific sources take precedence over general ones. A subdirectory GEMINI.md instruction overrides a project-level GEMINI.md instruction on the same topic.&lt;/p&gt;
&lt;h2&gt;GEMINI.md: Persistent Project Context&lt;/h2&gt;
&lt;p&gt;GEMINI.md is the foundational context mechanism. It is a Markdown file that Gemini CLI loads automatically before every interaction.&lt;/p&gt;
&lt;h3&gt;File Hierarchy&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;~/.gemini/GEMINI.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All projects&lt;/td&gt;
&lt;td&gt;Personal coding style, universal preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;./GEMINI.md&lt;/code&gt; (project root)&lt;/td&gt;
&lt;td&gt;Current project&lt;/td&gt;
&lt;td&gt;Architecture, stack, conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;./src/GEMINI.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Specific directory&lt;/td&gt;
&lt;td&gt;Module-specific patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;What to Include&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# GEMINI.md

## Project: E-Commerce API
- Framework: Express.js on Node 22
- Database: PostgreSQL 16 with Drizzle ORM
- Testing: Vitest with supertest for API tests
- Deployment: Docker containers on Cloud Run

## Code Conventions
- Use ESM imports (no CommonJS require)
- All route handlers are async functions
- Error handling uses a centralized error middleware
- SQL migrations use Drizzle Kit

## Architecture
- Routes: src/routes/
- Services: src/services/ (business logic)
- Models: src/models/ (Drizzle schema)
- Middleware: src/middleware/
- Tests: tests/ (mirrors src/ structure)

## Do Not
- Do not use default exports
- Do not install packages without noting them
- Do not modify migration files after they have been applied
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Modular GEMINI.md Files&lt;/h3&gt;
&lt;p&gt;For complex projects, GEMINI.md files can import other Markdown files. This keeps individual files focused while allowing the CLI to assemble comprehensive context:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# GEMINI.md
See also:
- @docs/coding-standards.md
- @docs/api-conventions.md
- @docs/testing-strategy.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;The /init Command&lt;/h3&gt;
&lt;p&gt;If you are starting a new project or onboarding Gemini CLI to an existing one, run &lt;code&gt;/init&lt;/code&gt;. This command analyzes your project structure and generates a starting GEMINI.md file that captures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detected frameworks and languages&lt;/li&gt;
&lt;li&gt;Project structure&lt;/li&gt;
&lt;li&gt;Build and test commands&lt;/li&gt;
&lt;li&gt;Basic conventions inferred from the code&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Review and edit the generated file. The auto-detection is a starting point, not a finished product. Add your team&apos;s conventions, architectural decisions, and quality standards to make it comprehensive. The value of &lt;code&gt;/init&lt;/code&gt; is that it saves you from writing the boilerplate sections (project type, folder structure, detected dependencies) so you can focus on the sections only a human can supply.&lt;/p&gt;
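&lt;p&gt;As a rough illustration (actual &lt;code&gt;/init&lt;/code&gt; output depends entirely on what it detects in your project), a generated starting point might look like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# GEMINI.md (generated by /init)

## Project Overview
- Language: TypeScript (Node 22)
- Framework: Express.js
- Package manager: npm

## Commands
- Build: npm run build
- Test: npm test

## Structure
- src/ (application code)
- tests/ (Vitest suites)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From there, append the conventions, &amp;quot;do not&amp;quot; rules, and architectural rationale that no auto-detection can infer.&lt;/p&gt;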
&lt;h2&gt;Memory: Persistent Facts Across Sessions&lt;/h2&gt;
&lt;p&gt;Gemini CLI&apos;s memory system stores persistent facts that apply across all sessions and projects (when stored globally).&lt;/p&gt;
&lt;h3&gt;Adding Memories&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;/memory add We use the Google Python Style Guide for all Python code
/memory add Our PostgreSQL database runs on port 5433, not the default 5432
/memory add Always use UTC timestamps in database columns
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Viewing Memories&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;/memory show
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This displays all active memories, including those from GEMINI.md files and explicit memory entries.&lt;/p&gt;
&lt;h3&gt;Refreshing Context&lt;/h3&gt;
&lt;p&gt;If you update GEMINI.md files outside of the current session, use:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/memory refresh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This reloads all context files without restarting the CLI.&lt;/p&gt;
&lt;h3&gt;Memory Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use memory for facts that are true across projects (your personal conventions)&lt;/li&gt;
&lt;li&gt;Use GEMINI.md for project-specific context&lt;/li&gt;
&lt;li&gt;Keep memories concise: &amp;quot;Use Ruff for Python linting&amp;quot; rather than a paragraph explaining why&lt;/li&gt;
&lt;li&gt;Review memories periodically with &lt;code&gt;/memory show&lt;/code&gt; and remove outdated entries&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Direct Context Injection with @&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;@&lt;/code&gt; command lets you inject specific files or directories directly into a prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@src/models/user.ts How should I add a preferences field to this model?
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;@src/routes/ Review all route handlers for consistent error handling
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the most direct way to give Gemini CLI context about specific files. Unlike other tools that require uploads, the @ command reads from your local file system in real time.&lt;/p&gt;
&lt;h3&gt;When to Use @&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;When your question relates to specific files that Gemini CLI might not automatically discover&lt;/li&gt;
&lt;li&gt;When you want to ensure the agent reads the latest version of a file&lt;/li&gt;
&lt;li&gt;When you want to focus the agent on a particular section of the codebase&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Gemini CLI supports MCP through its &lt;code&gt;settings.json&lt;/code&gt; configuration. MCP servers extend the CLI&apos;s capabilities by connecting it to external tools and data sources.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;MCP servers are configured in &lt;code&gt;settings.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;github&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-github&amp;quot;]
    },
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;httpUrl&amp;quot;: &amp;quot;http://localhost:3001/mcp&amp;quot;
    },
    &amp;quot;custom-tool&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;python&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;./scripts/my-mcp-server.py&amp;quot;],
      &amp;quot;env&amp;quot;: {
        &amp;quot;API_KEY&amp;quot;: &amp;quot;${MY_API_KEY}&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note the environment variable expansion (&lt;code&gt;${MY_API_KEY}&lt;/code&gt;), which lets you keep credentials out of configuration files.&lt;/p&gt;
&lt;h3&gt;Transport Options&lt;/h3&gt;
&lt;p&gt;Gemini CLI supports three MCP transport mechanisms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;stdio:&lt;/strong&gt; The server runs as a local process (most common for development)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSE (Server-Sent Events):&lt;/strong&gt; For remote servers using the &lt;code&gt;url&lt;/code&gt; property&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HTTP Streaming:&lt;/strong&gt; For modern HTTP-based servers using the &lt;code&gt;httpUrl&lt;/code&gt; property&lt;/li&gt;
&lt;/ul&gt;
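&lt;p&gt;For example, a remote SSE server is declared with the &lt;code&gt;url&lt;/code&gt; property instead of a local command. A minimal sketch, where the server name and address are placeholders for your own deployment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;remote-docs&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://mcp.example.com/sse&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;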
&lt;h3&gt;MCP Prompts as Slash Commands&lt;/h3&gt;
&lt;p&gt;MCP servers can expose predefined prompts as slash commands. If a connected server exposes a prompt named &amp;quot;analyze-performance,&amp;quot; you can invoke it with &lt;code&gt;/analyze-performance&lt;/code&gt; directly in the CLI.&lt;/p&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use MCP for:&lt;/strong&gt; Database access, GitHub integration, browser automation, accessing internal APIs, connecting to project management tools&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skip MCP when:&lt;/strong&gt; The task is code-only and the files are already on your local system. Gemini CLI can read files and run terminal commands directly without MCP.&lt;/p&gt;
&lt;h2&gt;Dynamic Shell Context&lt;/h2&gt;
&lt;p&gt;One of Gemini CLI&apos;s unique strengths is its ability to execute shell commands to gather real-time context. This means the agent can check the actual state of your system rather than relying on static descriptions.&lt;/p&gt;
&lt;h3&gt;Practical Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Check current Git state:&lt;/strong&gt; The agent can run &lt;code&gt;git status&lt;/code&gt; or &lt;code&gt;git log&lt;/code&gt; to understand what has changed recently, which branch you are on, and what commits are pending&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inspect running services:&lt;/strong&gt; Commands like &lt;code&gt;docker ps&lt;/code&gt; or &lt;code&gt;kubectl get pods&lt;/code&gt; give the agent visibility into your running infrastructure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read live configuration:&lt;/strong&gt; The agent can check environment variables, read &lt;code&gt;.env&lt;/code&gt; files, or inspect running process configurations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify test results:&lt;/strong&gt; Running your test suite and analyzing the output gives the agent concrete data about what is passing and what is failing&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This dynamic context is especially valuable for debugging workflows, where the agent needs to understand both the code and the runtime environment.&lt;/p&gt;
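&lt;p&gt;A debugging prompt can ask the agent to gather this runtime context itself before proposing fixes. An illustrative example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Run git status and the last five git log entries, then run the test suite.
Summarize what changed recently and which tests are failing.
Do not make any changes yet.
&lt;/code&gt;&lt;/pre&gt;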
&lt;h2&gt;Automatic Codebase Exploration&lt;/h2&gt;
&lt;p&gt;Gemini CLI automatically explores your project structure using tools that respect &lt;code&gt;.gitignore&lt;/code&gt; patterns. It will not waste context on &lt;code&gt;node_modules/&lt;/code&gt;, &lt;code&gt;__pycache__/&lt;/code&gt;, or build output. It also detects project types from configuration files (for example, finding &lt;code&gt;package.json&lt;/code&gt; tells it this is a Node.js project).&lt;/p&gt;
&lt;p&gt;This automatic exploration means you can ask broad questions like &amp;quot;What database does this project use?&amp;quot; and the agent will find the answer by scanning relevant configuration files. However, GEMINI.md files significantly improve results by providing context that cannot be inferred from code alone: team decisions, architectural rationale, and development philosophy.&lt;/p&gt;
&lt;h2&gt;Choosing Gemini CLI vs. Other Terminal Agents&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choose Gemini CLI over Claude Code when:&lt;/strong&gt; You prefer Google&apos;s Gemini models, need the hierarchical GEMINI.md system, or want MCP prompts exposed as slash commands.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Gemini CLI over OpenCode when:&lt;/strong&gt; You want a simpler, more focused tool without OpenCode&apos;s TUI interface, or you are already invested in the Google ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Gemini CLI over Codex CLI when:&lt;/strong&gt; You want an open-source tool you can inspect and modify, or you prefer interactive terminal sessions over Codex&apos;s sandbox model.&lt;/p&gt;
&lt;h2&gt;Thinking About the Right Level of Context&lt;/h2&gt;
&lt;h3&gt;For Quick Questions&lt;/h3&gt;
&lt;p&gt;Just ask. Gemini CLI can explore your codebase on its own:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;What database ORM does this project use?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The CLI will scan your project files, find the relevant configuration, and answer.&lt;/p&gt;
&lt;h3&gt;For Targeted Changes&lt;/h3&gt;
&lt;p&gt;Provide file references and constraints:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@src/services/auth.ts Add rate limiting to the login function. 
Use express-rate-limit with a 100-request-per-minute window.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;For Large Features&lt;/h3&gt;
&lt;p&gt;Invest in GEMINI.md, set up relevant MCP servers, and use the multi-step approach: plan first, then implement.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;Markdown Is Native&lt;/h3&gt;
&lt;p&gt;GEMINI.md files, memory entries, and context documents should all be Markdown. The format is native to Gemini CLI&apos;s context system.&lt;/p&gt;
&lt;h3&gt;PDFs Need Conversion&lt;/h3&gt;
&lt;p&gt;Gemini CLI primarily works with text-based formats. If you have reference material in PDF form, extract the relevant sections into Markdown files and place them in your project directory. This makes them accessible via @ references and GEMINI.md imports.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Context-Aware Shell Script&lt;/h3&gt;
&lt;p&gt;Create shell scripts that set up project context before launching Gemini CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;#!/bin/bash
# Start Gemini CLI with project-specific context
cd ~/projects/my-api
export DB_URL=&amp;quot;postgresql://dev@localhost:5433/mydb&amp;quot;
gemini
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures the CLI starts in the right directory with the right environment variables, reducing context-switching overhead.&lt;/p&gt;
&lt;h3&gt;The Exploration-First Pattern&lt;/h3&gt;
&lt;p&gt;Before starting a new feature:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Analyze the current authentication system.
Describe the flow from login to token validation.
Do not make any changes.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Review the analysis, correct any misunderstandings, and then proceed with the implementation task.&lt;/p&gt;
&lt;h3&gt;The Automated Context Generation Pattern&lt;/h3&gt;
&lt;p&gt;Use the &lt;code&gt;/init&lt;/code&gt; command periodically (or a custom script) to regenerate your GEMINI.md file as the project evolves. This keeps the context file synchronized with the actual state of the codebase.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No GEMINI.md.&lt;/strong&gt; Without it, the CLI starts with no project context. It can still explore your codebase, but it will make assumptions that may not match your conventions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stale GEMINI.md.&lt;/strong&gt; A GEMINI.md that references frameworks or patterns you no longer use creates confusion. Update it when you make significant changes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overloading memory.&lt;/strong&gt; Memory is for brief, stable facts. Do not try to store entire documents as memory entries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Adding unnecessary MCP servers.&lt;/strong&gt; Each connected server adds tools that the CLI must evaluate. Only connect servers you actively use.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using @ for targeted questions.&lt;/strong&gt; Pointing Gemini CLI at specific files with @ produces more focused results than letting it search the entire project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring &lt;code&gt;/init&lt;/code&gt;.&lt;/strong&gt; For new projects, &lt;code&gt;/init&lt;/code&gt; generates a solid starting GEMINI.md in seconds. Review and refine it rather than writing from scratch.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Forgetting to refresh after external edits.&lt;/strong&gt; If you edit GEMINI.md files in your text editor, run &lt;code&gt;/memory refresh&lt;/code&gt; so the CLI picks up the changes immediately.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Writing overly long GEMINI.md files.&lt;/strong&gt; GEMINI.md should be focused and scannable. If it exceeds 500 lines, consider splitting it into modular imported files. A concise GEMINI.md with clear sections is more effective than a sprawling document.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI coding agents and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Gemini Web and NotebookLM: A Complete Guide to Google&apos;s AI Knowledge Ecosystem</title><link>https://iceberglakehouse.com/posts/2026-03-context-gemini-web-notebooklm/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-gemini-web-notebooklm/</guid><description>
Google&apos;s AI ecosystem for knowledge work consists of two deeply integrated tools: Gemini (the conversational AI at gemini.google.com) and NotebookLM ...</description><pubDate>Sat, 07 Mar 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Google&apos;s AI ecosystem for knowledge work consists of two deeply integrated tools: Gemini (the conversational AI at gemini.google.com) and NotebookLM (the research-focused assistant at notebooklm.google.com). In early 2026, these two platforms became interoperable, allowing Gemini to access information stored in NotebookLM notebooks. This integration creates something unique in the AI landscape: a persistent knowledge infrastructure where documents you upload once become available across both conversational and research interfaces.&lt;/p&gt;
&lt;p&gt;This guide covers context management strategies for both Gemini Web and NotebookLM, with a focus on how to use them together for maximum effectiveness.&lt;/p&gt;
&lt;h2&gt;Gemini Web: Context Management Fundamentals&lt;/h2&gt;
&lt;h3&gt;The Context Window&lt;/h3&gt;
&lt;p&gt;Gemini supports one of the largest context windows available, with models like Gemini 3 Pro and Gemini 2.5 Pro offering up to 2 million tokens. This is approximately 1.5 million words of input capacity, enough to process entire books, large codebases, or years of financial data in a single conversation.&lt;/p&gt;
&lt;p&gt;The context window includes everything: your system instructions, conversation history, uploaded files, and Gemini&apos;s responses. While 2 million tokens is enormous, strategic context management still matters because relevance, not volume, determines response quality.&lt;/p&gt;
&lt;h3&gt;Custom Instructions&lt;/h3&gt;
&lt;p&gt;Gemini supports custom instructions that shape how it responds across conversations. Access these through Gemini&apos;s settings. Effective custom instructions include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your professional background and expertise level&lt;/li&gt;
&lt;li&gt;Preferred response style (concise vs. detailed, formal vs. conversational)&lt;/li&gt;
&lt;li&gt;Output format preferences (bullet points, structured sections, code formatting)&lt;/li&gt;
&lt;li&gt;Domain-specific terminology or constraints&lt;/li&gt;
&lt;/ul&gt;
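&lt;p&gt;A custom instructions entry covering these points might look like the following (the role and preferences shown are examples to adapt, not recommendations):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;I am a senior backend engineer working mostly in TypeScript and PostgreSQL.
Default to concise, technical answers with code examples.
Use bullet points for comparisons and fenced code blocks for all code.
Assume I know the fundamentals; skip beginner explanations.
&lt;/code&gt;&lt;/pre&gt;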
&lt;h3&gt;Gems: Specialized AI Assistants&lt;/h3&gt;
&lt;p&gt;Gems are custom AI mini-apps within Gemini. You create a Gem by defining its purpose, instructions, and behavior. Unlike custom instructions (which apply globally), each Gem operates with its own specialized context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Gems for repeatable workflows:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &amp;quot;Technical Writer&amp;quot; Gem with your style guide and terminology baked in&lt;/li&gt;
&lt;li&gt;A &amp;quot;Data Analyst&amp;quot; Gem that knows your preferred visualization tools and analysis frameworks&lt;/li&gt;
&lt;li&gt;A &amp;quot;Meeting Prep&amp;quot; Gem that generates agendas and briefing documents in your format&lt;/li&gt;
&lt;li&gt;A &amp;quot;Code Reviewer&amp;quot; Gem that applies your team&apos;s coding standards consistently&lt;/li&gt;
&lt;li&gt;A &amp;quot;Content Editor&amp;quot; Gem that checks for brand voice compliance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To create a Gem, navigate to the Gems section in Gemini, define its instructions, and optionally upload knowledge files. Once created, you can invoke the Gem anytime without re-establishing its context.&lt;/p&gt;
&lt;h3&gt;Building Effective Gems&lt;/h3&gt;
&lt;p&gt;The quality of a Gem depends entirely on the quality of its instructions. Write Gem instructions as if you are onboarding a new team member to a specific role:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Define the Gem&apos;s role&lt;/strong&gt; (&amp;quot;You are a technical documentation editor for a developer tools company&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Specify the audience&lt;/strong&gt; (&amp;quot;Write for senior developers who know the basics but need guidance on advanced topics&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Set quality standards&lt;/strong&gt; (&amp;quot;Every section must include at least one code example, use active voice, and stay under 300 words per subsection&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Include anti-patterns&lt;/strong&gt; (&amp;quot;Never use jargon without defining it first, never assume the reader has used this tool before&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provide examples&lt;/strong&gt; of the desired output style when possible&lt;/li&gt;
&lt;/ol&gt;
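&lt;p&gt;Putting the five steps together, a &amp;quot;Technical Writer&amp;quot; Gem&apos;s instructions might read:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;You are a technical documentation editor for a developer tools company.
Write for senior developers who know the basics but need guidance on
advanced topics. Every section must include at least one code example,
use active voice, and stay under 300 words per subsection.
Never use jargon without defining it first, and never assume the reader
has used this tool before.
&lt;/code&gt;&lt;/pre&gt;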
&lt;h3&gt;Notebooks (Projects)&lt;/h3&gt;
&lt;p&gt;Gemini is rolling out &amp;quot;Notebooks&amp;quot; (an evolution of its Projects feature) that let you group conversations by topic and set per-notebook custom instructions. This mirrors the Project concept in other AI tools: a workspace where context persists across conversations.&lt;/p&gt;
&lt;p&gt;Within a Notebook:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set instructions specific to the topic or project&lt;/li&gt;
&lt;li&gt;Upload files that Gemini can reference in every conversation&lt;/li&gt;
&lt;li&gt;Maintain a collection of related conversations without losing context between them&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;File Uploads&lt;/h3&gt;
&lt;p&gt;Gemini Web supports direct file uploads in conversations:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Type&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Research papers, specifications, reports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Docs, Word files for editing or analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spreadsheets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data analysis, financial modeling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Images&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual context, screenshots, diagrams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Transcription and analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Video&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual content analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Google Workspace Integration&lt;/h3&gt;
&lt;p&gt;A distinctive Gemini feature is its integration with Google Workspace. With &amp;quot;Personal Intelligence&amp;quot; (available in 2026), Gemini can securely access your Gmail, Drive, Docs, and Calendar to provide context-aware responses grounded in your actual work data. This means Gemini can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Search your email history for relevant communications&lt;/li&gt;
&lt;li&gt;Reference documents in your Google Drive&lt;/li&gt;
&lt;li&gt;Check your calendar when you ask about scheduling&lt;/li&gt;
&lt;li&gt;Pull data from your spreadsheets for analysis&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This integration effectively makes your entire Google Workspace a context source, something no other AI platform currently matches.&lt;/p&gt;
&lt;h2&gt;NotebookLM: Deep Research Context Management&lt;/h2&gt;
&lt;p&gt;NotebookLM is purpose-built for research and knowledge work. Its context management is centered around &amp;quot;notebooks,&amp;quot; each of which contains sources (your uploaded documents) and a conversation interface grounded in those sources.&lt;/p&gt;
&lt;h3&gt;How NotebookLM Handles Context&lt;/h3&gt;
&lt;p&gt;Unlike Gemini (which can draw on its entire training data), NotebookLM responses are grounded exclusively in the sources you upload. This is a feature, not a limitation. When you need answers based specifically on your documents (not the model&apos;s general knowledge), NotebookLM provides citation-backed responses that reference specific sections of your sources.&lt;/p&gt;
&lt;h3&gt;Source Types&lt;/h3&gt;
&lt;p&gt;NotebookLM supports a wide range of source types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PDFs:&lt;/strong&gt; Research papers, reports, legal documents&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Docs:&lt;/strong&gt; Your own writing, notes, and drafts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Slides:&lt;/strong&gt; Presentation content&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web URLs:&lt;/strong&gt; Articles, documentation, and reference pages&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;YouTube videos:&lt;/strong&gt; Automatic transcription and analysis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audio files:&lt;/strong&gt; Podcast episodes, interviews, lectures&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Text files:&lt;/strong&gt; Any plaintext content&lt;/li&gt;
&lt;/ul&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; Up to 50 sources per notebook (500,000 words or 200MB per source)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NotebookLM Pro:&lt;/strong&gt; Up to 300 sources per notebook&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Custom Instructions in NotebookLM&lt;/h3&gt;
&lt;p&gt;NotebookLM supports per-notebook custom instructions. You can set:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Response style (&amp;quot;Explain like I am new to this field&amp;quot;)&lt;/li&gt;
&lt;li&gt;Response length preferences&lt;/li&gt;
&lt;li&gt;Tone (academic, conversational, technical)&lt;/li&gt;
&lt;li&gt;Specific focus areas within your sources&lt;/li&gt;
&lt;/ul&gt;
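&lt;p&gt;For instance, a research notebook might combine these settings into instructions like the following (illustrative wording):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Explain like I am new to this field. Keep answers under 200 words,
use a conversational tone, and focus on the methodology sections
of my sources.
&lt;/code&gt;&lt;/pre&gt;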
&lt;h3&gt;Audio Overviews&lt;/h3&gt;
&lt;p&gt;NotebookLM&apos;s Audio Overview feature generates podcast-style discussions of your uploaded sources. This is a unique context consumption approach: instead of reading AI-generated summaries, you listen to a natural conversation about your documents. Audio Overviews are useful for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Getting a high-level understanding of dense material before deep-reading&lt;/li&gt;
&lt;li&gt;Reviewing research while multitasking&lt;/li&gt;
&lt;li&gt;Sharing knowledge with colleagues who prefer audio formats&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Using Gemini and NotebookLM Together&lt;/h2&gt;
&lt;p&gt;The integration between Gemini and NotebookLM is where the real power emerges.&lt;/p&gt;
&lt;h3&gt;The Knowledge Flow&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Upload sources to NotebookLM:&lt;/strong&gt; Research papers, reports, specifications&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Let NotebookLM build a grounded knowledge base:&lt;/strong&gt; Ask questions, generate summaries, create Audio Overviews&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Import that notebook into Gemini:&lt;/strong&gt; Gemini gains access to all your NotebookLM sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Gemini for broader analysis:&lt;/strong&gt; Gemini combines your specific sources with its general knowledge and web search&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This workflow gives you both grounded, citation-backed analysis (NotebookLM) and broader contextual understanding (Gemini) from the same source material.&lt;/p&gt;
&lt;h3&gt;When to Use Each&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Need&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Answers grounded strictly in your documents&lt;/td&gt;
&lt;td&gt;NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Broad research with web search integration&lt;/td&gt;
&lt;td&gt;Gemini Web&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Citation-backed analysis of specific papers&lt;/td&gt;
&lt;td&gt;NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative ideation and brainstorming&lt;/td&gt;
&lt;td&gt;Gemini Web&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio summaries of research material&lt;/td&gt;
&lt;td&gt;NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration with Google Workspace data&lt;/td&gt;
&lt;td&gt;Gemini Web&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Side-by-side comparison of source documents&lt;/td&gt;
&lt;td&gt;NotebookLM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Gems as Auto-Syncing Brains&lt;/h3&gt;
&lt;p&gt;When you create a Gem that is linked to a NotebookLM notebook, the Gem automatically stays in sync with the notebook&apos;s sources. Add a new document to the notebook, and the Gem can reference it immediately. This creates a &amp;quot;specialized brain&amp;quot; that continuously learns from your latest research without requiring you to re-upload files or restate context.&lt;/p&gt;
&lt;h2&gt;External Documents: PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;In NotebookLM&lt;/h3&gt;
&lt;p&gt;NotebookLM works well with PDFs because it extracts and indexes the content for citation-backed retrieval. Since NotebookLM&apos;s primary job is to ground responses in specific documents, PDFs are perfectly suited for this use case.&lt;/p&gt;
&lt;p&gt;However, for your own notes, outlines, and structured reference material, Google Docs or Markdown files (uploaded as text) provide cleaner parsing and are easier to update.&lt;/p&gt;
&lt;h3&gt;In Gemini Web&lt;/h3&gt;
&lt;p&gt;Gemini handles both PDFs and text-based formats well, but the same general rule applies: Markdown and plaintext provide the cleanest AI-parseable context. Use PDFs for published documents you received from others, and Markdown or Google Docs for context you author yourself.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;As of early 2026, Gemini Web and NotebookLM do not support MCP (Model Context Protocol) server connections. MCP support is available in the Gemini CLI, which is covered in a separate guide.&lt;/p&gt;
&lt;p&gt;For web-based Gemini usage, the Google Workspace integration provides similar benefits to MCP for many use cases: live access to your email, documents, spreadsheets, and calendar. If you need connections to non-Google services (databases, third-party APIs), use the Gemini CLI instead.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Research Pipeline&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Collect sources in NotebookLM&lt;/strong&gt; (upload papers, articles, reports)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate an Audio Overview&lt;/strong&gt; for high-level understanding&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ask targeted questions in NotebookLM&lt;/strong&gt; for citation-backed answers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Import the notebook to Gemini&lt;/strong&gt; for broader analysis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Gemini with web search&lt;/strong&gt; to find related work not in your sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Draft your output in Gemini&lt;/strong&gt; using both grounded sources and general knowledge&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Knowledge Base Strategy&lt;/h3&gt;
&lt;p&gt;Use NotebookLM notebooks as persistent knowledge bases for different domains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&amp;quot;Industry Research&amp;quot;&lt;/strong&gt; notebook with market reports and analyst papers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&amp;quot;Technical Reference&amp;quot;&lt;/strong&gt; notebook with API docs and architecture papers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&amp;quot;Competitive Intelligence&amp;quot;&lt;/strong&gt; notebook with competitor materials&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each notebook becomes a specialized resource that you can query independently or combine with Gemini for cross-domain analysis.&lt;/p&gt;
&lt;h3&gt;The Document Synthesis Pattern&lt;/h3&gt;
&lt;p&gt;When you need to synthesize multiple long documents:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upload all documents to a single NotebookLM notebook&lt;/li&gt;
&lt;li&gt;Ask NotebookLM to summarize each document individually&lt;/li&gt;
&lt;li&gt;Ask it to identify common themes across all documents&lt;/li&gt;
&lt;li&gt;Ask it to highlight contradictions or disagreements between documents&lt;/li&gt;
&lt;li&gt;Use the results in Gemini for a final synthesized analysis&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This approach leverages NotebookLM&apos;s grounding capability for accurate summarization and Gemini&apos;s broader intelligence for synthesis.&lt;/p&gt;
&lt;h2&gt;Structuring Context for Gemini and NotebookLM&lt;/h2&gt;
&lt;h3&gt;In Gemini: Lead with Purpose&lt;/h3&gt;
&lt;p&gt;Because Gemini has such a large context window, it is tempting to dump everything in and hope for the best. Resist this. Structure your inputs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;State your goal first&lt;/strong&gt; (&amp;quot;I need a comparison table of three database solutions&amp;quot;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provide the relevant data&lt;/strong&gt; (paste or reference uploaded files)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Specify the output format&lt;/strong&gt; (&amp;quot;Create a markdown table with columns for Feature, Solution A, Solution B, Solution C&amp;quot;)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This pattern works because Gemini prioritizes recent and explicit instructions over ambient context.&lt;/p&gt;
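&lt;p&gt;Applied together, the three steps produce a prompt like this (the databases named and the attached files are placeholders for your own material):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;I need a comparison table of three database solutions.
Use the attached evaluation notes for PostgreSQL, MySQL, and SQLite.
Create a markdown table with columns for Feature, PostgreSQL, MySQL, and SQLite.
&lt;/code&gt;&lt;/pre&gt;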
&lt;h3&gt;In NotebookLM: Trust the Grounding&lt;/h3&gt;
&lt;p&gt;NotebookLM is designed to answer from your sources. You do not need to paste content into the chat because the sources are already indexed. Instead, ask specific questions that require the AI to synthesize across your documents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;Compare how Document A and Document B define the term &apos;data mesh&apos;&amp;quot;&lt;/li&gt;
&lt;li&gt;&amp;quot;What evidence in my sources supports the claim that real-time processing reduces costs?&amp;quot;&lt;/li&gt;
&lt;li&gt;&amp;quot;Identify contradictions between the 2024 and 2025 reports on this topic&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using Gemini when you need citations.&lt;/strong&gt; If you need responses backed by specific sources, use NotebookLM. Gemini&apos;s general knowledge is powerful but cannot provide page-level citations from your documents.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overloading a single NotebookLM notebook.&lt;/strong&gt; While Pro supports 300 sources, having too many unrelated documents in one notebook dilutes the AI&apos;s focus. Create separate notebooks for distinct topics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Audio Overviews.&lt;/strong&gt; Audio Overviews are one of NotebookLM&apos;s most underused features. They provide an efficient way to internalize complex material, especially before you start asking detailed questions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring the Gemini-NotebookLM integration.&lt;/strong&gt; Using these tools in isolation means you miss the most powerful workflow: grounded research in NotebookLM feeding into broader analysis in Gemini.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping custom instructions.&lt;/strong&gt; Both Gemini and NotebookLM support per-workspace custom instructions. Setting these up takes minutes and saves hours of course-correcting the AI.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Gems for repeatable tasks.&lt;/strong&gt; If you find yourself giving Gemini the same instructions repeatedly, create a Gem and save that context permanently.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about context management strategies for AI tools, research workflows, and agentic systems, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Claude Code: A Complete Guide for Developers</title><link>https://iceberglakehouse.com/posts/2026-03-context-claude-code/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-claude-code/</guid><description>
Claude Code is a terminal-native agentic coding assistant that lives in your command line and operates directly on your codebase. Unlike chat-based i...</description><pubDate>Sat, 07 Mar 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude Code is a terminal-native agentic coding assistant that lives in your command line and operates directly on your codebase. Unlike chat-based interfaces where you copy and paste code snippets, Claude Code reads your files, explores your project structure, runs commands, executes tests, and commits changes. Context management in Claude Code is about configuring the agent&apos;s persistent knowledge of your project so it can operate effectively without constant direction.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism in Claude Code, from the foundational CLAUDE.md file to MCP integrations and multi-agent orchestration.&lt;/p&gt;
&lt;h2&gt;How Claude Code Manages Context&lt;/h2&gt;
&lt;p&gt;Claude Code builds its context from several sources, layered from most persistent to most ephemeral:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;CLAUDE.md files&lt;/strong&gt; (permanent project instructions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MEMORY.md&lt;/strong&gt; (automatically maintained session memory)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP server connections&lt;/strong&gt; (live external data)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The codebase itself&lt;/strong&gt; (files, dependencies, project structure)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The current conversation&lt;/strong&gt; (your commands and the agent&apos;s responses)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Command output&lt;/strong&gt; (terminal results, test output, error messages)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The agent combines all of these into a working context that informs how it approaches tasks. The most effective Claude Code users invest time in the persistent layers (CLAUDE.md, MEMORY.md, MCP) so that every conversation starts with a solid foundation.&lt;/p&gt;
&lt;h2&gt;CLAUDE.md: Your Project&apos;s Instruction Manual&lt;/h2&gt;
&lt;p&gt;CLAUDE.md is the primary mechanism for giving Claude Code persistent context about your project. It is a Markdown file that Claude reads at the start of every session.&lt;/p&gt;
&lt;h3&gt;File Locations and Hierarchy&lt;/h3&gt;
&lt;p&gt;Claude Code loads CLAUDE.md files from multiple locations, combining them into a single instruction set:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Location&lt;/th&gt;
&lt;th&gt;Scope&lt;/th&gt;
&lt;th&gt;Use For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Global (all projects)&lt;/td&gt;
&lt;td&gt;Personal preferences, universal standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;./CLAUDE.md&lt;/code&gt; (project root)&lt;/td&gt;
&lt;td&gt;Project-wide&lt;/td&gt;
&lt;td&gt;Architecture, coding standards, testing strategy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;./src/CLAUDE.md&lt;/code&gt; (subdirectory)&lt;/td&gt;
&lt;td&gt;Component-specific&lt;/td&gt;
&lt;td&gt;Module-specific patterns, API conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;More specific files supplement more general ones, and the most specific instruction wins when they conflict. If your global CLAUDE.md says &amp;quot;use 2-space indentation&amp;quot; but your project CLAUDE.md says &amp;quot;use 4-space indentation,&amp;quot; the project-level instruction takes precedence.&lt;/p&gt;
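&lt;p&gt;For example, the global and project layers might look like this (the contents are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# ~/.claude/CLAUDE.md (global: all projects)
- Use 2-space indentation
- Prefer descriptive names over explanatory comments

# ./CLAUDE.md (project root)
- Use 4-space indentation  (overrides the global default)
- Follow the testing strategy described below
&lt;/code&gt;&lt;/pre&gt;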
&lt;h3&gt;What to Include in CLAUDE.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# CLAUDE.md

## Project Overview
This is a Python FastAPI application with a React frontend.
Backend: Python 3.12, FastAPI, SQLAlchemy, PostgreSQL 16
Frontend: TypeScript, React 19, Vite 6, Zustand
Testing: pytest (backend), Vitest (frontend)

## Build and Run Commands
- Backend: `uvicorn app.main:app --reload`
- Frontend: `npm run dev`
- Tests: `pytest` (backend), `npm test` (frontend)
- Lint: `ruff check .` (backend), `npm run lint` (frontend)

## Code Conventions
- Use type hints for all function signatures
- Use Pydantic models for API request/response schemas
- Use async functions for all database operations
- Prefer composition over inheritance
- Keep functions under 30 lines; extract helpers for longer logic

## Testing Requirements
- Every new endpoint needs integration tests
- Every utility function needs unit tests
- Mock external services; never hit real APIs in tests
- Use factories (not fixtures) for test data creation

## Architecture Decisions
- We use the repository pattern for database access
- All business logic lives in the service layer, not in route handlers
- Frontend state is managed exclusively through Zustand stores
- API responses follow the JSON:API specification
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;CLAUDE.md Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Be specific and actionable.&lt;/strong&gt; &amp;quot;Write clean code&amp;quot; is useless. &amp;quot;Functions should have a single responsibility and no side effects&amp;quot; is useful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Include build and test commands.&lt;/strong&gt; Claude Code will run these commands to verify its work. If it does not know your test command, it cannot validate changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document your architecture.&lt;/strong&gt; Tell Claude Code where things live. &amp;quot;Database models are in &lt;code&gt;app/models/&lt;/code&gt;&amp;quot; saves the agent from exploring your entire project structure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use negative constraints.&lt;/strong&gt; &amp;quot;Do not use class-based views&amp;quot; and &amp;quot;Never import directly from internal modules; use the public API&amp;quot; prevent common mistakes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keep it current.&lt;/strong&gt; An outdated CLAUDE.md with references to deprecated patterns causes more harm than having no CLAUDE.md at all.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;MEMORY.md: Automatic Session Memory&lt;/h2&gt;
&lt;p&gt;MEMORY.md is a file that Claude Code creates and maintains automatically to persist important context across sessions. When you share information that Claude determines is worth remembering (project decisions, your preferences, issue resolutions), it writes that information to MEMORY.md.&lt;/p&gt;
&lt;h3&gt;How MEMORY.md Works&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Claude Code creates &lt;code&gt;~/.claude/MEMORY.md&lt;/code&gt; automatically&lt;/li&gt;
&lt;li&gt;During conversations, when you share important context, Claude offers to save it&lt;/li&gt;
&lt;li&gt;In subsequent sessions, Claude reads MEMORY.md before starting work&lt;/li&gt;
&lt;li&gt;You can also manually edit MEMORY.md to add or remove memories&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What Gets Stored&lt;/h3&gt;
&lt;p&gt;Typical MEMORY.md entries include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Project preferences you have stated (&amp;quot;I prefer named exports over default exports&amp;quot;)&lt;/li&gt;
&lt;li&gt;Decisions you have made (&amp;quot;We chose Redis for session storage because of its TTL support&amp;quot;)&lt;/li&gt;
&lt;li&gt;Debugging discoveries (&amp;quot;The auth middleware requires the Authorization header in lowercase&amp;quot;)&lt;/li&gt;
&lt;li&gt;Workflow notes (&amp;quot;Always run migrations before testing database changes&amp;quot;)&lt;/li&gt;
&lt;/ul&gt;
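&lt;p&gt;A MEMORY.md built from entries like these might look as follows (the contents are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# MEMORY.md

## Preferences
- Prefers named exports over default exports

## Decisions
- Redis chosen for session storage because of its TTL support

## Debugging discoveries
- The auth middleware requires the Authorization header in lowercase

## Workflow
- Always run migrations before testing database changes
&lt;/code&gt;&lt;/pre&gt;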
&lt;h3&gt;Managing MEMORY.md&lt;/h3&gt;
&lt;p&gt;Review MEMORY.md periodically. Like any persistent context, stale entries can lead the agent astray. Remove entries that no longer apply and update ones that have changed.&lt;/p&gt;
&lt;p&gt;You can also use the &lt;code&gt;/memory&lt;/code&gt; slash command during a session to view what Claude currently remembers.&lt;/p&gt;
&lt;h2&gt;Slash Commands: Real-Time Context Control&lt;/h2&gt;
&lt;p&gt;Claude Code provides several slash commands for managing context during a session:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/context&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show all active context sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/clear&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Clear conversation history (keeps CLAUDE.md and MEMORY.md)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/agent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Spawn a sub-agent for a specific task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/memory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View and manage session memories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/help&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List available commands&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Using /clear Strategically&lt;/h3&gt;
&lt;p&gt;Long sessions accumulate irrelevant context that can degrade Claude Code&apos;s focus. Use &lt;code&gt;/clear&lt;/code&gt; when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are switching to a different part of the codebase&lt;/li&gt;
&lt;li&gt;The conversation has gotten long and the agent seems confused&lt;/li&gt;
&lt;li&gt;You want to start a focused task without the baggage of previous exchanges&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that &lt;code&gt;/clear&lt;/code&gt; preserves your CLAUDE.md and MEMORY.md context. Only the conversation history is reset.&lt;/p&gt;
&lt;h3&gt;Using /agent for Sub-Tasks&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;/agent&lt;/code&gt; command spawns a sub-agent that operates independently with its own context. This is useful for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Exploring a part of the codebase without polluting your main conversation&lt;/li&gt;
&lt;li&gt;Running a time-consuming task (like a full test suite analysis) in parallel&lt;/li&gt;
&lt;li&gt;Dividing a large feature into independent pieces&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;Claude Code supports MCP through the &lt;code&gt;claude mcp&lt;/code&gt; command, allowing you to connect external tools and data sources.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Add a database MCP server
claude mcp add postgres -- npx @modelcontextprotocol/server-postgres

# Add a filesystem MCP server
claude mcp add files -- npx @modelcontextprotocol/server-filesystem /path/to/project

# List active MCP servers
claude mcp list

# Remove an MCP server
claude mcp remove postgres
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Practical MCP Use Cases for Developers&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Development database:&lt;/strong&gt; Let Claude Code query your dev database to understand schema, check data state, and verify migrations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Browser testing:&lt;/strong&gt; Connect a Playwright MCP server so Claude Code can verify frontend changes by interacting with a running application.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Git hosting:&lt;/strong&gt; Connect a GitHub or GitLab MCP server for creating pull requests, checking CI status, and reviewing code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Documentation systems:&lt;/strong&gt; Access internal docs or wikis that provide context not in the codebase.&lt;/p&gt;
&lt;h3&gt;When to Use MCP vs. Direct Commands&lt;/h3&gt;
&lt;p&gt;Claude Code can already run terminal commands. If you just need to see &lt;code&gt;git log&lt;/code&gt; or &lt;code&gt;psql -c &amp;quot;SELECT * FROM users&amp;quot;&lt;/code&gt;, Claude Code can run those directly. MCP is more useful when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The interaction is structured and repeatable (not ad-hoc commands)&lt;/li&gt;
&lt;li&gt;You want Claude to have persistent access to a service across the entire session&lt;/li&gt;
&lt;li&gt;The MCP server provides tools that are safer or more convenient than raw commands&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;External Documents: When to Use PDFs vs. Markdown&lt;/h2&gt;
&lt;h3&gt;For Codebase Context: Always Markdown&lt;/h3&gt;
&lt;p&gt;CLAUDE.md, MEMORY.md, and any reference documents you create for Claude Code should be Markdown. The format is native to Claude Code&apos;s context system, version-controllable, and parses without ambiguity.&lt;/p&gt;
&lt;h3&gt;For External Specifications: Convert When Possible&lt;/h3&gt;
&lt;p&gt;If you have API specifications, design documents, or architecture diagrams in PDF form, consider extracting the relevant sections into Markdown and placing them in your repository. This way Claude Code can access them through normal file reading rather than requiring file upload.&lt;/p&gt;
&lt;h3&gt;For One-Off References&lt;/h3&gt;
&lt;p&gt;If you need Claude Code to reference a specific document during a session, paste the relevant content directly into the conversation. Claude Code&apos;s context window is large enough to handle substantial text inclusions.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Test-Driven Context Pattern&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Write failing tests that describe the behavior you want&lt;/li&gt;
&lt;li&gt;Tell Claude Code: &amp;quot;Make these tests pass&amp;quot;&lt;/li&gt;
&lt;li&gt;The tests themselves become the context for the implementation&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is one of the most effective strategies because tests are unambiguous specifications. Claude Code does not need to interpret your prose when it has concrete pass/fail criteria.&lt;/p&gt;
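&lt;p&gt;As a sketch, the tests you hand to the agent might look like the following. The &lt;code&gt;slugify&lt;/code&gt; helper and its expected behavior are hypothetical, and the implementation shown is only the kind of result the agent would produce to satisfy them:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Hypothetical spec-first tests for a slugify helper.
# Written before any implementation exists, the tests themselves become
# the context: the prompt to Claude Code is simply "make these tests pass".
import re

def slugify(title):
    # A minimal implementation of the kind the agent might produce
    # once the tests below pin down the behavior.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_strips_punctuation():
    assert slugify("What's New in 2026?") == "what-s-new-in-2026"

def test_collapses_repeated_separators():
    assert slugify("a  --  b") == "a-b"
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the tests are executable, the agent can run pytest after each change and verify its own work without further prompting.&lt;/p&gt;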
&lt;h3&gt;The Progressive Codebase Understanding Pattern&lt;/h3&gt;
&lt;p&gt;When onboarding Claude Code to a new project:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with CLAUDE.md covering the basics (stack, structure, commands)&lt;/li&gt;
&lt;li&gt;Ask Claude to explore the codebase and describe what it finds&lt;/li&gt;
&lt;li&gt;Correct any misunderstandings and add clarifications to CLAUDE.md&lt;/li&gt;
&lt;li&gt;Gradually delegate more complex tasks as the context matures&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This iterative approach builds a robust CLAUDE.md faster than trying to write everything from scratch.&lt;/p&gt;
&lt;h3&gt;The Multi-Agent Feature Pattern&lt;/h3&gt;
&lt;p&gt;For large features with independent components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;/agent&lt;/code&gt; to spawn a sub-agent for each component&lt;/li&gt;
&lt;li&gt;Main agent: coordinates the overall architecture&lt;/li&gt;
&lt;li&gt;Sub-agent 1: implements the database layer&lt;/li&gt;
&lt;li&gt;Sub-agent 2: implements the API endpoints&lt;/li&gt;
&lt;li&gt;Sub-agent 3: implements the frontend components&lt;/li&gt;
&lt;li&gt;Main agent: integrates the results and runs full tests&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each sub-agent operates with focused context, producing better results than one agent trying to build everything sequentially.&lt;/p&gt;
&lt;h3&gt;The Code Review Pattern&lt;/h3&gt;
&lt;p&gt;Use Claude Code as a reviewer before submitting your own PRs:&lt;/p&gt;
&lt;p&gt;&amp;quot;Review the changes in the current branch compared to main. Check for: security issues, performance problems, missing error handling, test coverage gaps, and style guide violations from CLAUDE.md.&amp;quot;&lt;/p&gt;
&lt;p&gt;The persistent CLAUDE.md context means the review applies your project&apos;s specific standards, not generic best practices.&lt;/p&gt;
&lt;h2&gt;When to Choose Claude Code Over Other Tools&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choose Claude Code over Claude Web/Desktop when:&lt;/strong&gt; Your task is code-centric and benefits from direct file system access, terminal command execution, and test running.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Claude Code over OpenAI Codex when:&lt;/strong&gt; You prefer a terminal-native interactive workflow over Codex&apos;s sandbox-and-PR model, or your project uses the Claude model family.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose Claude Code over Cursor or Windsurf when:&lt;/strong&gt; You want a lightweight terminal agent without the overhead of a full IDE, or you work primarily in the terminal.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No CLAUDE.md.&lt;/strong&gt; Claude Code still works without one, but it will make assumptions about your project that may not match reality. Ten minutes spent writing CLAUDE.md saves hours of corrections.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stale CLAUDE.md.&lt;/strong&gt; A CLAUDE.md that references a framework you migrated away from six months ago actively misleads the agent. Keep it current.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using /clear.&lt;/strong&gt; Long sessions accumulate noise. Clear the conversation when switching tasks or when the agent seems to be losing focus.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-relying on MCP.&lt;/strong&gt; If Claude Code can accomplish a task through direct file access and terminal commands, adding an MCP server is unnecessary overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring MEMORY.md.&lt;/strong&gt; Review it periodically. Claude Code&apos;s auto-generated memories are usually accurate, but occasionally they capture outdated or incorrect information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Micro-managing the agent.&lt;/strong&gt; Claude Code is designed for autonomous task execution. Give it a clear objective, ensure the context is correct, and let it work. Interrupting with constant corrections breaks the agent&apos;s flow.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI coding agents and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Claude CoWork: A Complete Guide for Knowledge Workers</title><link>https://iceberglakehouse.com/posts/2026-03-context-claude-cowork/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-claude-cowork/</guid><description>
Claude CoWork represents a fundamentally different approach to AI context management. Unlike chat interfaces where you send messages and receive resp...</description><pubDate>Sat, 07 Mar 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude CoWork represents a fundamentally different approach to AI context management. Unlike chat interfaces where you send messages and receive responses, CoWork is an autonomous agent that works on your local machine, reads and writes files directly, and executes multi-step tasks with minimal supervision. For knowledge workers who spend their days in documents, spreadsheets, and presentations, CoWork replaces the constant back-and-forth of copy-paste workflows with direct delegation.&lt;/p&gt;
&lt;p&gt;This guide covers how to manage context effectively in CoWork, from setting up folder-level instructions to creating reusable workflows that run on schedule.&lt;/p&gt;
&lt;h2&gt;How CoWork Differs from Other Claude Interfaces&lt;/h2&gt;
&lt;p&gt;CoWork runs as part of the Claude Desktop application but operates in a distinct mode. The differences matter for context management:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Claude Web/Desktop Chat&lt;/th&gt;
&lt;th&gt;Claude CoWork&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interaction model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conversational (you send, it responds)&lt;/td&gt;
&lt;td&gt;Autonomous (you delegate, it executes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upload or MCP server&lt;/td&gt;
&lt;td&gt;Direct local read/write&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output location&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In the chat window&lt;/td&gt;
&lt;td&gt;On your file system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task duration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minutes (conversational)&lt;/td&gt;
&lt;td&gt;Minutes to hours (autonomous)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scheduling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual only&lt;/td&gt;
&lt;td&gt;Scheduled or on-demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sub-agents&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (parallel task decomposition)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Because CoWork operates autonomously on your local files, context management is less about what you say in a conversation and more about how you structure your file system, instructions, and task definitions.&lt;/p&gt;
&lt;h2&gt;Thinking About Context for Autonomous Tasks&lt;/h2&gt;
&lt;p&gt;When delegating to CoWork, the context equation changes. In a chat, you can course-correct in real time. With CoWork, you define the context upfront and the agent executes on its own. This means your context needs to be more complete and more explicit than in conversational interfaces.&lt;/p&gt;
&lt;h3&gt;Before Delegating, Ask:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Does the task have a clear, verifiable end state?&lt;/strong&gt; &amp;quot;Organize these files by date&amp;quot; is clear. &amp;quot;Make these files better&amp;quot; is not.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can I describe the success criteria in writing?&lt;/strong&gt; If you cannot articulate what &amp;quot;done&amp;quot; looks like, CoWork will struggle too.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does CoWork have access to everything it needs?&lt;/strong&gt; Files, folders, reference material, and formatting instructions should all be accessible before you start.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Delegation Spectrum&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Simple delegation (minimal context):&lt;/strong&gt; &amp;quot;Create a summary of every PDF in the /reports folder and save it as summary.md&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Moderate delegation:&lt;/strong&gt; &amp;quot;Generate a weekly status report using the data in /projects/metrics.csv. Follow the format in /templates/weekly-report.md. Save the output to /reports/week-12-report.md&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Complex delegation:&lt;/strong&gt; &amp;quot;Research the competitive landscape for product X by reading the documents in /research/competitors/. Create a presentation in PowerPoint format that covers market positioning, pricing comparison, and feature gaps. Use the company brand guidelines in /brand/style-guide.pdf for formatting.&amp;quot;&lt;/p&gt;
&lt;p&gt;Each level requires progressively more context, but all of it is provided through file access and instructions rather than conversation.&lt;/p&gt;
&lt;h2&gt;Global and Folder Instructions&lt;/h2&gt;
&lt;p&gt;CoWork uses a layered instruction system that lets you set context at different scopes.&lt;/p&gt;
&lt;h3&gt;Global Instructions&lt;/h3&gt;
&lt;p&gt;Global instructions apply across all CoWork tasks regardless of which folder or project you are working in. Set these for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your preferred writing style and tone&lt;/li&gt;
&lt;li&gt;Output format preferences (bullet points vs. prose, heading structure)&lt;/li&gt;
&lt;li&gt;General constraints (&amp;quot;Always use metric units,&amp;quot; &amp;quot;Write in American English&amp;quot;)&lt;/li&gt;
&lt;li&gt;Your role and expertise level&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These function similarly to Custom Instructions in ChatGPT but are specific to CoWork&apos;s autonomous execution mode.&lt;/p&gt;
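&lt;p&gt;A global instruction set covering these categories might read like this (the role and defaults are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Role
I am a product marketing manager. Assume business context and
briefly explain any technical terms you use.

## Style
- Write in American English with a professional, plain tone
- Prefer short paragraphs over long bullet lists
- Always use metric units

## Output defaults
- Produce Markdown unless a template specifies another format
- Name output files in kebab-case with an ISO date prefix,
  e.g. 2026-03-07-status-report.md
&lt;/code&gt;&lt;/pre&gt;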
&lt;h3&gt;Folder Instructions&lt;/h3&gt;
&lt;p&gt;Folder-level instructions apply when CoWork operates within a specific directory. This is where context management gets powerful. You can create different instruction sets for different projects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/work/project-alpha/&lt;/code&gt; might have instructions about project-specific terminology and formatting&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/work/blog-drafts/&lt;/code&gt; might have instructions about your blog style guide and target audience&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/work/financial-reports/&lt;/code&gt; might have instructions about compliance requirements and number formatting&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Folder instructions override global instructions when they conflict, giving you precise control over CoWork&apos;s behavior in each context.&lt;/p&gt;
&lt;h3&gt;Writing Effective Instructions&lt;/h3&gt;
&lt;p&gt;Focus on what CoWork needs to know to complete tasks autonomously:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Project Context
This folder contains marketing materials for Product X.
Target audience: enterprise IT decision-makers.
Tone: professional, authoritative, not salesy.

## File Organization
- /drafts/ contains work-in-progress documents
- /final/ contains approved, publication-ready content
- /assets/ contains images, charts, and data files
- /templates/ contains formatting templates

## Quality Standards
- All claims must be supported by data from the /assets/ folder
- Final documents must follow the template in /templates/standard.docx
- Run a readability check: target Flesch-Kincaid grade 10-12
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;MCP Server Integration&lt;/h2&gt;
&lt;p&gt;CoWork supports MCP (Model Context Protocol) through the Claude Desktop application&apos;s MCP configuration. MCP servers expand what CoWork can access beyond the local file system.&lt;/p&gt;
&lt;h3&gt;Useful MCP Servers for Knowledge Workers&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Google Drive or OneDrive:&lt;/strong&gt; Access cloud-stored documents without downloading them first&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Notion or Confluence:&lt;/strong&gt; Read from and write to your team&apos;s knowledge base&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Slack:&lt;/strong&gt; Pull conversation context or post updates about completed tasks&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Calendar:&lt;/strong&gt; Check scheduling context when preparing meeting materials&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Email:&lt;/strong&gt; Draft responses based on incoming email content&lt;/p&gt;
&lt;h3&gt;When MCP Adds Value for CoWork&lt;/h3&gt;
&lt;p&gt;MCP is most valuable when CoWork needs information from systems outside your local file system. If you are creating a report that combines local data with information from your company wiki, an MCP server for that wiki lets CoWork access both sources in a single task.&lt;/p&gt;
&lt;p&gt;However, for purely local tasks (organizing files, generating documents from local data, processing spreadsheets), MCP adds unnecessary complexity. If the data is already on your machine, direct file access is simpler and faster.&lt;/p&gt;
&lt;h2&gt;Scheduled Tasks: Context That Runs Automatically&lt;/h2&gt;
&lt;p&gt;One of CoWork&apos;s distinctive features is task scheduling. You can define tasks that run at specific intervals (daily, weekly, monthly), and CoWork executes them with the same context every time.&lt;/p&gt;
&lt;h3&gt;Use Cases for Scheduled Tasks&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Weekly report generation:&lt;/strong&gt; Compile data from multiple sources into a formatted report every Monday&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Daily email drafts:&lt;/strong&gt; Prepare responses to routine communications based on templates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monthly file organization:&lt;/strong&gt; Sort and archive documents that have accumulated in download or inbox folders&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data processing:&lt;/strong&gt; Transform incoming CSV exports into formatted spreadsheets at regular intervals&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Context for Scheduled Tasks&lt;/h3&gt;
&lt;p&gt;Scheduled tasks need to be fully self-contained. The context must include:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Where to find inputs&lt;/strong&gt; (file paths, folders to scan)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What to do with them&lt;/strong&gt; (the processing logic)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Where to put outputs&lt;/strong&gt; (destination paths)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What quality checks to apply&lt;/strong&gt; (validation rules)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What to do when something unexpected happens&lt;/strong&gt; (error handling)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because you are not present during execution, the instructions must anticipate edge cases. For example: &amp;quot;If no new files are found in /inbox/, skip processing and do not create an empty report.&amp;quot;&lt;/p&gt;
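&lt;p&gt;Putting the five elements together, a fully self-contained scheduled-task definition might read like this (the schedule, paths, and rules are illustrative, not CoWork-required syntax):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Every Monday at 08:00:
1. Scan /inbox/ for CSV files added since the last run (inputs).
2. Merge them and compute weekly totals (processing logic).
3. Save the result to /reports/weekly-summary.xlsx (output).
4. Flag any row with missing values on a &amp;quot;Needs Review&amp;quot; sheet (quality check).
5. If no new files are found, skip processing and do not create
   an empty report (error handling).
&lt;/code&gt;&lt;/pre&gt;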
&lt;h2&gt;Sub-Agent Delegation&lt;/h2&gt;
&lt;p&gt;CoWork can decompose complex tasks into subtasks and execute them in parallel using sub-agents. This is particularly useful for tasks that involve independent workstreams.&lt;/p&gt;
&lt;h3&gt;How Sub-Agents Improve Context Management&lt;/h3&gt;
&lt;p&gt;Instead of providing one massive context for a complex task, CoWork breaks it into smaller, focused contexts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sub-agent 1:&lt;/strong&gt; &amp;quot;Summarize the financial data in /data/q3-financials.csv&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sub-agent 2:&lt;/strong&gt; &amp;quot;Extract key quotes from the customer interviews in /research/interviews/&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sub-agent 3:&lt;/strong&gt; &amp;quot;Create a chart comparing year-over-year growth using the data in /data/growth.csv&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each sub-agent gets a focused context, which typically produces better results than one agent trying to handle everything.&lt;/p&gt;
&lt;h3&gt;Monitoring Sub-Agent Progress&lt;/h3&gt;
&lt;p&gt;CoWork surfaces its reasoning and progress as it works. You can observe the plan, see which sub-agents are active, and intervene if something goes off track. This transparency is itself a context management feature: it lets you verify that the agent&apos;s understanding matches your intent before the task completes.&lt;/p&gt;
&lt;h2&gt;Working with External Documents&lt;/h2&gt;
&lt;h3&gt;PDFs&lt;/h3&gt;
&lt;p&gt;CoWork can read PDFs directly from your file system. Use PDFs for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Published specifications and standards&lt;/li&gt;
&lt;li&gt;Research papers and reports from external sources&lt;/li&gt;
&lt;li&gt;Contracts, legal documents, or compliance materials&lt;/li&gt;
&lt;li&gt;Documents you received from others in PDF format&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Markdown Files&lt;/h3&gt;
&lt;p&gt;CoWork excels with Markdown because the structure is unambiguous. Use Markdown for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your own notes, outlines, and instructions&lt;/li&gt;
&lt;li&gt;Style guides and formatting templates&lt;/li&gt;
&lt;li&gt;Context documents you create specifically for CoWork&lt;/li&gt;
&lt;li&gt;Any document you plan to update frequently&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Hybrid Strategy&lt;/h3&gt;
&lt;p&gt;Keep critical reference material as Markdown in well-organized project folders. Use PDFs for external documents you cannot control. This gives CoWork the cleanest possible context for the documents you author and reasonable access to everything else.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Template-Driven Workflow&lt;/h3&gt;
&lt;p&gt;Create a template folder with examples of your desired output format. In your folder instructions, reference these templates. CoWork will pattern-match against them when generating new content.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/project/
  /templates/
    blog-post-template.md
    report-template.md
    email-template.md
  /instructions.md (folder instructions referencing templates)
  /output/
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This approach gives CoWork concrete examples of &amp;quot;what good looks like&amp;quot; for every type of output it might produce.&lt;/p&gt;
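&lt;p&gt;For instance, the &lt;code&gt;instructions.md&lt;/code&gt; in the layout above might tie each template to a specific kind of output (the wording here is illustrative; the file names follow the example layout):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;When drafting a blog post, follow /templates/blog-post-template.md.
When producing a report, follow /templates/report-template.md.
When drafting an email, follow /templates/email-template.md.
Match the headings, tone, and approximate length of each template.
Save all generated files to /output/.
&lt;/code&gt;&lt;/pre&gt;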
&lt;h3&gt;The Progressive Delegation Pattern&lt;/h3&gt;
&lt;p&gt;Start with simple tasks to build confidence in CoWork&apos;s understanding of your context:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Week 1:&lt;/strong&gt; File organization and simple summaries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 2:&lt;/strong&gt; Document generation from templates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 3:&lt;/strong&gt; Multi-source research and synthesis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Week 4:&lt;/strong&gt; Complex deliverables with scheduled execution&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each phase lets you refine your instructions based on how CoWork interprets them.&lt;/p&gt;
&lt;h3&gt;The Quality Gate Pattern&lt;/h3&gt;
&lt;p&gt;For high-stakes outputs, set up a three-stage workflow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Stage 1:&lt;/strong&gt; CoWork generates a draft and saves it to &lt;code&gt;/drafts/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stage 2:&lt;/strong&gt; You review the draft and provide feedback&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stage 3:&lt;/strong&gt; CoWork revises based on your feedback and saves to &lt;code&gt;/final/&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This pattern combines autonomous execution with human review, giving you the efficiency of delegation without sacrificing quality control.&lt;/p&gt;
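&lt;p&gt;On disk, the pattern maps naturally onto a folder layout like this (the &lt;code&gt;/feedback/&lt;/code&gt; folder is one possible place to leave review notes; it is not required by CoWork):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/project/
  /drafts/     (Stage 1: CoWork saves the initial draft here)
  /feedback/   (Stage 2: your review notes, e.g. feedback.md)
  /final/      (Stage 3: CoWork saves the revised version here)
&lt;/code&gt;&lt;/pre&gt;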
&lt;h2&gt;When to Use CoWork vs. Other Claude Interfaces&lt;/h2&gt;
&lt;p&gt;CoWork is not always the right choice. Here is how it compares for different scenarios:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use CoWork when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The task involves creating, transforming, or organizing files on your local machine&lt;/li&gt;
&lt;li&gt;The work can be defined upfront with clear success criteria&lt;/li&gt;
&lt;li&gt;You want to delegate entirely and come back to a finished result&lt;/li&gt;
&lt;li&gt;The task is repeatable and benefits from scheduling&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Claude Web when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want an interactive conversation to explore ideas or get feedback&lt;/li&gt;
&lt;li&gt;The task is primarily knowledge-based (brainstorming, research questions, analysis)&lt;/li&gt;
&lt;li&gt;You need artifacts like code demos or documents that persist in a conversation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Claude Desktop chat when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need MCP access to external services during an interactive conversation&lt;/li&gt;
&lt;li&gt;You want Computer Use to interact with desktop applications&lt;/li&gt;
&lt;li&gt;You need the conversational interaction model with live external data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Claude Code when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are working on a software codebase&lt;/li&gt;
&lt;li&gt;You need the agent to navigate code, run tests, and make pull requests&lt;/li&gt;
&lt;li&gt;You want terminal-level interaction with coding-specific tools&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vague task definitions.&lt;/strong&gt; &amp;quot;Make these documents better&amp;quot; gives CoWork nothing to work with. Specify what &amp;quot;better&amp;quot; means: more concise, better formatted, restructured for a different audience, updated with new data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping folder instructions.&lt;/strong&gt; Without instructions, CoWork uses only global context and its general training. Folder instructions are what make CoWork effective for your specific workflow.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-scoping tasks.&lt;/strong&gt; A single task that says &amp;quot;create an entire marketing strategy&amp;quot; is too broad. Break it into research, analysis, drafting, and formatting phases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not reviewing outputs.&lt;/strong&gt; CoWork runs autonomously, but that does not mean blindly accepting its output. Always review, especially for scheduled tasks that run without your active oversight.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring the file system.&lt;/strong&gt; CoWork works with files. If your files are disorganized, CoWork&apos;s output will be disorganized. Invest in clean folder structures before delegating.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Underusing sub-agents.&lt;/strong&gt; If a task has independent workstreams, let CoWork decompose it. Trying to force everything into a single linear execution path is slower and produces worse results.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI agents and context management strategies, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Claude Desktop: A Complete Guide to MCP, Computer Use, and Local File Access</title><link>https://iceberglakehouse.com/posts/2026-03-context-claude-desktop/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-claude-desktop/</guid><description>
Claude Desktop takes everything available in Claude Web and adds three capabilities that fundamentally change how you manage context: MCP server conn...</description><pubDate>Sat, 07 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude Desktop takes everything available in Claude Web and adds three capabilities that fundamentally change how you manage context: MCP server connections that link Claude to external tools and data sources, direct local file access that eliminates the upload-download cycle, and Computer Use that lets Claude interact with your desktop environment. These additions make Claude Desktop the right choice when your work requires live data, local file system access, or integration with tools that Claude Web cannot reach.&lt;/p&gt;
&lt;p&gt;This guide explains how to leverage each of Claude Desktop&apos;s context management features, when to use them, and how they complement the Projects, artifacts, and conversation patterns covered in the Claude Web guide.&lt;/p&gt;
&lt;h2&gt;What Claude Desktop Adds Over Claude Web&lt;/h2&gt;
&lt;p&gt;Claude Desktop shares the same core features as Claude Web: Projects with instructions and knowledge files, artifacts, and the same large context windows (up to 1 million tokens). The key additions are:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Claude Web&lt;/th&gt;
&lt;th&gt;Claude Desktop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Projects&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Artifacts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Knowledge files&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP servers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local file access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Upload only&lt;/td&gt;
&lt;td&gt;Direct read/write&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Computer Use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (beta)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If your work is purely knowledge-based (writing, research, analysis), Claude Web is sufficient. Switch to Claude Desktop when you need to connect Claude to your local environment or external services.&lt;/p&gt;
&lt;h2&gt;MCP Servers: The Core Differentiator&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is what makes Claude Desktop a genuinely different tool from the web interface. MCP is an open standard that allows Claude to connect to external services, databases, file systems, and tools through standardized server implementations.&lt;/p&gt;
&lt;h3&gt;How MCP Works in Claude Desktop&lt;/h3&gt;
&lt;p&gt;Claude Desktop acts as the MCP host. You configure MCP servers in the application settings, and Claude gains access to the tools those servers expose. When Claude needs information from an external source, it calls the appropriate MCP tool, receives the results, and incorporates them into its response.&lt;/p&gt;
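&lt;p&gt;Under the hood, each tool invocation is a JSON-RPC exchange defined by the MCP specification&apos;s &lt;code&gt;tools/call&lt;/code&gt; method. The tool name and arguments below are illustrative, but the message shape follows the spec:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;// Request from Claude Desktop (the MCP host) to the server
{
  &amp;quot;jsonrpc&amp;quot;: &amp;quot;2.0&amp;quot;,
  &amp;quot;id&amp;quot;: 1,
  &amp;quot;method&amp;quot;: &amp;quot;tools/call&amp;quot;,
  &amp;quot;params&amp;quot;: {
    &amp;quot;name&amp;quot;: &amp;quot;read_file&amp;quot;,
    &amp;quot;arguments&amp;quot;: { &amp;quot;path&amp;quot;: &amp;quot;notes/outline.md&amp;quot; }
  }
}

// Response: results come back as content blocks Claude folds into context
{
  &amp;quot;jsonrpc&amp;quot;: &amp;quot;2.0&amp;quot;,
  &amp;quot;id&amp;quot;: 1,
  &amp;quot;result&amp;quot;: {
    &amp;quot;content&amp;quot;: [{ &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;, &amp;quot;text&amp;quot;: &amp;quot;...&amp;quot; }]
  }
}
&lt;/code&gt;&lt;/pre&gt;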
&lt;h3&gt;Practical MCP Use Cases&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Database Access:&lt;/strong&gt;
Connect a database MCP server to let Claude query your development database directly. Instead of copying and pasting query results, Claude can run queries itself:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Explore schema to understand your data model&lt;/li&gt;
&lt;li&gt;Run diagnostic queries when debugging&lt;/li&gt;
&lt;li&gt;Verify data after explaining a migration plan&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;File System Access:&lt;/strong&gt;
Connect a filesystem MCP server to give Claude access to specific directories on your machine. This is especially useful for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Browsing project directories without manually uploading each file&lt;/li&gt;
&lt;li&gt;Reading configuration files, logs, or data files&lt;/li&gt;
&lt;li&gt;Writing output files (reports, generated code, processed data) directly to disk&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Version Control:&lt;/strong&gt;
Connect a Git MCP server to let Claude interact with your repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Review recent commits and diffs&lt;/li&gt;
&lt;li&gt;Understand the project&apos;s change history&lt;/li&gt;
&lt;li&gt;Create branches or commits (with your approval)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;API Integration:&lt;/strong&gt;
Connect MCP servers for services your workflow depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jira or Linear for project management context&lt;/li&gt;
&lt;li&gt;Notion or Confluence for internal documentation&lt;/li&gt;
&lt;li&gt;Slack for team communication context&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Setting Up MCP Servers&lt;/h3&gt;
&lt;p&gt;MCP servers are configured in Claude Desktop&apos;s settings as JSON:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;filesystem&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-filesystem&amp;quot;, &amp;quot;/path/to/project&amp;quot;]
    },
    &amp;quot;postgres&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@modelcontextprotocol/server-postgres&amp;quot;],
      &amp;quot;env&amp;quot;: {
        &amp;quot;DATABASE_URL&amp;quot;: &amp;quot;postgresql://user:pass@localhost:5432/mydb&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use MCP when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your task requires data that is not (and should not be) in the conversation or project files&lt;/li&gt;
&lt;li&gt;You need Claude to interact with live systems (databases, APIs, file systems)&lt;/li&gt;
&lt;li&gt;You want Claude to verify its work against real systems&lt;/li&gt;
&lt;li&gt;The data changes frequently and uploading snapshots is impractical&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Do not use MCP when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The task is self-contained (writing, brainstorming, planning)&lt;/li&gt;
&lt;li&gt;You can provide the needed context by pasting or uploading files&lt;/li&gt;
&lt;li&gt;You are working with sensitive production systems (connect to dev/staging only)&lt;/li&gt;
&lt;li&gt;The MCP server adds latency that slows your workflow&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;MCP Security Considerations&lt;/h3&gt;
&lt;p&gt;MCP servers run locally and can access real systems. Best practices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Only connect to development or staging environments, never production&lt;/li&gt;
&lt;li&gt;Use read-only database credentials when possible&lt;/li&gt;
&lt;li&gt;Limit filesystem access to specific directories using the server&apos;s configuration&lt;/li&gt;
&lt;li&gt;Review Claude&apos;s MCP calls before approving actions that modify data&lt;/li&gt;
&lt;li&gt;Use environment variables for credentials rather than hardcoding them in configuration&lt;/li&gt;
&lt;li&gt;Audit your MCP server configurations periodically to remove servers you no longer use&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Choosing the Right MCP Servers&lt;/h3&gt;
&lt;p&gt;Not every project needs every MCP server. Start with the minimum set and add more as your workflow demands:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Solo developers:&lt;/strong&gt; Filesystem + database (if applicable)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frontend developers:&lt;/strong&gt; Filesystem + browser automation (Playwright)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Backend developers:&lt;/strong&gt; Filesystem + database + API testing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full-stack teams:&lt;/strong&gt; Filesystem + database + Git + project management&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Adding servers you do not actively use wastes Claude&apos;s attention. Each connected server expands the list of available tools Claude must evaluate for every request.&lt;/p&gt;
&lt;h2&gt;Computer Use: Desktop-Level Interaction&lt;/h2&gt;
&lt;p&gt;Computer Use (currently in beta) allows Claude to interact with your desktop environment by capturing screenshots, controlling the mouse, and providing keyboard input. This enables Claude to use applications that do not have APIs or MCP servers.&lt;/p&gt;
&lt;h3&gt;When Computer Use Helps with Context&lt;/h3&gt;
&lt;p&gt;Computer Use is a context-gathering tool in addition to being an interaction tool. Sometimes the easiest way to give Claude context is to let it look at what you are looking at:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GUI applications:&lt;/strong&gt; Show Claude your IDE, database tools, or monitoring dashboards&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web applications:&lt;/strong&gt; Let Claude navigate internal tools that require authentication&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Design tools:&lt;/strong&gt; Have Claude reference designs in Figma or Sketch directly&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spreadsheets:&lt;/strong&gt; Let Claude read complex Excel layouts that do not convert cleanly to CSV&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Practical Workflow&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Ask Claude to take a screenshot of the current screen&lt;/li&gt;
&lt;li&gt;Claude analyzes the visual context and incorporates it into the conversation&lt;/li&gt;
&lt;li&gt;You can direct Claude to interact with specific UI elements&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is particularly useful when the relevant context is in a visual format that is difficult to describe in text.&lt;/p&gt;
&lt;h3&gt;Computer Use Limitations&lt;/h3&gt;
&lt;p&gt;Computer Use is slower than MCP-based interactions because it relies on visual processing rather than structured data exchange. Use it as a fallback for tools that lack MCP servers or APIs, not as your primary context mechanism. For anything that can be done through MCP (database queries, file access, API calls), MCP is faster and more reliable.&lt;/p&gt;
&lt;h2&gt;Local File Access: Eliminating the Upload Cycle&lt;/h2&gt;
&lt;p&gt;Claude Desktop can read from and write to your local file system directly (via MCP filesystem server), eliminating the need to manually upload and download files.&lt;/p&gt;
&lt;h3&gt;Advantages Over Web Uploads&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No file size workarounds:&lt;/strong&gt; Access files of any size without upload limits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Live files:&lt;/strong&gt; Claude reads the current version of a file, not a snapshot uploaded hours ago&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write capability:&lt;/strong&gt; Claude can save outputs directly to your file system&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Directory browsing:&lt;/strong&gt; Claude can explore project structures to understand organization&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Best Practices for Local File Access&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scope access narrowly.&lt;/strong&gt; Point the filesystem MCP server at the specific project directory, not your home folder.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use it for exploration.&lt;/strong&gt; Let Claude browse your project structure to build understanding, then focus on specific files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Combine with Projects.&lt;/strong&gt; Use Project instructions to set context and local file access to provide the actual content. This gives Claude both the &amp;quot;how&amp;quot; (instructions) and the &amp;quot;what&amp;quot; (files).&lt;/li&gt;
&lt;/ul&gt;
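<p>&lt;p&gt;A narrowly scoped configuration lists only the directories the server may expose; everything else on disk stays invisible to Claude. The paths below are placeholders for your own project folders:&lt;/p&gt;</p>
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;filesystem&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;npx&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;-y&amp;quot;,
        &amp;quot;@modelcontextprotocol/server-filesystem&amp;quot;,
        &amp;quot;/Users/you/projects/my-app&amp;quot;,
        &amp;quot;/Users/you/projects/my-app-docs&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;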
&lt;h2&gt;External Documents: PDFs and Markdown in Claude Desktop&lt;/h2&gt;
&lt;p&gt;Claude Desktop handles external documents the same way as Claude Web: through Project knowledge files and conversation uploads. However, the addition of local file access changes the strategy.&lt;/p&gt;
&lt;h3&gt;The Hybrid Approach&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;For persistent reference material:&lt;/strong&gt; Upload to Project knowledge files (PDFs or Markdown). These are always available in every conversation within the Project.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For working documents:&lt;/strong&gt; Access via the filesystem MCP server. This way Claude reads the live version of your files without requiring re-uploads when content changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For published specifications:&lt;/strong&gt; Upload PDFs to Project knowledge files. These do not change, so the snapshot approach works fine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For your own documentation:&lt;/strong&gt; Keep it in Markdown files on disk and access via MCP. This way both you and Claude are always working with the latest version.&lt;/p&gt;
&lt;h2&gt;Building an Effective Claude Desktop Workflow&lt;/h2&gt;
&lt;h3&gt;Step 1: Set Up Your Project&lt;/h3&gt;
&lt;p&gt;Create a Claude Desktop Project with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Project Instructions covering your role, style, constraints, and terminology&lt;/li&gt;
&lt;li&gt;Knowledge files for stable reference material (style guides, specifications, standards)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 2: Configure MCP Servers&lt;/h3&gt;
&lt;p&gt;Add MCP servers for the external systems you work with regularly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Filesystem server pointing at your project directory&lt;/li&gt;
&lt;li&gt;Database server connected to your development database (if applicable)&lt;/li&gt;
&lt;li&gt;Any service-specific MCP servers for tools you use daily&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 3: Use the Right Tool for Each Context Need&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context Need&lt;/th&gt;
&lt;th&gt;Best Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Project conventions and style&lt;/td&gt;
&lt;td&gt;Project Instructions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stable reference documents&lt;/td&gt;
&lt;td&gt;Project Knowledge Files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Current code and config files&lt;/td&gt;
&lt;td&gt;Filesystem MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database state and schema&lt;/td&gt;
&lt;td&gt;Database MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visual UI or application state&lt;/td&gt;
&lt;td&gt;Computer Use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-off data or examples&lt;/td&gt;
&lt;td&gt;Paste in conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Step 4: Manage Conversation Threads&lt;/h3&gt;
&lt;p&gt;Even with MCP and local file access, conversation management matters:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start new conversations for new topics (Project context persists)&lt;/li&gt;
&lt;li&gt;Use artifacts for important outputs you want to reference later&lt;/li&gt;
&lt;li&gt;Summarize progress when starting fresh threads&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Live Debugging Pattern&lt;/h3&gt;
&lt;p&gt;When debugging an issue:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Let Claude read the relevant source code via filesystem MCP&lt;/li&gt;
&lt;li&gt;Let Claude query the database to check data state&lt;/li&gt;
&lt;li&gt;Let Claude read log files to identify error patterns&lt;/li&gt;
&lt;li&gt;Have a conversation where Claude synthesizes all of this context into a diagnosis&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This approach gives Claude real-time access to a broader context than you could reasonably paste into a conversation.&lt;/p&gt;
&lt;h3&gt;The Document Generation Pipeline&lt;/h3&gt;
&lt;p&gt;For creating documents that reference live data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Claude reads data via MCP (database stats, API responses, configuration)&lt;/li&gt;
&lt;li&gt;Claude generates the document in a conversation&lt;/li&gt;
&lt;li&gt;Claude writes the output directly to a file on disk&lt;/li&gt;
&lt;li&gt;You review and iterate&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This eliminates the copy-paste cycle between Claude and your file system.&lt;/p&gt;
&lt;h3&gt;The Research and Synthesis Pattern&lt;/h3&gt;
&lt;p&gt;For research projects spanning multiple sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Upload academic papers and specifications as Project knowledge files&lt;/li&gt;
&lt;li&gt;Connect a web-search MCP server for current information&lt;/li&gt;
&lt;li&gt;Use filesystem MCP to read your existing notes and drafts&lt;/li&gt;
&lt;li&gt;Claude synthesizes across all sources, referencing each by name&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Connecting production databases.&lt;/strong&gt; Always use development or staging credentials. Even read-only production access introduces risk.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Over-scoping filesystem access.&lt;/strong&gt; Do not give Claude access to your entire home directory. Point the filesystem server at the specific project folder.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using MCP for everything.&lt;/strong&gt; If you just need Claude to reference a style guide, upload it to Project knowledge files. MCP is for live, changing data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Forgetting Project Instructions.&lt;/strong&gt; MCP and local file access do not replace the need for clear instructions. Claude still needs to know your style, constraints, and output format.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not reviewing MCP actions.&lt;/strong&gt; When Claude performs actions through MCP (writing files, running queries), review them. The protocol provides transparency, but you need to exercise your approval authority.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about context management strategies for AI tools and agentic workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for Claude Web: A Complete Guide to Projects, Artifacts, and Intelligent Context</title><link>https://iceberglakehouse.com/posts/2026-03-context-claude-web/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-claude-web/</guid><description>
Claude&apos;s web interface at claude.ai combines one of the largest context windows in the industry with a structured Project system that makes it genuin...</description><pubDate>Sat, 07 Mar 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude&apos;s web interface at claude.ai combines one of the largest context windows in the industry with a structured Project system that makes it genuinely useful for sustained, complex work. While many AI chat interfaces are limited to one-off conversations, Claude Web is designed for ongoing engagement where the AI accumulates understanding of your work over time. The key to unlocking that potential is managing context deliberately rather than treating each conversation as a blank slate.&lt;/p&gt;
&lt;p&gt;This guide covers every context management strategy available in Claude Web, from basic conversation techniques to advanced Project workflows that make Claude function as a persistent research and development partner.&lt;/p&gt;
&lt;h2&gt;How Claude Web Handles Context&lt;/h2&gt;
&lt;p&gt;Claude Web uses the conversation thread as its primary context unit. Every message you send, every response Claude generates, every file you upload, and every artifact Claude creates stays in the conversation&apos;s context window. Models like Claude Sonnet 4.5 and Opus 4.6 support context windows up to 1 million tokens, which means Claude can hold the equivalent of roughly 750,000 words of conversation, documents, and code in memory at once.&lt;/p&gt;
&lt;p&gt;But a large context window does not eliminate the need for context management. In fact, it makes it more important. With 1 million tokens available, it is easy to fill the window with irrelevant information that dilutes Claude&apos;s attention. The goal is not to maximize how much context you provide, but to maximize how relevant that context is.&lt;/p&gt;
&lt;h3&gt;The Context Priority Hierarchy&lt;/h3&gt;
&lt;p&gt;Claude pays the most attention to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;System instructions&lt;/strong&gt; (Project instructions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The most recent messages&lt;/strong&gt; in the conversation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uploaded files&lt;/strong&gt; referenced in the conversation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Earlier conversation history&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This means that if important context appeared 50 messages ago, Claude may not weight it as heavily as something you said in the last 3 messages. Understanding this hierarchy helps you decide when to re-state important constraints versus trusting that Claude still has them in context.&lt;/p&gt;
&lt;h2&gt;Thinking About the Right Level of Context&lt;/h2&gt;
&lt;h3&gt;Quick Questions (Minimal Context)&lt;/h3&gt;
&lt;p&gt;For factual questions, brainstorming, or one-off tasks, just ask. Claude&apos;s training data provides sufficient background for most general-knowledge queries. Adding unnecessary context (&amp;quot;I am a senior engineer with 15 years of experience, and I have a question about Python lists&amp;quot;) wastes tokens and does not improve the response.&lt;/p&gt;
&lt;h3&gt;Focused Work (Moderate Context)&lt;/h3&gt;
&lt;p&gt;For drafting, editing, code review, or analysis, provide the specific material Claude needs to work with. Paste the code you want reviewed, the text you want edited, or the data you want analyzed. State your requirements clearly: what format you want, what constraints apply, what style to follow.&lt;/p&gt;
&lt;h3&gt;Extended Projects (Comprehensive Context)&lt;/h3&gt;
&lt;p&gt;For ongoing work spanning multiple conversations, use Claude&apos;s Projects feature. Upload reference documents, set Project instructions, and let Claude maintain continuity across sessions. This is where context management becomes a genuine productivity multiplier.&lt;/p&gt;
&lt;h2&gt;Projects: Claude Web&apos;s Most Powerful Context Tool&lt;/h2&gt;
&lt;p&gt;Projects create persistent workspaces that carry context across conversations. When you create a Project, you define instructions and upload knowledge files that apply to every conversation within that Project.&lt;/p&gt;
&lt;h3&gt;Setting Up a Project&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Projects&lt;/strong&gt; in the Claude sidebar&lt;/li&gt;
&lt;li&gt;Create a new Project with a descriptive name&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;Project Instructions&lt;/strong&gt;: Custom system-level instructions that Claude follows in every conversation within this Project&lt;/li&gt;
&lt;li&gt;Upload &lt;strong&gt;Knowledge Files&lt;/strong&gt;: Documents that Claude can reference across all conversations in the Project&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Project Instructions&lt;/h3&gt;
&lt;p&gt;Project instructions function as a system prompt that persists across every conversation in the Project. This is the most important piece of context you configure, because it shapes every response Claude gives.&lt;/p&gt;
&lt;p&gt;Effective Project Instructions include:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project: Data Pipeline Documentation

## Your Role
You are a technical writer helping document a real-time data pipeline
built with Apache Kafka, Apache Flink, and Apache Iceberg.

## Audience
The documentation is for data engineers with 2-5 years of experience
who are familiar with batch ETL but new to stream processing.

## Style Requirements
- Use active voice
- Include code examples in Python and SQL
- Explain concepts before showing implementation
- Each section should be self-contained (readers may jump between sections)

## Terminology
- Use &amp;quot;data pipeline&amp;quot; not &amp;quot;ETL pipeline&amp;quot; or &amp;quot;data flow&amp;quot;
- Use &amp;quot;event&amp;quot; not &amp;quot;message&amp;quot; when referring to Kafka records
- Use &amp;quot;table&amp;quot; not &amp;quot;dataset&amp;quot; when referencing Iceberg tables

## Output Format
- Use H2 for section headers, H3 for subsections
- Include a &amp;quot;Key Takeaways&amp;quot; box at the end of each section
- Code blocks should include language identifiers
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Knowledge Files&lt;/h3&gt;
&lt;p&gt;You can upload various file types as project knowledge:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Research papers, specs, published docs&lt;/td&gt;
&lt;td&gt;Claude extracts text; complex layouts may lose formatting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Style guides, outlines, structured notes&lt;/td&gt;
&lt;td&gt;Cleanest parsing, best for AI consumption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code files, logs, configuration&lt;/td&gt;
&lt;td&gt;Direct text ingestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data samples, reference tables&lt;/td&gt;
&lt;td&gt;Claude can analyze and query the data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Images&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Diagrams, screenshots, mockups&lt;/td&gt;
&lt;td&gt;Claude can describe and reference visual content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;When to Use PDFs vs. Markdown&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use PDFs when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You have published documents that already exist in PDF format&lt;/li&gt;
&lt;li&gt;The document includes complex tables, figures, or formatting that matters&lt;/li&gt;
&lt;li&gt;You do not want to spend time converting the document&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Markdown when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are creating a context document specifically for Claude&lt;/li&gt;
&lt;li&gt;You want maximum parsing accuracy (no PDF extraction artifacts)&lt;/li&gt;
&lt;li&gt;The document will be updated frequently&lt;/li&gt;
&lt;li&gt;You care about precise structure (headings, code blocks, lists)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Markdown is the better choice when you have the option. PDF extraction can introduce artifacts: garbled tables, merged paragraphs, lost code formatting. If accuracy matters, convert your reference documents to Markdown.&lt;/p&gt;
&lt;h3&gt;Managing Knowledge Files Effectively&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name files descriptively.&lt;/strong&gt; &amp;quot;api-reference-v3.md&amp;quot; is better than &amp;quot;document.pdf&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add a summary at the top of each file.&lt;/strong&gt; Claude can navigate large files more effectively when they start with an overview.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keep files focused.&lt;/strong&gt; Five 20-page documents work better than one 100-page document, because Claude can identify which file is relevant to a specific question.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Remove outdated files.&lt;/strong&gt; Stale information in your knowledge base leads to stale responses.&lt;/li&gt;
&lt;/ul&gt;
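&lt;p&gt;Putting these guidelines together, a well-prepared knowledge file opens with a short summary before the detailed content. A hypothetical example (the file name, service, and endpoint details are invented for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# API Reference (v3)

&amp;gt; **Summary:** REST API for the ingestion service. Covers authentication,
&amp;gt; the batch and streaming endpoints, and error codes. Supersedes v2.

## Authentication
All requests require a bearer token in the Authorization header.

## Batch Ingestion Endpoint
POST /v3/ingest/batch accepts up to 1,000 records per request.
&lt;/code&gt;&lt;/pre&gt;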
&lt;h2&gt;Artifacts: Context That Claude Creates&lt;/h2&gt;
&lt;p&gt;Artifacts are a distinct Claude Web feature where Claude creates standalone documents, code files, diagrams, or interactive components during a conversation. Unlike regular responses, artifacts persist as discrete objects that you can reference, edit, and reuse.&lt;/p&gt;
&lt;h3&gt;How Artifacts Enhance Context Management&lt;/h3&gt;
&lt;p&gt;Artifacts serve as shared reference points between you and Claude. When Claude creates a code artifact, for example, both of you can reference it by name in subsequent messages. This is more efficient than scrolling through conversation history to find the relevant code block.&lt;/p&gt;
&lt;p&gt;Common artifact types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Code files:&lt;/strong&gt; Complete, runnable code that Claude creates and iterates on&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documents:&lt;/strong&gt; Formatted text (reports, drafts, plans) that can be edited in place&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Diagrams:&lt;/strong&gt; Mermaid or SVG diagrams that visualize architectures or workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interactive components:&lt;/strong&gt; React components that render in the browser&lt;/li&gt;
&lt;/ul&gt;
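&lt;p&gt;For instance, a diagram artifact is just Mermaid source that Claude renders in place. A minimal sketch of such an artifact (the pipeline shown is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-mermaid&quot;&gt;flowchart LR
    Kafka[Kafka topic] --&amp;gt; Flink[Flink job]
    Flink --&amp;gt; Iceberg[(Iceberg table)]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the artifact persists as a discrete object, you can later say &amp;quot;add a dead-letter topic to the pipeline diagram&amp;quot; and Claude edits the same diagram rather than generating a new one.&lt;/p&gt;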
&lt;h3&gt;Using Artifacts for Context Persistence&lt;/h3&gt;
&lt;p&gt;When working on a complex deliverable, ask Claude to create artifacts for each component. This keeps the working documents visible and accessible without being buried in conversation history. You can then reference specific artifacts (&amp;quot;Update the database schema artifact to include the new user_preferences table&amp;quot;) rather than re-describing what you need.&lt;/p&gt;
&lt;h2&gt;MCP Server Support on Claude Web&lt;/h2&gt;
&lt;p&gt;Claude Web supports MCP (Model Context Protocol) through remote MCP servers. This allows the web interface to connect to external tools and data sources without requiring a local desktop application.&lt;/p&gt;
&lt;h3&gt;How MCP Works on Claude Web&lt;/h3&gt;
&lt;p&gt;To connect a remote MCP server:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Settings &amp;gt; Connectors&lt;/strong&gt; in the Claude web interface&lt;/li&gt;
&lt;li&gt;Add a custom connector by providing the remote MCP server&apos;s URL&lt;/li&gt;
&lt;li&gt;The MCP server&apos;s tools become available within your conversations&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Claude Web supports remote MCP servers across all plan tiers (Free, Pro, Max, Team, Enterprise), though free users may have limitations on the number of connections.&lt;/p&gt;
&lt;h3&gt;What MCP Enables on Claude Web&lt;/h3&gt;
&lt;p&gt;With MCP connectors, Claude Web can interact with external services directly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Productivity tools:&lt;/strong&gt; Google Drive, Slack, Asana, monday.com&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developer tools:&lt;/strong&gt; GitHub, Sentry, Linear&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Creative tools:&lt;/strong&gt; Canva, Figma&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom APIs:&lt;/strong&gt; Any service exposed through a remote MCP server&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;MCP Apps&lt;/h3&gt;
&lt;p&gt;Claude Web also supports MCP Apps, an extension of the protocol that allows MCP servers to provide interactive user interfaces directly within the Claude interface. This means tools connected via MCP can render visual components (dashboards, project boards, design canvases) inside your Claude conversation, reducing the need to switch between applications.&lt;/p&gt;
&lt;h3&gt;Claude Web vs. Claude Desktop MCP&lt;/h3&gt;
&lt;p&gt;Claude Web connects to &lt;strong&gt;remote&lt;/strong&gt; MCP servers (cloud-hosted, accessed via URL). Claude Desktop supports both remote and &lt;strong&gt;local&lt;/strong&gt; MCP servers (processes running on your machine via STDIO). If you need to connect to local databases, local file systems, or services that are not exposed to the internet, use Claude Desktop. For cloud-hosted services and APIs, Claude Web&apos;s remote MCP support is sufficient.&lt;/p&gt;
&lt;h2&gt;Structuring Context for Maximum Impact&lt;/h2&gt;
&lt;h3&gt;The Briefing Pattern&lt;/h3&gt;
&lt;p&gt;At the start of a new conversation within a Project, briefly re-state the current focus:&lt;/p&gt;
&lt;p&gt;&amp;quot;We are working on Chapter 3 of the documentation, covering Flink job deployment. The outline is in the project files. I want to draft the section on checkpoint configuration.&amp;quot;&lt;/p&gt;
&lt;p&gt;This grounds Claude immediately without requiring it to search through the full conversation history or project files.&lt;/p&gt;
&lt;h3&gt;The Explicit Reference Pattern&lt;/h3&gt;
&lt;p&gt;When you want Claude to use specific information from your project files, reference them directly:&lt;/p&gt;
&lt;p&gt;&amp;quot;Based on the API reference document I uploaded, write example code that demonstrates the batch ingestion endpoint. Follow the code style shown in the style guide document.&amp;quot;&lt;/p&gt;
&lt;p&gt;Explicit references help Claude prioritize the right source material rather than relying on its general knowledge.&lt;/p&gt;
&lt;h3&gt;The Iterative Refinement Pattern&lt;/h3&gt;
&lt;p&gt;For complex outputs, work in stages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Outline first:&lt;/strong&gt; &amp;quot;Create an outline for this section covering X, Y, and Z&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Draft section by section:&lt;/strong&gt; &amp;quot;Write the first section based on the outline&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review and refine:&lt;/strong&gt; &amp;quot;The technical content is good but the tone is too formal. Make it conversational.&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency check:&lt;/strong&gt; &amp;quot;Review the full draft for consistency in terminology and style&amp;quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each stage keeps Claude&apos;s focus narrow, which produces better results than asking for a complete deliverable in one shot.&lt;/p&gt;
&lt;h3&gt;Managing Long Conversations&lt;/h3&gt;
&lt;p&gt;Even with a 1-million-token context window, very long conversations can degrade quality. When a conversation starts feeling unfocused:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Start a new conversation&lt;/strong&gt; within the same Project (your files and instructions carry over)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summarize progress&lt;/strong&gt; at the start of the new conversation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create artifacts&lt;/strong&gt; for important outputs so they are easy to reference in the new thread&lt;/li&gt;
&lt;/ul&gt;
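&lt;p&gt;A brief handoff message at the top of the new conversation is usually enough. Something like the following works well (the project details here are invented for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;Continuing from the previous thread in this Project.

**Done so far:** Drafted sections 1-3 of the deployment guide
(see the &amp;quot;Deployment Guide Draft&amp;quot; artifact).

**Current focus:** Section 4, checkpoint configuration.

**Open constraints:** Keep the conversational tone we settled on;
use &amp;quot;event&amp;quot; rather than &amp;quot;message&amp;quot; throughout.
&lt;/code&gt;&lt;/pre&gt;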
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Multi-Perspective Analysis&lt;/h3&gt;
&lt;p&gt;Ask Claude to analyze a problem from multiple angles in a single conversation:&lt;/p&gt;
&lt;p&gt;&amp;quot;First, analyze this architecture from a performance perspective. Then, analyze it from a cost perspective. Finally, analyze it from a maintainability perspective. Structure each analysis as a separate section.&amp;quot;&lt;/p&gt;
&lt;p&gt;This leverages Claude&apos;s large context window to produce comprehensive analysis while keeping the output organized.&lt;/p&gt;
&lt;h3&gt;The Living Document Workflow&lt;/h3&gt;
&lt;p&gt;Use a Project with a master document artifact that Claude updates throughout the engagement:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create an initial artifact (e.g., &amp;quot;Project Plan v1&amp;quot;)&lt;/li&gt;
&lt;li&gt;As work progresses, ask Claude to update the artifact&lt;/li&gt;
&lt;li&gt;The artifact becomes a living record of the project&apos;s evolution&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is particularly effective for research, planning, and documentation work.&lt;/p&gt;
&lt;h3&gt;The Expert Panel Pattern&lt;/h3&gt;
&lt;p&gt;Give Claude multiple &amp;quot;hats&amp;quot; to wear within a Project:&lt;/p&gt;
&lt;p&gt;&amp;quot;In this Project, I want you to evaluate ideas from three perspectives: (1) a cautious security engineer, (2) an enthusiastic product manager, and (3) a pragmatic senior developer. When I present an idea, respond with all three perspectives.&amp;quot;&lt;/p&gt;
&lt;p&gt;This turns a single Claude conversation into a simulated review process.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Projects for project work.&lt;/strong&gt; If you have more than 3 conversations about the same topic, you should be using a Project. Without it, you lose continuity between sessions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Uploading too many files without organization.&lt;/strong&gt; Quality beats quantity. Upload the files Claude actually needs, name them well, and include summaries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring Project Instructions.&lt;/strong&gt; Many users create Projects but skip the instructions. This is like hiring a consultant but never briefing them. The instructions are the single highest-impact piece of context you can provide.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not starting fresh conversations.&lt;/strong&gt; Long conversations accumulate noise. When you shift to a new subtopic, start a new conversation within the Project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not connecting MCP servers when you need live data.&lt;/strong&gt; Claude Web supports remote MCP servers through Settings &amp;gt; Connectors. If your task requires live connections to cloud services, set up the relevant MCP connectors. For local services not exposed to the internet, use Claude Desktop instead.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about context management strategies for AI tools and agentic workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for OpenAI Codex: A Complete Guide Across Browser, CLI, and App</title><link>https://iceberglakehouse.com/posts/2026-03-context-openai-codex/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-openai-codex/</guid><description>
OpenAI Codex is not a chatbot. It is an autonomous software engineering agent that runs tasks in isolated cloud sandboxes, operates across a browser ...</description><pubDate>Sat, 07 Mar 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenAI Codex is not a chatbot. It is an autonomous software engineering agent that runs tasks in isolated cloud sandboxes, operates across a browser interface, a command-line tool, and a dedicated macOS app, and can work on multiple tasks in parallel. Because of this architecture, context management in Codex works fundamentally differently from ChatGPT or traditional coding assistants. Instead of conversational context windows, you manage context through persistent configuration files, skill definitions, and project-level instructions that shape how the agent approaches your codebase.&lt;/p&gt;
&lt;p&gt;This guide covers every context management mechanism Codex provides, explains when to use each one, and walks through practical strategies for getting the agent to produce reliable, project-aligned results across all three interfaces.&lt;/p&gt;
&lt;h2&gt;Understanding How Codex Handles Context&lt;/h2&gt;
&lt;p&gt;Codex operates with a large context window (approximately 192,000 tokens), which means it can reason about substantial portions of a codebase in a single task. But context in Codex is not just conversation history. The agent assembles its context dynamically from multiple sources:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Your repository:&lt;/strong&gt; Codex clones your repo into a sandboxed environment for each task&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AGENTS.md files:&lt;/strong&gt; Persistent instructions that live in your repository&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skills:&lt;/strong&gt; Reusable bundles of instructions, templates, and scripts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task prompt:&lt;/strong&gt; Your natural language description of what to do&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Previous interactions:&lt;/strong&gt; In the desktop app, persistent project memory carries context across sessions&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The key insight is that most of Codex&apos;s context comes from your repository itself, not from conversational back-and-forth. This makes context management a matter of preparing your repo and configuration files rather than crafting perfect prompts.&lt;/p&gt;
&lt;h2&gt;Thinking About the Right Level of Context&lt;/h2&gt;
&lt;h3&gt;Minimal Context (Quick Tasks)&lt;/h3&gt;
&lt;p&gt;For simple, self-contained tasks like &amp;quot;add input validation to this function&amp;quot; or &amp;quot;write unit tests for utils.py,&amp;quot; the task prompt and the codebase itself provide sufficient context. Codex will explore the relevant files, understand the patterns, and produce targeted changes. You do not need to provide extensive background.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Targeted Changes)&lt;/h3&gt;
&lt;p&gt;For tasks that require understanding project conventions, architectural decisions, or specific technical requirements, provide that context in your AGENTS.md file or in the task prompt. For example: &amp;quot;Refactor the authentication module to use JWT instead of session cookies. Our API follows REST conventions and uses Express 5 middleware patterns.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Large Features or Ongoing Work)&lt;/h3&gt;
&lt;p&gt;For multi-step features, large refactors, or ongoing development work, invest in Skills and detailed AGENTS.md files. These provide the agent with your coding standards, architectural patterns, testing requirements, and deployment constraints. The desktop app&apos;s persistent project memory also helps here by retaining context across sessions.&lt;/p&gt;
&lt;h2&gt;AGENTS.md: The Foundation of Codex Context&lt;/h2&gt;
&lt;p&gt;AGENTS.md is the most important context management tool for Codex. It is a Markdown file that lives in your repository and provides persistent instructions to the agent. Codex reads AGENTS.md at the beginning of every task.&lt;/p&gt;
&lt;h3&gt;How It Works&lt;/h3&gt;
&lt;p&gt;Place an &lt;code&gt;AGENTS.md&lt;/code&gt; file at the root of your repository. Codex loads it automatically before starting any task. Think of it as a briefing document that tells the agent everything it needs to know about your project.&lt;/p&gt;
&lt;h3&gt;What to Include&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# AGENTS.md

## Project Overview
This is a Next.js 15 application with a Python FastAPI backend.
The frontend uses TypeScript, Tailwind CSS, and Zustand for state management.
The backend uses SQLAlchemy with PostgreSQL.

## Coding Standards
- Use functional components with hooks (no class components)
- All API endpoints must include input validation using Pydantic
- Write tests for every new function using pytest (backend) and Vitest (frontend)
- Use conventional commit messages: feat:, fix:, refactor:, docs:, test:

## Architecture
- Frontend routes are in src/app/ (App Router)
- API routes are in backend/api/routes/
- Database models are in backend/models/
- Shared types are in shared/types/

## Constraints
- Do not modify the database schema without explicit approval
- Do not add new dependencies without noting them in the PR description
- All environment variables must be documented in .env.example
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hierarchical AGENTS.md Files&lt;/h3&gt;
&lt;p&gt;For monorepos or large projects, you can place AGENTS.md files at different levels:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Root level:&lt;/strong&gt; Global project instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service directories:&lt;/strong&gt; Service-specific conventions (e.g., &lt;code&gt;backend/AGENTS.md&lt;/code&gt;, &lt;code&gt;frontend/AGENTS.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Global:&lt;/strong&gt; &lt;code&gt;~/.codex/AGENTS.md&lt;/code&gt; for personal preferences that apply across all projects&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;More specific files supplement (not replace) more general ones. The agent combines all applicable AGENTS.md files when executing a task.&lt;/p&gt;
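&lt;p&gt;In a monorepo, this layering might look like the following (a hypothetical layout; for a task touching &lt;code&gt;backend/&lt;/code&gt;, the agent would combine the global, root, and backend files):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;~/.codex/AGENTS.md        # personal preferences (all projects)
my-repo/
├── AGENTS.md             # project-wide architecture and conventions
├── backend/
│   └── AGENTS.md         # FastAPI-specific patterns
└── frontend/
    └── AGENTS.md         # Next.js-specific patterns
&lt;/code&gt;&lt;/pre&gt;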
&lt;h3&gt;Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Keep it updated. Stale AGENTS.md instructions lead to stale agent behavior.&lt;/li&gt;
&lt;li&gt;Be specific about constraints. &amp;quot;Follow best practices&amp;quot; is meaningless to an agent. &amp;quot;All database queries must use parameterized statements, never string interpolation&amp;quot; is actionable.&lt;/li&gt;
&lt;li&gt;Include examples of your code style. Show the agent what &amp;quot;good&amp;quot; looks like in your codebase.&lt;/li&gt;
&lt;li&gt;Document your testing strategy. Tell the agent which test framework to use, where tests live, and what coverage expectations you have.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Skills: Reusable Workflow Bundles&lt;/h2&gt;
&lt;p&gt;Skills are a step beyond AGENTS.md. They are reusable bundles that package instructions, code templates, API configurations, and scripts into a single invocable unit. Skills let you codify complex workflows so the agent can execute them reliably.&lt;/p&gt;
&lt;h3&gt;When to Use Skills&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You have a repeatable workflow (deploying to staging, onboarding a new API endpoint, migrating a database)&lt;/li&gt;
&lt;li&gt;The workflow requires multiple steps that need to happen in a specific order&lt;/li&gt;
&lt;li&gt;You want consistency across team members using Codex&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Creating a Skill&lt;/h3&gt;
&lt;p&gt;Skills are defined as structured folders with a manifest file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# SKILL.md

---
name: create-api-endpoint
description: Creates a new REST API endpoint with validation, tests, and documentation
---

## Steps
1. Create the route file in backend/api/routes/
2. Define the Pydantic request/response models in backend/api/schemas/
3. Implement the business logic in backend/services/
4. Write pytest tests in backend/tests/
5. Add the endpoint to the OpenAPI documentation
6. Update the API changelog

## Templates
Use the existing endpoint at backend/api/routes/users.py as the reference pattern.

## Validation
- Run pytest after creating the endpoint
- Verify the OpenAPI spec is valid
- Check that all response codes are documented
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Skills can be invoked explicitly by name or triggered automatically when the agent detects a task that matches the skill&apos;s description.&lt;/p&gt;
&lt;h2&gt;The Three Interfaces: Context Differences&lt;/h2&gt;
&lt;h3&gt;Browser (ChatGPT Sidebar)&lt;/h3&gt;
&lt;p&gt;The browser interface runs Codex from within the ChatGPT web application. Context management here is straightforward:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Repository:&lt;/strong&gt; Select which repo the agent works on&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task prompt:&lt;/strong&gt; Describe what you want done&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AGENTS.md:&lt;/strong&gt; Loaded automatically from the repo&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Results:&lt;/strong&gt; The agent produces a diff or pull request for review&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This interface is best for individual tasks that you want to review before merging. Context is session-scoped; each task gets a fresh sandbox.&lt;/p&gt;
&lt;h3&gt;CLI (Command Line)&lt;/h3&gt;
&lt;p&gt;The Codex CLI (&lt;code&gt;codex&lt;/code&gt;) runs in your terminal and operates on your local codebase. It offers more control over context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Approval modes:&lt;/strong&gt; Choose between Chat (interactive), Agent (approval for writes), and Full Access (autonomous)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP servers:&lt;/strong&gt; The CLI supports MCP server integration for connecting external tools&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File references:&lt;/strong&gt; Point the agent at specific files or directories&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Image inputs:&lt;/strong&gt; Pass screenshots or design mockups alongside prompts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interactive mode:&lt;/strong&gt; Have a conversation with the agent about your codebase&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The CLI is the most flexible interface for context management because you can combine AGENTS.md, MCP servers, and direct file references in a single session.&lt;/p&gt;
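&lt;p&gt;A typical CLI session looks like this (the prompt is illustrative; check &lt;code&gt;codex --help&lt;/code&gt; for the exact subcommands and flags in your installed version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Interactive session in the repo root (AGENTS.md is picked up automatically)
codex

# Non-interactive: run a single task and print the result
codex exec &amp;quot;write unit tests for utils.py&amp;quot;
&lt;/code&gt;&lt;/pre&gt;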
&lt;h3&gt;Desktop App (macOS)&lt;/h3&gt;
&lt;p&gt;The desktop app is the most powerful interface for sustained work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Persistent project memory:&lt;/strong&gt; The app retains project history and context across sessions, so you do not have to re-establish context every time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-agent orchestration:&lt;/strong&gt; Run multiple agents on different tasks simultaneously, each in its own Git worktree&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visual task management:&lt;/strong&gt; See all running and completed tasks in a unified interface&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skills management:&lt;/strong&gt; Create, organize, and invoke Skills from the app&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The desktop app is best for ongoing development work where you are regularly delegating tasks to Codex throughout your day.&lt;/p&gt;
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;The Codex CLI supports the Model Context Protocol (MCP), allowing you to connect external tools and data sources to the agent.&lt;/p&gt;
&lt;h3&gt;What MCP Enables&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Database access:&lt;/strong&gt; Let the agent query your development database to understand schema and data patterns&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Browser automation:&lt;/strong&gt; Connect a Playwright MCP server so the agent can test frontend changes by interacting with a real browser&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;API integration:&lt;/strong&gt; Give the agent access to your project management tools, documentation systems, or monitoring dashboards&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom tools:&lt;/strong&gt; Build MCP servers that expose your organization&apos;s internal tools to the agent&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When to Use MCP&lt;/h3&gt;
&lt;p&gt;MCP is most valuable when the agent needs information that is not in the repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Understanding runtime behavior (logs, database state, API responses)&lt;/li&gt;
&lt;li&gt;Verifying changes against a running application&lt;/li&gt;
&lt;li&gt;Accessing external specifications or documentation&lt;/li&gt;
&lt;li&gt;Interacting with CI/CD systems or deployment tools&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When NOT to Use MCP&lt;/h3&gt;
&lt;p&gt;For tasks that are purely code-level (refactoring, writing tests, fixing type errors), MCP adds unnecessary complexity. The codebase itself provides sufficient context. Use MCP when the agent needs to interact with the world outside the code.&lt;/p&gt;
&lt;h3&gt;Configuration&lt;/h3&gt;
&lt;p&gt;MCP servers are configured through the CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Add a Playwright MCP server for browser testing
codex mcp add playwright

# Add a custom database MCP server
codex mcp add my-db-server --command &amp;quot;node /path/to/db-mcp.js&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
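&lt;p&gt;Under the hood, the CLI persists MCP server definitions in its configuration file. A minimal sketch, assuming the TOML layout used by recent CLI versions in &lt;code&gt;~/.codex/config.toml&lt;/code&gt; (the server name and command are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;# Each MCP server gets its own table; the agent launches the command via STDIO
[mcp_servers.my-db-server]
command = &amp;quot;node&amp;quot;
args = [&amp;quot;/path/to/db-mcp.js&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;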
&lt;h2&gt;External Documents: When to Use PDFs vs. Markdown&lt;/h2&gt;
&lt;p&gt;Codex primarily operates on code, but there are situations where providing external documents improves results.&lt;/p&gt;
&lt;h3&gt;Use Markdown When:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Writing AGENTS.md or Skills (required format)&lt;/li&gt;
&lt;li&gt;Providing architectural decision records (ADRs)&lt;/li&gt;
&lt;li&gt;Sharing coding standards or style guides&lt;/li&gt;
&lt;li&gt;Documenting API specifications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Markdown is the native format for Codex context. It parses cleanly, supports code blocks, and is version-controllable in Git.&lt;/p&gt;
&lt;h3&gt;Use PDFs When:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Referencing published specifications (RFC documents, protocol specs)&lt;/li&gt;
&lt;li&gt;Sharing design documents with diagrams that do not translate well to Markdown&lt;/li&gt;
&lt;li&gt;Providing compliance or regulatory requirements that exist in PDF form&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In practice, Markdown is almost always the better choice for Codex. If you have a PDF specification, consider extracting the relevant sections into a Markdown file in your repository.&lt;/p&gt;
&lt;h2&gt;Automations: Scheduled Context Processing&lt;/h2&gt;
&lt;p&gt;Codex supports Automations, which are scheduled tasks that run in the background. These allow you to set up recurring agent work that automatically processes your codebase with predefined context.&lt;/p&gt;
&lt;h3&gt;Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Daily code reviews:&lt;/strong&gt; Schedule the agent to review new PRs every morning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependency audits:&lt;/strong&gt; Weekly check for outdated or vulnerable dependencies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation updates:&lt;/strong&gt; Automatically update API documentation after code changes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test maintenance:&lt;/strong&gt; Periodically scan for broken or flaky tests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Automations use the same AGENTS.md and Skills context as manual tasks, ensuring consistency between scheduled and ad-hoc work.&lt;/p&gt;
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Context Layering Strategy&lt;/h3&gt;
&lt;p&gt;Combine multiple context sources for complex tasks:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Global AGENTS.md&lt;/strong&gt; (in &lt;code&gt;~/.codex/&lt;/code&gt;): Personal preferences and universal standards&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project AGENTS.md&lt;/strong&gt; (in repo root): Project architecture and conventions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Directory AGENTS.md&lt;/strong&gt; (in subdirectories): Component-specific patterns&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skills:&lt;/strong&gt; Repeatable workflows for common tasks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task prompt:&lt;/strong&gt; The specific thing you want done now&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP servers:&lt;/strong&gt; Live external data for verification&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each layer adds specificity without overriding the layers above it.&lt;/p&gt;
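&lt;p&gt;Concretely, the first three layers map to files on disk like this (paths are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;~/.codex/AGENTS.md            # Layer 1: personal defaults, all projects
my-repo/AGENTS.md             # Layer 2: project architecture and conventions
my-repo/backend/AGENTS.md     # Layer 3: backend-specific patterns
my-repo/frontend/AGENTS.md    # Layer 3: frontend-specific patterns
&lt;/code&gt;&lt;/pre&gt;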
&lt;h3&gt;The Multi-Agent Pattern&lt;/h3&gt;
&lt;p&gt;Use the desktop app to run parallel agents on different aspects of a feature:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Agent 1: Implements the backend API endpoint&lt;/li&gt;
&lt;li&gt;Agent 2: Writes the frontend component&lt;/li&gt;
&lt;li&gt;Agent 3: Creates integration tests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each agent runs in its own Git worktree, so their changes do not conflict. Review and merge the results when all agents complete.&lt;/p&gt;
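&lt;p&gt;If you want to reproduce this isolation manually, standard Git worktree commands create one working directory per agent, each on its own branch (branch and path names are examples):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# One worktree per agent, each with a fresh branch
git worktree add -b feature/api ../agent-api
git worktree add -b feature/ui ../agent-ui
git worktree add -b feature/tests ../agent-tests

# Inspect active worktrees; remove one after its branch is merged
git worktree list
git worktree remove ../agent-api
&lt;/code&gt;&lt;/pre&gt;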
&lt;h3&gt;The Exploration-First Pattern&lt;/h3&gt;
&lt;p&gt;Before giving Codex a complex task, use a &amp;quot;planning&amp;quot; prompt:&lt;/p&gt;
&lt;p&gt;&amp;quot;Analyze the authentication module in backend/auth/. Describe the current architecture, identify potential issues, and suggest improvements. Do not make any changes.&amp;quot;&lt;/p&gt;
&lt;p&gt;Review the agent&apos;s analysis, then use it as context for the actual implementation task. This prevents the agent from making changes based on incomplete understanding.&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Skipping AGENTS.md:&lt;/strong&gt; Without AGENTS.md, the agent has no guidance on project conventions and will produce code that technically works but does not match your style.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overly broad tasks:&lt;/strong&gt; &amp;quot;Improve the application&amp;quot; is too vague. &amp;quot;Add rate limiting to the /api/users endpoint using express-rate-limit with a 100-request-per-minute window&amp;quot; gives the agent clear parameters.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring the review step:&lt;/strong&gt; Codex produces diffs and PRs for a reason. Always review the output, especially for tasks involving security, database changes, or public-facing features.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Skills for repeatable work:&lt;/strong&gt; If you find yourself writing the same type of task prompt repeatedly, extract it into a Skill.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using MCP when you do not need it:&lt;/strong&gt; Adding MCP servers increases complexity and potential failure points. Only connect external tools when the task genuinely requires external data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI coding tools, context engineering, and agentic development workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Context Management Strategies for ChatGPT: A Complete Guide to Getting Better Results</title><link>https://iceberglakehouse.com/posts/2026-03-context-chatgpt/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-context-chatgpt/</guid><description>
Getting consistently useful results from ChatGPT requires more than writing good prompts. The real differentiator is how you manage context: the back...</description><pubDate>Sat, 07 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Getting consistently useful results from ChatGPT requires more than writing good prompts. The real differentiator is how you manage context: the background information, instructions, documents, and accumulated knowledge that shapes every response ChatGPT generates. Without deliberate context management, you end up repeating yourself, getting generic answers, and wasting time course-correcting the AI.&lt;/p&gt;
&lt;p&gt;This guide covers every context management tool ChatGPT offers in 2026, from basic custom instructions to advanced Project workflows, and explains when to use each one.&lt;/p&gt;
&lt;h2&gt;What Is Context Management and Why Does It Matter?&lt;/h2&gt;
&lt;p&gt;Context management is the practice of controlling what information an AI model has access to when generating a response. Every time you interact with ChatGPT, the model processes a &amp;quot;context window,&amp;quot; basically the sum of all text it can see at once, including your conversation history, uploaded files, system instructions, and memory. The quality of the response depends directly on how well you curate that window.&lt;/p&gt;
&lt;p&gt;Poor context management looks like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Repeating your role, preferences, and constraints in every new conversation&lt;/li&gt;
&lt;li&gt;Uploading the same reference documents over and over&lt;/li&gt;
&lt;li&gt;Getting responses that ignore your project&apos;s specific terminology or conventions&lt;/li&gt;
&lt;li&gt;Spending more time correcting the AI than doing actual work&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Good context management means ChatGPT already knows your background, has access to relevant documents, follows your preferred style, and builds on previous conversations without you manually re-establishing all of that every time.&lt;/p&gt;
&lt;h2&gt;Thinking About the Right Level of Context&lt;/h2&gt;
&lt;p&gt;Before configuring any tools, think about what level of context a given task actually needs. Not every conversation requires the same depth.&lt;/p&gt;
&lt;h3&gt;Minimal Context (Quick Questions)&lt;/h3&gt;
&lt;p&gt;For simple factual questions, brainstorming, or one-off tasks, you often need zero setup. Just ask the question. Adding unnecessary context actually dilutes the model&apos;s attention and can lead to worse responses. If you are asking &amp;quot;What is the difference between TCP and UDP?&amp;quot; you do not need to upload your network architecture docs.&lt;/p&gt;
&lt;h3&gt;Moderate Context (Focused Work)&lt;/h3&gt;
&lt;p&gt;For tasks like drafting emails, reviewing code snippets, or writing sections of a document, provide the immediately relevant information in the conversation. Paste the specific text you are working with, reference the specific style or tone you want, and state any constraints. This keeps the model focused without overwhelming it.&lt;/p&gt;
&lt;h3&gt;Comprehensive Context (Extended Projects)&lt;/h3&gt;
&lt;p&gt;For ongoing projects, research, or multi-session work, use ChatGPT&apos;s structured context tools (Projects, CustomGPTs, Memory). This is where deliberate context management pays the biggest dividends. You define the context once and it persists across every conversation in that workspace.&lt;/p&gt;
&lt;h3&gt;How to Decide&lt;/h3&gt;
&lt;p&gt;Ask yourself: &amp;quot;If I handed this task to a knowledgeable colleague, what would I need to tell them before they could start?&amp;quot; If the answer is &amp;quot;nothing, just the question,&amp;quot; use minimal context. If you would need to hand them a style guide, a codebase overview, and three reference documents, set up a Project.&lt;/p&gt;
&lt;h2&gt;Custom Instructions: Your Global Defaults&lt;/h2&gt;
&lt;p&gt;Custom Instructions are the most basic and most overlooked context management tool in ChatGPT. They apply to every conversation you have (unless you use a Project or CustomGPT with its own instructions).&lt;/p&gt;
&lt;h3&gt;How They Work&lt;/h3&gt;
&lt;p&gt;Navigate to &lt;strong&gt;Settings &amp;gt; Personalization &amp;gt; Custom Instructions&lt;/strong&gt;. You get two fields:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;About you:&lt;/strong&gt; Tell ChatGPT who you are, what you do, and what background knowledge to assume. For example: &amp;quot;I am a senior data engineer working with Apache Iceberg, Spark, and Python. I build data lakehouse architectures for financial services companies.&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to respond:&lt;/strong&gt; Define your preferred output format, tone, and constraints. For example: &amp;quot;Be concise. Use code examples in Python unless I specify otherwise. Skip basic explanations of concepts I already know. Never use em dashes.&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Keep instructions specific and actionable. &amp;quot;Be helpful&amp;quot; is useless. &amp;quot;When I ask about SQL, always format queries with uppercase keywords and include comments explaining each join&amp;quot; is useful.&lt;/li&gt;
&lt;li&gt;Update them as your needs change. If you switch projects or roles, update your instructions.&lt;/li&gt;
&lt;li&gt;Use negative constraints. Telling ChatGPT what NOT to do is often more effective than listing everything it should do.&lt;/li&gt;
&lt;li&gt;Do not overload them. Custom Instructions have a character limit. Use them for universal preferences, not project-specific details.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Limitations&lt;/h3&gt;
&lt;p&gt;Custom Instructions are global. They apply everywhere unless overridden by a Project or CustomGPT. If you work across multiple domains (coding, writing, research), your instructions need to be general enough to help everywhere without being so vague they help nowhere. For domain-specific work, use Projects instead.&lt;/p&gt;
&lt;h2&gt;Memory: Persistent Knowledge Across Conversations&lt;/h2&gt;
&lt;p&gt;ChatGPT&apos;s Memory feature allows the model to remember facts, preferences, and context across conversations without you re-stating them.&lt;/p&gt;
&lt;h3&gt;How Memory Works&lt;/h3&gt;
&lt;p&gt;When enabled (Settings &amp;gt; Personalization &amp;gt; Memory), ChatGPT can save information you share during conversations. It stores these as discrete facts: &amp;quot;User prefers Python over JavaScript,&amp;quot; &amp;quot;User&apos;s company uses PostgreSQL 15,&amp;quot; &amp;quot;User is writing a book about data engineering.&amp;quot;&lt;/p&gt;
&lt;p&gt;You can explicitly tell ChatGPT to remember things: &amp;quot;Remember that my team uses the Google style guide for Python.&amp;quot; You can also ask it what it remembers (&amp;quot;What do you know about me?&amp;quot;) and delete specific memories or clear them all.&lt;/p&gt;
&lt;h3&gt;When to Use Memory&lt;/h3&gt;
&lt;p&gt;Memory is best for facts that apply broadly across conversations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your technical stack and preferences&lt;/li&gt;
&lt;li&gt;Your role and expertise level&lt;/li&gt;
&lt;li&gt;Recurring project names or team members&lt;/li&gt;
&lt;li&gt;Style preferences that should persist everywhere&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;When NOT to Use Memory&lt;/h3&gt;
&lt;p&gt;Memory is not a substitute for Projects or file uploads. It stores brief facts, not documents or complex context. Do not try to make ChatGPT &amp;quot;memorize&amp;quot; an entire API specification through Memory. Use file uploads for that.&lt;/p&gt;
&lt;h3&gt;Temporary Chats&lt;/h3&gt;
&lt;p&gt;If you want a conversation without Memory recall (for example, helping someone else with their problem or exploring a sensitive topic), use &lt;strong&gt;Temporary Chat&lt;/strong&gt;. This creates a blank-slate conversation that does not read from or write to Memory.&lt;/p&gt;
&lt;h2&gt;Projects: Dedicated Workspaces for Focused Work&lt;/h2&gt;
&lt;p&gt;Projects are ChatGPT&apos;s most powerful context management feature for sustained work. A Project is a dedicated workspace that groups related conversations, uploaded files, and custom instructions.&lt;/p&gt;
&lt;h3&gt;Setting Up a Project&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Projects&lt;/strong&gt; in the sidebar&lt;/li&gt;
&lt;li&gt;Create a new Project with a descriptive name&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;Project Instructions&lt;/strong&gt;: These override or supplement your global Custom Instructions for every conversation within this Project&lt;/li&gt;
&lt;li&gt;Upload &lt;strong&gt;files&lt;/strong&gt;: Up to 20 files per Project (PDFs, CSVs, images, text files). ChatGPT can reference these across all conversations in the Project.&lt;/li&gt;
&lt;li&gt;Start conversations within the Project&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Project Instructions vs. Custom Instructions&lt;/h3&gt;
&lt;p&gt;Project Instructions are scoped to the Project. They are the right place for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Project-specific terminology and conventions&lt;/li&gt;
&lt;li&gt;The structure or outline of what you are building&lt;/li&gt;
&lt;li&gt;Style guides or formatting requirements specific to this work&lt;/li&gt;
&lt;li&gt;Background context about the domain&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Think of Custom Instructions as your personal defaults and Project Instructions as the briefing document for a specific engagement.&lt;/p&gt;
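&lt;p&gt;An illustrative set of Project Instructions (the project, terminology, and rules here are invented for the example):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;You are helping with the &amp;quot;Q3 Data Platform Migration&amp;quot; project.

- We are migrating Hive tables to Apache Iceberg on S3.
- &amp;quot;Catalog&amp;quot; always means the Nessie catalog unless stated otherwise.
- Output SQL in Spark SQL dialect; format DDL with one column per line.
- When unsure about a table&apos;s schema, ask before generating code.
&lt;/code&gt;&lt;/pre&gt;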
&lt;h3&gt;File Management in Projects&lt;/h3&gt;
&lt;p&gt;You can upload various file types to a Project&apos;s knowledge base:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Type&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PDF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reference documentation, research papers, specifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV/Excel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data samples, structured reference data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text/Markdown&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Style guides, code snippets, outlines, notes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Images&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Diagrams, mockups, screenshots for visual context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;When to Use PDFs vs. Markdown&lt;/h3&gt;
&lt;p&gt;This is a practical question that matters more than most people realize.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use PDFs when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The document is a published specification, whitepaper, or research paper&lt;/li&gt;
&lt;li&gt;Layout and formatting matter (tables, figures, page references)&lt;/li&gt;
&lt;li&gt;You have the document in PDF form and do not want to convert it&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use Markdown when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You are creating context documents specifically for ChatGPT&lt;/li&gt;
&lt;li&gt;You want the AI to parse the content with maximum accuracy&lt;/li&gt;
&lt;li&gt;The content is structured text (code standards, API docs, outlines)&lt;/li&gt;
&lt;li&gt;You plan to update the document frequently&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Markdown is generally a better format for AI consumption. The structure is unambiguous, there are no encoding issues from PDF extraction, and the content is more reliably parsed. If you are creating a reference document from scratch to guide ChatGPT, write it in Markdown.&lt;/p&gt;
&lt;h3&gt;Project Sharing&lt;/h3&gt;
&lt;p&gt;Projects can be shared with other ChatGPT users. When you share a Project, collaborators get access to the uploaded files, Project Instructions, and conversation history. This makes Projects useful for team workflows where multiple people need the AI to have the same context.&lt;/p&gt;
&lt;h2&gt;CustomGPTs: Specialized Assistants for Repeatable Tasks&lt;/h2&gt;
&lt;p&gt;CustomGPTs let you create purpose-built AI assistants with specific instructions, knowledge bases, and capabilities. They are the right tool when you have a repeatable workflow that requires specialized context.&lt;/p&gt;
&lt;h3&gt;When to Use a CustomGPT vs. a Project&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Project&lt;/th&gt;
&lt;th&gt;CustomGPT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extended work on a specific project&lt;/td&gt;
&lt;td&gt;Repeatable tasks across different projects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One body of work&lt;/td&gt;
&lt;td&gt;One type of task&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shareable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (collaborators)&lt;/td&gt;
&lt;td&gt;Yes (public or private)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom actions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (API integrations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&amp;quot;Q3 Marketing Campaign&amp;quot;&lt;/td&gt;
&lt;td&gt;&amp;quot;Technical Blog Editor&amp;quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A CustomGPT is like hiring a specialist. A Project is like setting up a war room for a specific mission.&lt;/p&gt;
&lt;h3&gt;Building an Effective CustomGPT&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Instructions:&lt;/strong&gt; Write detailed behavioral instructions. Include the role, tone, output format, and constraints. Be as specific as your best Custom Instructions, but scoped to this GPT&apos;s purpose.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Knowledge files:&lt;/strong&gt; Upload reference documents that the GPT should always have access to. These function like Project files but are permanently attached to the GPT.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Actions:&lt;/strong&gt; Connect external APIs so the GPT can fetch real-time data, submit forms, or interact with your tools.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Knowledge Base Best Practices&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Name your files descriptively. &amp;quot;company-style-guide-2026.md&amp;quot; is better than &amp;quot;doc1.pdf.&amp;quot;&lt;/li&gt;
&lt;li&gt;Include a table of contents or summary at the top of large documents. This helps ChatGPT navigate the content.&lt;/li&gt;
&lt;li&gt;Keep individual files focused. Ten small, focused files work better than one 200-page PDF.&lt;/li&gt;
&lt;li&gt;Test your GPT after uploading. Ask questions that require it to reference specific sections of your documents to verify it is parsing them correctly.&lt;/li&gt;
&lt;/ul&gt;
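&lt;p&gt;Putting the naming and summary advice together, a large knowledge file might open with a short orientation block so the GPT can navigate it (the contents are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# company-style-guide-2026.md

Summary: Writing and formatting standards for all external technical content.

## Contents
1. Voice and tone
2. Code sample formatting
3. Terminology (approved and banned terms)
4. Review checklist
&lt;/code&gt;&lt;/pre&gt;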
&lt;h2&gt;MCP Server Support&lt;/h2&gt;
&lt;p&gt;ChatGPT added support for the Model Context Protocol (MCP) in September 2025 through a &amp;quot;Developer Mode&amp;quot; feature. This is available to paying users on Plus, Pro, Team, Enterprise, and Education plans.&lt;/p&gt;
&lt;h3&gt;How MCP Works in ChatGPT&lt;/h3&gt;
&lt;p&gt;With Developer Mode enabled, ChatGPT can connect to MCP servers that expose external tools and data sources. This means ChatGPT can interact with services like Jira, Google Calendar, databases, and custom APIs directly from the chat interface. MCP connections are configured through the ChatGPT settings under Developer Mode, where you specify the MCP server endpoints.&lt;/p&gt;
&lt;h3&gt;What MCP Enables&lt;/h3&gt;
&lt;p&gt;MCP in ChatGPT goes beyond read-only data access. It supports both read and write operations, meaning ChatGPT can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fetch data from external systems (database queries, API lookups)&lt;/li&gt;
&lt;li&gt;Update external systems (create tickets, send messages, update records)&lt;/li&gt;
&lt;li&gt;Interact with local files and applications when using the desktop app&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;MCP vs. CustomGPT Actions&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;MCP Servers&lt;/th&gt;
&lt;th&gt;CustomGPT Actions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Protocol&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standardized (MCP)&lt;/td&gt;
&lt;td&gt;Custom API definitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configure via Developer Mode&lt;/td&gt;
&lt;td&gt;Build into a CustomGPT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Portability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Works across MCP-compatible tools&lt;/td&gt;
&lt;td&gt;ChatGPT only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read and write&lt;/td&gt;
&lt;td&gt;Read and write&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;MCP servers offer the advantage of portability: the same MCP server you use with ChatGPT works with Claude Desktop, Cursor, and other MCP-compatible tools. CustomGPT Actions are ChatGPT-specific but offer tighter integration within the CustomGPT workflow.&lt;/p&gt;
&lt;h3&gt;Security Considerations&lt;/h3&gt;
&lt;p&gt;OpenAI has cautioned that using Developer Mode with write operations is powerful but carries risk. Always test MCP server connections carefully, especially for servers that can modify external systems. Be aware of potential prompt injection risks when connecting to untrusted data sources.&lt;/p&gt;
&lt;h2&gt;Structuring Context for Maximum Effectiveness&lt;/h2&gt;
&lt;p&gt;Beyond the tools themselves, how you structure the information you give ChatGPT matters significantly.&lt;/p&gt;
&lt;h3&gt;The Inverted Pyramid&lt;/h3&gt;
&lt;p&gt;Put the most important context first. ChatGPT pays more attention to the beginning and end of its context window. Structure your information like a news article:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Lead:&lt;/strong&gt; The task, constraint, and desired output format&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Body:&lt;/strong&gt; Supporting details, reference material, examples&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Background:&lt;/strong&gt; Nice-to-have context that might help but is not critical&lt;/li&gt;
&lt;/ol&gt;
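&lt;p&gt;A prompt structured this way might look like the following sketch (the task and details are invented for the example):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;Task: Rewrite the release notes below for a non-technical audience.
Constraints: Under 300 words, no jargon, bullet points only.

Reference material:
[paste the draft release notes here]

Background (optional): This release follows two delayed releases,
so the tone should be confident but not boastful.
&lt;/code&gt;&lt;/pre&gt;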
&lt;h3&gt;Be Explicit About What You Want&lt;/h3&gt;
&lt;p&gt;Vague requests get vague results. Compare:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vague:&lt;/strong&gt; &amp;quot;Help me with my database.&amp;quot;&lt;br&gt;
&lt;strong&gt;Specific:&lt;/strong&gt; &amp;quot;Review this PostgreSQL query for performance issues. The table has 50 million rows, is partitioned by date, and has indexes on customer_id and order_date. Suggest index changes or query rewrites that would reduce execution time.&amp;quot;&lt;/p&gt;
&lt;p&gt;The specific version gives ChatGPT enough context to provide actionable advice. The vague version will produce a generic tutorial.&lt;/p&gt;
&lt;h3&gt;Use Reference Examples&lt;/h3&gt;
&lt;p&gt;When you want a specific output format or style, give ChatGPT an example. &amp;quot;Write a commit message in this style: [example]&amp;quot; is far more effective than describing the style in abstract terms. Examples are compressed context. One good example communicates more than a paragraph of description.&lt;/p&gt;
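&lt;p&gt;For the commit message case, the prompt plus example might look like this (the style and ticket number are purely illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;Write a commit message for the attached diff in this style:

feat(auth): add refresh-token rotation

- Rotate refresh tokens on every use
- Revoke the full token family on reuse detection

Refs: #1234
&lt;/code&gt;&lt;/pre&gt;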
&lt;h3&gt;Manage Conversation Length&lt;/h3&gt;
&lt;p&gt;Long conversations degrade response quality. As the conversation history grows, ChatGPT has less room in its context window for your actual question and the reasoning needed to answer it. For extended work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start new conversations for new topics, even within the same Project&lt;/li&gt;
&lt;li&gt;Summarize progress before starting a new conversation (&amp;quot;Here is where we left off: [summary]&amp;quot;)&lt;/li&gt;
&lt;li&gt;Use Projects so you do not lose the files and instructions when you start fresh&lt;/li&gt;
&lt;/ul&gt;
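&lt;p&gt;A handoff summary for starting a fresh conversation might follow a template like this (the project details are invented for the example):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;Here is where we left off:

- Goal: migrate our ETL jobs from Airflow 2 to Airflow 3
- Done: inventory of 14 DAGs; migrated the 3 simplest
- Decisions: keep the TaskFlow API, drop SubDAGs entirely
- Next: migrate the dynamic DAGs, starting with daily_sales
&lt;/code&gt;&lt;/pre&gt;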
&lt;h2&gt;Advanced Patterns&lt;/h2&gt;
&lt;h3&gt;The Briefing Document Pattern&lt;/h3&gt;
&lt;p&gt;Create a Markdown file that serves as a comprehensive briefing for ChatGPT. Include:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project: [Name]

## Overview
[2-3 sentence summary of what this project is]

## Goals
- [Specific goal 1]
- [Specific goal 2]

## Constraints
- [Technical constraints]
- [Style/format constraints]

## Key Terminology
- **Term 1:** Definition specific to this project
- **Term 2:** Definition specific to this project

## Current Status
[Where the project stands right now]

## What I Need Help With
[Specific areas where ChatGPT should focus]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Upload this as a Project file or paste it at the start of key conversations. It gives ChatGPT a structured, scannable overview that dramatically improves response relevance.&lt;/p&gt;
&lt;h3&gt;The Iterative Refinement Loop&lt;/h3&gt;
&lt;p&gt;For complex outputs (long documents, code architectures, research reports):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Start with a high-level outline and get ChatGPT&apos;s feedback&lt;/li&gt;
&lt;li&gt;Refine the outline based on the feedback&lt;/li&gt;
&lt;li&gt;Generate content section by section, reviewing each before moving on&lt;/li&gt;
&lt;li&gt;Use follow-up prompts to refine specific sections&lt;/li&gt;
&lt;li&gt;Do a final consistency pass&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This approach keeps the context focused at each step rather than asking ChatGPT to hold the entire deliverable in mind at once.&lt;/p&gt;
&lt;h3&gt;Multi-GPT Workflows&lt;/h3&gt;
&lt;p&gt;For complex projects, use different CustomGPTs for different aspects of the work:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &amp;quot;Research GPT&amp;quot; with academic papers and data sources&lt;/li&gt;
&lt;li&gt;A &amp;quot;Writing GPT&amp;quot; with your style guide and brand voice instructions&lt;/li&gt;
&lt;li&gt;A &amp;quot;Code Review GPT&amp;quot; with your codebase standards and architecture docs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Feed the output of one into the next. This keeps each GPT focused on what it does best instead of trying to make one GPT do everything.&lt;/p&gt;
&lt;h2&gt;Common Mistakes to Avoid&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Overloading context:&lt;/strong&gt; More is not always better. If you upload 20 files but your question only relates to one, the AI may pull irrelevant information from the other 19. Be selective.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ignoring Custom Instructions:&lt;/strong&gt; Many users never set them up, then wonder why ChatGPT gives generic responses. Spending 10 minutes on Custom Instructions saves hours of correction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Not using Projects for project work:&lt;/strong&gt; Having 50 disconnected conversations about the same project means ChatGPT has no persistent context. Use Projects.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Treating Memory as a database:&lt;/strong&gt; Memory stores brief facts, not documents. If you need ChatGPT to reference a 30-page specification, upload it as a file.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Never clearing context:&lt;/strong&gt; Sometimes the best thing to do is start a fresh conversation. If ChatGPT seems confused or is repeating mistakes, the conversation history may be working against you.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Recommended Workflow for New Users&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;First 10 minutes:&lt;/strong&gt; Set up Custom Instructions with your role, expertise level, and response preferences&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First project:&lt;/strong&gt; Create a Project, upload 2-3 key reference documents, and write Project Instructions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First week:&lt;/strong&gt; Enable Memory and let it accumulate useful facts. Review and edit memories periodically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;First month:&lt;/strong&gt; If you find yourself doing the same type of task repeatedly, build a CustomGPT for it.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;To learn more about working effectively with AI tools, including detailed context management strategies for coding, research, and professional workflows, check out these resources by Alex Merced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/2026-Guide-AI-Assisted-Development-Engineering-ebook/dp/B0GQW7CTML/&quot;&gt;The 2026 Guide to AI-Assisted Development&lt;/a&gt; covers AI-assisted development workflows, prompt engineering, and context strategies for software engineers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.amazon.com/Lakehouses-Apache-Iceberg-Agentic-Hands/dp/B0GQNY21TD/&quot;&gt;The 2026 Guide to Lakehouses, Apache Iceberg and Agentic AI&lt;/a&gt; explores how AI agents are reshaping data architecture and how to build systems that support agentic workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And for a fictional take on where AI is heading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.amazon.com/Emperors-I-Valley-Novel-Future/dp/B0GQHKF4ZT/&quot;&gt;The Emperors of A.I. Valley: A Novel of Power, Code, and the War for the Future&lt;/a&gt; is a novel about the power struggles and ethical dilemmas behind the companies building the most powerful AI systems in the world.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Zed: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-zed/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-zed/</guid><description>
Zed is an open-source, GPU-accelerated code editor written in Rust. It is designed for speed and collaboration, with a built-in AI assistant that sup...</description><pubDate>Thu, 05 Mar 2026 21:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Zed is an open-source, GPU-accelerated code editor written in Rust. It is designed for speed and collaboration, with a built-in AI assistant that supports multiple LLM providers and an agent mode for autonomous multi-step development. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Zed&apos;s AI agent the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. Zed&apos;s performance advantage is significant for data work: its GPU-accelerated rendering handles large result sets and complex code without the lag common in Electron-based editors.&lt;/p&gt;
&lt;p&gt;Zed supports MCP through its settings, uses &lt;code&gt;AGENTS.md&lt;/code&gt; as its primary context file, and provides agent profiles for scoping tool access to specific workflows.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/13/zed-dremio-architecture.png&quot; alt=&quot;Zed code editor AI assistant connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Zed&lt;/h2&gt;
&lt;p&gt;If you do not already have Zed installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Zed&lt;/strong&gt; from &lt;a href=&quot;https://zed.dev/&quot;&gt;zed.dev&lt;/a&gt; (available for macOS and Linux).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by running the installer or using Homebrew: &lt;code&gt;brew install zed&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure your AI model&lt;/strong&gt; in &lt;strong&gt;Settings &amp;gt; AI&lt;/strong&gt;. Zed supports its own hosted models, Anthropic, OpenAI, Google, and Ollama for local models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by launching Zed and opening your project directory.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Zed is free and open-source under the GPL license. Its native Rust architecture makes it significantly faster than Electron-based editors, with sub-millisecond input latency and GPU-accelerated rendering.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project ships with a built-in MCP server. Zed supports MCP through its JSON settings file, where MCP servers are configured as context servers.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Zed, you configure the MCP connection through &lt;code&gt;settings.json&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Zed MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Zed&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Open Zed&apos;s settings (&lt;code&gt;Cmd+,&lt;/code&gt;) and add the MCP server configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;context_servers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: {
        &amp;quot;path&amp;quot;: &amp;quot;npx&amp;quot;,
        &amp;quot;args&amp;quot;: [&amp;quot;-y&amp;quot;, &amp;quot;@dremio/mcp-client&amp;quot;, &amp;quot;--url&amp;quot;, &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For project-level configuration, create a &lt;code&gt;.zed/settings.json&lt;/code&gt; file in your project root with the same structure.&lt;/p&gt;
&lt;p&gt;Zed&apos;s AI agent now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column definitions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls catalog descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows data lineage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test by opening the agent panel and asking: &amp;quot;What tables are available in Dremio?&amp;quot;&lt;/p&gt;
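&lt;p&gt;Under the hood, each of these tools is invoked through MCP&apos;s JSON-RPC 2.0 &lt;code&gt;tools/call&lt;/code&gt; method. A minimal sketch of the request shape an MCP client sends on your behalf (the &lt;code&gt;sql&lt;/code&gt; argument key and the query text are illustrative, not Dremio&apos;s documented parameter names):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Sketch of the JSON-RPC 2.0 request behind an MCP tool call.
# The tool name comes from the list above; the &apos;sql&apos; argument key
# and the query text are illustrative, not documented parameter names.
import json

request = {
    &apos;jsonrpc&apos;: &apos;2.0&apos;,
    &apos;id&apos;: 1,
    &apos;method&apos;: &apos;tools/call&apos;,
    &apos;params&apos;: {
        &apos;name&apos;: &apos;RunSqlQuery&apos;,
        &apos;arguments&apos;: {&apos;sql&apos;: &apos;SELECT 1&apos;},
    },
}

payload = json.dumps(request)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Zed builds and sends these requests for you; the sketch is only to show what the agent is doing when it calls a Dremio tool.&lt;/p&gt;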
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the dremio-mcp server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;context_servers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: {
        &amp;quot;path&amp;quot;: &amp;quot;uv&amp;quot;,
        &amp;quot;args&amp;quot;: [
          &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
          &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
        ]
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use AGENTS.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;Zed uses &lt;code&gt;AGENTS.md&lt;/code&gt; as its primary context file. Place it in your project root and reference it in agent conversations with &lt;code&gt;@agents.md&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Writing a Dremio Context File&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;AGENTS.md&lt;/code&gt; in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio Project Context

## SQL Conventions
- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials
- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Cloud endpoint: environment variable DREMIO_URI

## Terminology
- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;

## Reference
- SQL syntax: ./docs/dremio-sql-reference.md
- Python SDK: ./docs/dremioframe-patterns.md
- Table schemas: ./docs/table-schemas.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When starting a new agent session, type &lt;code&gt;@agents.md&lt;/code&gt; to load the context. Zed will include the file contents in the agent&apos;s working context.&lt;/p&gt;
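&lt;p&gt;The conventions above shape the code the agent writes, too. A small illustrative helper, assuming hypothetical table and column names, shows the &lt;code&gt;TIMESTAMPDIFF&lt;/code&gt; and &lt;code&gt;DREMIO_PAT&lt;/code&gt; rules in practice:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Illustrative helper following the conventions above: TIMESTAMPDIFF
# for durations and the DREMIO_PAT environment variable for credentials.
# Table and column names are hypothetical.
import os

def order_duration_sql(table=&apos;analytics.orders&apos;):
    # Minutes between order creation and shipment.
    return (
        f&apos;SELECT order_id, &apos;
        f&apos;TIMESTAMPDIFF(MINUTE, created_at, shipped_at) AS minutes_to_ship &apos;
        f&apos;FROM {table}&apos;
    )

def dremio_pat():
    # Never hardcode tokens; read the PAT from the environment.
    token = os.environ.get(&apos;DREMIO_PAT&apos;)
    if token is None:
        raise RuntimeError(&apos;Set the DREMIO_PAT environment variable&apos;)
    return token
&lt;/code&gt;&lt;/pre&gt;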
&lt;h3&gt;Agent Profiles&lt;/h3&gt;
&lt;p&gt;Zed supports agent profiles for controlling which tools are available. Create a &amp;quot;Dremio Data&amp;quot; profile that enables MCP tools and file editing while restricting terminal access:&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;Settings &amp;gt; AI &amp;gt; Profiles&lt;/strong&gt;, create a profile with specific tool permissions. This is useful for separating data exploration (read-only MCP queries) from development work (full tool access).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/13/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users, and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides knowledge files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy the knowledge directory into your project. Reference it in your &lt;code&gt;AGENTS.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio conventions, read the knowledge files in ./dremio-skill/knowledge/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in &lt;code&gt;AGENTS.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL validation, read DREMIO_AGENT.md in ./dremio-agent-md/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own AGENTS.md Context&lt;/h2&gt;
&lt;p&gt;Create a comprehensive context file tailored to your team:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Team Dremio Data Context

## Environment
- Lakehouse: Dremio Cloud
- Catalog: Apache Polaris-based Open Catalog
- Architecture: Medallion (bronze → silver → gold)

## Table Schemas (updated weekly)
For exact column definitions, read ./docs/table-schemas.md

## SQL Standards
- Bronze: raw.*, Silver: cleaned.*, Gold: analytics.*
- Always use TIMESTAMP, never DATE
- Validate functions against ./docs/dremio-sql-reference.md

## Common Queries
For frequently used patterns, read ./docs/common-queries.md

## Python SDK
- Use dremioframe for all Dremio connections
- Patterns: read ./docs/dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Zed&apos;s fast file loading means referencing external docs adds negligible latency. Keep the &lt;code&gt;AGENTS.md&lt;/code&gt; concise and point to detailed reference files.&lt;/p&gt;
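&lt;p&gt;Standards like these can also be checked mechanically before generated SQL runs. A rough sketch of such a lint pass, using substring heuristics rather than a real SQL parser:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Rough lint for generated SQL against the standards above: layer
# prefixes raw./cleaned./analytics. and TIMESTAMP over DATE. These
# checks are heuristics, not a SQL parser.
import re

LAYER_PREFIXES = (&apos;raw.&apos;, &apos;cleaned.&apos;, &apos;analytics.&apos;)

def lint_sql(sql):
    problems = []
    for table in re.findall(r&apos;(?:FROM|JOIN)\s+([\w.]+)&apos;, sql, re.IGNORECASE):
        if not table.startswith(LAYER_PREFIXES):
            problems.append(f&apos;unexpected layer prefix: {table}&apos;)
    if re.search(r&apos;CAST\s*\([^)]*AS\s+DATE\b&apos;, sql, re.IGNORECASE):
        problems.append(&apos;use TIMESTAMP, never DATE&apos;)
    return problems
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running a pass like this over the agent&apos;s output catches convention drift early, before queries hit Dremio.&lt;/p&gt;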
&lt;h2&gt;Using Dremio with Zed: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Zed&apos;s AI agent can execute complete data projects with the speed advantage of a native editor.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Open the agent panel and ask:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Show growth rates and regional breakdown.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Zed&apos;s agent uses MCP to discover tables, writes SQL, and returns results. The GPU-accelerated rendering handles large result tables without lag.&lt;/p&gt;
&lt;p&gt;Follow up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For products with negative growth, show the correlation between customer complaints and revenue decline over the last 6 months.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent maintains context and generates multi-table analytical queries.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask the agent to create a dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query Dremio gold-layer views for revenue metrics and build an HTML dashboard with Plotly.js. Include monthly trends, regional heatmap, and top customer charts. Add a dark theme, date filters, and export buttons.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent generates the complete dashboard across multiple files. Zed&apos;s multi-buffer editing lets you see all generated files side-by-side without performance degradation.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build interactive tools:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app connected to Dremio via dremioframe. Include schema browsing, data preview, SQL editor, and CSV download. Generate all files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent generates the full application. Zed&apos;s speed makes iterating on the generated code feel instantaneous.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Medallion pipeline using dremioframe. Bronze ingestion, silver cleaning with deduplication and validation, gold aggregations with business metrics. Include logging and dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent writes the pipeline following your &lt;code&gt;AGENTS.md&lt;/code&gt; conventions.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI app serving Dremio gold-layer data. Add endpoints for analytics, customer segments, and product performance. Include Pydantic models and OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent generates the complete API server.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, reference file pointers&lt;/td&gt;
&lt;td&gt;Teams that want speed + context control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Context&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, profiles, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with specific workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for live data access. Add &lt;code&gt;AGENTS.md&lt;/code&gt; with conventions and reference file pointers. Use agent profiles to scope tool access for different workflows.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to Zed&apos;s &lt;code&gt;settings.json&lt;/code&gt; under &lt;code&gt;context_servers&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;AGENTS.md&lt;/code&gt; with your Dremio conventions.&lt;/li&gt;
&lt;li&gt;Open the agent panel and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Zed&apos;s agent accurate data context, and Zed&apos;s native performance makes data exploration and code generation feel effortless.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Windsurf: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-windsurf/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-windsurf/</guid><description>
Windsurf is an AI-native code editor built as a fork of VS Code. Its standout feature is Cascade, an agentic AI system that plans and executes multi-...</description><pubDate>Thu, 05 Mar 2026 20:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Windsurf is an AI-native code editor built as a fork of VS Code. Its standout feature is Cascade, an agentic AI system that plans and executes multi-step coding tasks autonomously. Cascade understands your entire codebase and can chain together multiple file edits, terminal commands, and tool calls in a single flow. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Cascade the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. Without this connection, Cascade treats Dremio like a generic database. With it, the agent knows your schemas, business logic encoded in views, and the correct Dremio SQL dialect.&lt;/p&gt;
&lt;p&gt;Windsurf&apos;s Cascade is especially well-suited for data projects because it can chain together discovery, querying, code generation, and testing in a single autonomous flow. Ask it to explore your Dremio catalog, identify relevant tables, write a pipeline, and generate tests — all in one prompt.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/12/windsurf-dremio-architecture.png&quot; alt=&quot;Windsurf AI editor with Cascade agent connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Windsurf&lt;/h2&gt;
&lt;p&gt;If you do not already have Windsurf installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Windsurf&lt;/strong&gt; from &lt;a href=&quot;https://windsurf.com/&quot;&gt;windsurf.com&lt;/a&gt; (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by running the installer. Windsurf is a VS Code fork, so all VS Code extensions and themes work.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with a Windsurf account. The free tier includes limited Cascade credits; Pro provides expanded access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by selecting File &amp;gt; Open Folder and pointing to your project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open Cascade&lt;/strong&gt; by pressing &lt;code&gt;Cmd+L&lt;/code&gt; (macOS) or &lt;code&gt;Ctrl+L&lt;/code&gt; to access the agentic AI chat panel.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you are migrating from VS Code or Cursor, your existing extensions and settings transfer automatically.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Windsurf supports MCP natively through its Cascade settings.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Windsurf, you configure the MCP connection through the Cascade settings or &lt;code&gt;mcp_config.json&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. The MCP server URL is listed on the project overview page. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Windsurf MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Windsurf&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;You have two options:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option A: Via Settings UI&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Open Windsurf Settings and navigate to &lt;strong&gt;Cascade &amp;gt; MCP&lt;/strong&gt;. Click &lt;strong&gt;Add custom server&lt;/strong&gt; and paste your Dremio MCP configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Option B: Via mcp_config.json&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Create or edit &lt;code&gt;~/.codeium/windsurf/mcp_config.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart Windsurf. Cascade now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by opening Cascade and asking: &amp;quot;What tables are available in Dremio?&amp;quot; Cascade will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;mcp_config.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use Windsurf Rules for Dremio Context&lt;/h2&gt;
&lt;p&gt;Windsurf supports &lt;code&gt;.windsurfrules&lt;/code&gt; files in your project root for persistent AI instructions. These work similarly to &lt;code&gt;.cursorrules&lt;/code&gt; and are loaded into every Cascade interaction.&lt;/p&gt;
&lt;h3&gt;Project-Wide Rules&lt;/h3&gt;
&lt;p&gt;Create a &lt;code&gt;.windsurfrules&lt;/code&gt; file in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio SQL Conventions
- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name without a catalog prefix
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

# Credentials
- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Dremio Cloud endpoint: environment variable DREMIO_URI

# Terminology
- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Windsurf also reads &lt;code&gt;.cursorrules&lt;/code&gt; as a fallback if no &lt;code&gt;.windsurfrules&lt;/code&gt; file is present, so if your team uses Cursor alongside Windsurf, shared rules files work across both editors.&lt;/p&gt;
&lt;h3&gt;Cascade Memory and Context&lt;/h3&gt;
&lt;p&gt;Cascade has a persistent memory system. As you work with Dremio tables, Cascade remembers the schemas, query patterns, and conventions it has encountered. This means subsequent requests in the same project get more accurate over time without needing to re-read context files.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/12/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users, and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a comprehensive skill directory with knowledge files and a &lt;code&gt;.cursorrules&lt;/code&gt; file that Windsurf reads as a fallback.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose &lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; to copy the &lt;code&gt;.cursorrules&lt;/code&gt; file and knowledge directory into your project. Windsurf will pick up the rules file automatically.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a master protocol file and browsable documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in your &lt;code&gt;.windsurfrules&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL validation, read DREMIO_AGENT.md in the dremio-agent-md directory.
Use the sitemaps in dremio_sitemaps/ to verify syntax before generating SQL.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own Windsurf Rules&lt;/h2&gt;
&lt;p&gt;Create a custom &lt;code&gt;.windsurfrules&lt;/code&gt; with your team&apos;s specific Dremio environment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Team Dremio Context

## Table Schemas (updated weekly)
- For table schemas, read ./docs/table-schemas.md
- For SQL conventions, read ./docs/dremio-conventions.md
- For common queries, read ./docs/common-queries.md

## Naming Standards
- Bronze: raw.*, Silver: cleaned.*, Gold: analytics.*
- Always use TIMESTAMP, never DATE
- Validate function names against docs/dremio-conventions.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Export your actual schemas from Dremio and keep them updated. Cascade&apos;s memory system means it learns your patterns over time, but explicit rules ensure consistency from the first interaction.&lt;/p&gt;
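&lt;p&gt;Keeping &lt;code&gt;table-schemas.md&lt;/code&gt; current is easy to script. A minimal sketch, with hypothetical table and column definitions standing in for a real export from Dremio:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Illustrative generator for docs/table-schemas.md. The definitions
# below are hypothetical; in practice you would export them from Dremio
# (for example, from INFORMATION_SCHEMA) on a schedule.
schemas = {
    &apos;analytics.daily_revenue&apos;: [(&apos;day&apos;, &apos;TIMESTAMP&apos;), (&apos;revenue&apos;, &apos;DOUBLE&apos;)],
    &apos;cleaned.orders&apos;: [(&apos;order_id&apos;, &apos;BIGINT&apos;), (&apos;created_at&apos;, &apos;TIMESTAMP&apos;)],
}

lines = [&apos;# Table Schemas&apos;, &apos;&apos;]
for table, columns in sorted(schemas.items()):
    lines.append(f&apos;## {table}&apos;)
    for name, dtype in columns:
        lines.append(f&apos;- {name}: {dtype}&apos;)
    lines.append(&apos;&apos;)

doc = &apos;\n&apos;.join(lines)
&lt;/code&gt;&lt;/pre&gt;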
&lt;h2&gt;Using Dremio with Windsurf: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Cascade can execute complex multi-step data projects autonomously. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Open Cascade and ask questions in plain English:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Break it down by region and show the growth rate compared to the previous quarter.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade uses MCP to discover your tables, writes the SQL, runs it against Dremio, and returns formatted results. Its multi-step nature means it can chain multiple queries together autonomously.&lt;/p&gt;
&lt;p&gt;Follow up immediately:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For products with declining growth, pull the customer reviews and support tickets. Is there a pattern between product issues and revenue decline?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade maintains full context and chains together cross-table queries without prompting. This turns the editor into a data analysis workstation.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Use Cascade for multi-step project generation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio and build a local HTML dashboard with Chart.js. Include monthly revenue trends, top products by region, and customer metrics. Add date range filters, a dark theme, and export buttons. Create separate HTML, CSS, and JavaScript files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade will autonomously:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Call MCP to discover gold-layer views and schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each metric&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;index.html&lt;/code&gt; with the dashboard layout&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;styles.css&lt;/code&gt; with dark theme and responsive design&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;dashboard.js&lt;/code&gt; with Chart.js configurations&lt;/li&gt;
&lt;li&gt;Wire everything together and save to your project&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open &lt;code&gt;index.html&lt;/code&gt; in a browser for a complete interactive dashboard. Cascade&apos;s agentic flow handles the entire process without manual intervention.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build interactive tools in one prompt:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app connected to Dremio via dremioframe. Include a schema browser with table previews, a SQL editor with syntax highlighting, CSV download, and charting for numeric columns. Generate all the files including requirements.txt and a README.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade generates the full application stack and wires the components together. Run &lt;code&gt;streamlit run app.py&lt;/code&gt; for a local data explorer.&lt;/p&gt;
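&lt;p&gt;One piece such an app always needs is the CSV download. A minimal, dependency-free sketch of that helper (the function name is illustrative, not part of dremioframe):&lt;/p&gt;

```python
import csv
import io

def rows_to_csv(rows):
    # Convert a list of row dicts (as returned by a SQL query) into CSV text
    # that a Streamlit download button can serve.
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

&lt;p&gt;In the generated app, this would be wired to &lt;code&gt;st.download_button&lt;/code&gt; with the query results as input.&lt;/p&gt;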
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Medallion Architecture pipeline using dremioframe. Bronze: ingest raw events from S3. Silver: deduplicate, validate required fields, cast timestamps. Gold: aggregate daily metrics and build customer lifetime value calculations. Include structured logging, retry logic, and dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade writes the pipeline code, creates test files, and can execute a dry run to verify the logic against your live Dremio instance.&lt;/p&gt;
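&lt;p&gt;The skeleton of such a pipeline is simple enough to sketch with the standard library. The table names and SQL below are hypothetical placeholders, and the &lt;code&gt;execute&lt;/code&gt; callable stands in for whatever actually runs the SQL (dremioframe, the REST API, or MCP):&lt;/p&gt;

```python
def medallion_statements(source="s3_source.raw_events"):
    # Hypothetical bronze/silver/gold SQL for a Medallion pipeline.
    bronze = ("CREATE TABLE IF NOT EXISTS bronze.events AS "
              "SELECT * FROM " + source)
    silver = ("CREATE OR REPLACE VIEW silver.events AS "
              "SELECT DISTINCT event_id, "
              "CAST(event_ts AS TIMESTAMP) AS event_ts, user_id "
              "FROM bronze.events WHERE event_id IS NOT NULL")
    gold = ("CREATE OR REPLACE VIEW gold.daily_metrics AS "
            "SELECT CAST(event_ts AS DATE) AS day, COUNT(*) AS events "
            "FROM silver.events GROUP BY CAST(event_ts AS DATE)")
    return [bronze, silver, gold]

def run_pipeline(execute, dry_run=True):
    # In dry-run mode, return the SQL without executing anything.
    statements = medallion_statements()
    if not dry_run:
        for statement in statements:
            execute(statement)
    return statements
```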
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI app that queries Dremio gold-layer views via dremioframe. Add endpoints for customer segments, revenue analytics, and cohort retention. Include Pydantic models, caching, and OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cascade generates the complete API project. Run &lt;code&gt;uvicorn main:app --reload&lt;/code&gt; for a local API connected to your lakehouse.&lt;/p&gt;
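&lt;p&gt;The caching mentioned in the prompt can be as simple as a time-to-live decorator on each query function. A standard-library sketch of that pattern (a real project might use a library such as cachetools instead):&lt;/p&gt;

```python
import time

def ttl_cache(ttl_seconds=300):
    # Cache each argument tuple's result for ttl_seconds before recomputing.
    def decorator(fn):
        store = {}
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is None or now - hit[0] > ttl_seconds:
                store[args] = (now, fn(*args))
            return store[args][1]
        wrapper.cache = store
        return wrapper
    return decorator
```

&lt;p&gt;Decorating a query function with &lt;code&gt;@ttl_cache()&lt;/code&gt; keeps repeated API hits from re-querying Dremio within the window.&lt;/p&gt;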
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windsurf Rules&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, persistent AI instructions&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Rules&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, patterns, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for immediate value. Add a &lt;code&gt;.windsurfrules&lt;/code&gt; file for Dremio conventions. Let Cascade&apos;s memory build on your patterns over time.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in Windsurf&apos;s &lt;strong&gt;Cascade &amp;gt; MCP&lt;/strong&gt; settings or &lt;code&gt;mcp_config.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and run &lt;code&gt;./install.sh&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Open Cascade and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Cascade accurate data context, and Cascade&apos;s multi-step autonomous flows turn that context into complete data projects.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with OpenWork: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-openwork/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-openwork/</guid><description>
OpenWork is an open-source desktop AI agent built on the OpenCode engine. It runs entirely on your machine with your own API keys, giving you full co...</description><pubDate>Thu, 05 Mar 2026 19:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenWork is an open-source desktop AI agent built on the OpenCode engine. It runs entirely on your machine with your own API keys, giving you full control over your data and your AI costs. Dremio is a unified lakehouse platform built on open standards like Apache Iceberg, Apache Arrow, and Apache Polaris.&lt;/p&gt;
&lt;p&gt;Both tools share a local-first philosophy. Dremio stores data in open formats with no vendor lock-in. OpenWork runs on your hardware with no cloud dependency for the agent itself. Connecting them creates an open-source analytics stack where your coding agent queries your lakehouse without sending data through third-party services.&lt;/p&gt;
&lt;p&gt;OpenWork inherits OpenCode&apos;s &lt;code&gt;AGENTS.md&lt;/code&gt; support, &lt;code&gt;opencode.json&lt;/code&gt; configuration, and MCP integration. If you have already written Dremio context files for OpenCode or OpenAI Codex, they work in OpenWork without modification. The desktop application adds a graphical interface, integrated file browser, and agent chat panel on top of the terminal experience.&lt;/p&gt;
&lt;p&gt;The local-first model has specific advantages for data work. Your Dremio queries and results stay on your machine. Your API keys are stored locally. The agent code runs in your environment. For teams that handle sensitive data or operate under compliance constraints, this architecture keeps the AI agent within your security perimeter.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, from a five-minute MCP connection to a fully custom Dremio configuration.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/11/openwork-dremio-architecture.png&quot; alt=&quot;OpenWork desktop AI assistant connecting to Dremio Agentic Lakehouse&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up OpenWork&lt;/h2&gt;
&lt;p&gt;If you do not already have OpenWork installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download OpenWork&lt;/strong&gt; from &lt;a href=&quot;https://openwork.software&quot;&gt;openwork.software&lt;/a&gt; (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by following the platform-specific instructions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure your AI model&lt;/strong&gt; by adding your API key (OpenAI, Anthropic, or another supported provider) in the application settings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by selecting your project directory in the OpenWork file browser.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;OpenWork is built on the OpenCode engine but provides a desktop GUI with an integrated file browser, agent chat panel, and visual output display. It runs entirely on your machine with your own API keys, giving you full control over costs and data privacy.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project includes a built-in MCP server. OpenWork supports MCP through its inherited &lt;code&gt;opencode.json&lt;/code&gt; configuration.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For OpenWork, you configure the MCP connection through &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;h3&gt;Find Your MCP Endpoint and Set Up OAuth&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and find your MCP URL under &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create a new application with an appropriate redirect URI.&lt;/li&gt;
&lt;li&gt;Copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure OpenWork&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Add the Dremio server to your &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Place this at your project root or globally at &lt;code&gt;~/.config/opencode/opencode.json&lt;/code&gt;. After configuration, OpenWork can call Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column details and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns JSON results.&lt;/li&gt;
&lt;/ul&gt;
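&lt;p&gt;Under the hood, MCP tool calls are JSON-RPC 2.0 messages. You never write these by hand (OpenWork sends them for you), but seeing the wire format clarifies what the configuration enables. The argument name below is illustrative, not the documented schema of the Dremio tool:&lt;/p&gt;

```python
import json

def mcp_tool_call(tool_name, arguments, request_id=1):
    # JSON-RPC 2.0 "tools/call" request, per the MCP specification.
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

message = mcp_tool_call("RunSqlQuery", {"query": "SELECT 1"})
print(json.dumps(message))
```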
&lt;h3&gt;Self-Hosted MCP&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Configure OpenWork to run the local server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; (data exploration, default), &lt;code&gt;FOR_SELF&lt;/code&gt; (system introspection for diagnosing performance), and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; (metrics correlation). The local-first nature of OpenWork pairs well with the self-hosted MCP option, as both components run entirely on your infrastructure.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use AGENTS.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;OpenWork inherits &lt;code&gt;AGENTS.md&lt;/code&gt; support from OpenCode. The same file works in OpenWork, OpenCode, and OpenAI Codex.&lt;/p&gt;
&lt;h3&gt;Writing a Dremio-Focused AGENTS.md&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;AGENTS.md&lt;/code&gt; in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Agent Configuration

## Dremio Lakehouse

This project uses Dremio Cloud as its lakehouse.

### SQL Conventions
- Use `CREATE FOLDER IF NOT EXISTS` (not CREATE NAMESPACE)
- Open Catalog tables: `folder.subfolder.table_name` (no catalog prefix)
- External sources: `source_name.schema.table_name`
- Cast DATE to TIMESTAMP for join consistency
- Use TIMESTAMPDIFF for duration calculations

### Credentials
- PAT: env var `DREMIO_PAT`
- Endpoint: env var `DREMIO_URI`
- Never hardcode credentials

### References
- SQL reference: https://docs.dremio.com/current/reference/sql/
- REST API: https://docs.dremio.com/current/reference/api/
- Local SQL docs: ./docs/dremio-sql-reference.md

### Terminology
- &amp;quot;Agentic Lakehouse&amp;quot; not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; built on Apache Polaris
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;OpenWork auto-scans this file at project start. Global defaults go in &lt;code&gt;~/.config/opencode/AGENTS.md&lt;/code&gt; and project-level files override them.&lt;/p&gt;
&lt;h3&gt;Cross-Tool Portability&lt;/h3&gt;
&lt;p&gt;The AGENTS.md you write for OpenWork works identically in OpenCode and OpenAI Codex. If your team uses multiple tools, you maintain one Dremio configuration file instead of separate context files for each tool.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/11/four-integration-approaches.png&quot; alt=&quot;Four integration approaches for connecting AI tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides a complete skill directory with &lt;code&gt;SKILL.md&lt;/code&gt;, knowledge files (CLI, Python SDK, SQL, REST API), and &lt;code&gt;AGENTS.md&lt;/code&gt; in the &lt;code&gt;rules/&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;For OpenWork, copy the AGENTS.md:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cp dremio-agent-skill/dremio-skill/rules/AGENTS.md ./AGENTS.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or run the full installer for broader integration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd dremio-agent-skill &amp;amp;&amp;amp; ./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; and documentation sitemaps. Clone it alongside your project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tell OpenWork: &amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory and use the sitemaps to validate SQL.&amp;quot; OpenWork&apos;s desktop interface makes it easy to have the agent-md folder open in the file browser while working on your project.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build a Custom Dremio Configuration&lt;/h2&gt;
&lt;h3&gt;Custom AGENTS.md with Knowledge Files&lt;/h3&gt;
&lt;p&gt;Create a project structure with reference docs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;project-root/
  AGENTS.md
  docs/
    dremio-sql-reference.md
    team-schemas.md
    dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference the docs in your AGENTS.md so OpenWork reads them on demand. Populate with your actual table schemas exported from Dremio, team-specific SQL patterns, and dremioframe code snippets.&lt;/p&gt;
&lt;h3&gt;Custom Agents&lt;/h3&gt;
&lt;p&gt;OpenWork inherits OpenCode&apos;s custom agent system. Create dedicated Dremio agents in &lt;code&gt;.opencode/agents/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# .opencode/agents/dremio-analyst.md
---
description: Dremio data analyst agent
mode: subagent
---

You are a data analyst working with Dremio Cloud.
1. Use the MCP connection to explore tables
2. Follow Dremio SQL conventions (CREATE FOLDER IF NOT EXISTS, etc.)
3. Validate function names against the SQL reference
4. Never hardcode credentials
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This subagent uses a separate model and context window dedicated to Dremio tasks, producing higher-quality SQL than a general-purpose agent.&lt;/p&gt;
&lt;h2&gt;Using Dremio with OpenWork: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, OpenWork can generate complete data applications. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Type a question in the OpenWork chat panel and get answers from your lakehouse:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What is the average order value by customer segment for Q4? Which segment grew the fastest compared to Q3?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork queries Dremio through MCP, computes the comparison, and returns a formatted answer with the SQL it ran. This turns your desktop agent into a local, private data analyst that works with production data.&lt;/p&gt;
&lt;p&gt;Follow up with deeper analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the fastest-growing segment, show the top 10 customers by order frequency. Are they new customers or returning? Pull their first order date and total lifetime value.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork maintains context from the previous question and writes progressively more complex queries. Because everything runs locally, your data never leaves your machine.&lt;/p&gt;
&lt;p&gt;This pattern is especially powerful for teams with data sovereignty requirements. The AI model processes your prompt, but the data stays on your infrastructure.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask OpenWork to create a self-contained dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer views in Dremio for monthly revenue, active users, and churn rate over the last 12 months. Build an HTML dashboard with Plotly.js charts. Include filters for region and product line. Add a dark theme and export-to-PNG buttons.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each metric&lt;/li&gt;
&lt;li&gt;Generate an HTML file with Plotly.js interactive charts&lt;/li&gt;
&lt;li&gt;Add dropdown filters for region and product line&lt;/li&gt;
&lt;li&gt;Include export functionality and responsive layout&lt;/li&gt;
&lt;li&gt;Save everything to your project folder&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open it in a browser for a fully interactive dashboard running from a local file. No server required. The Plotly.js charts support zoom, pan, and hover tooltips.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build a more sophisticated tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app that connects to Dremio using dremioframe. Add a sidebar for selecting schemas and tables, a schema viewer, a data preview with pagination, a custom SQL query editor with results displayed as a table, and CSV download buttons.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork writes the full Python application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app.py&lt;/code&gt; with Streamlit layout, dremioframe connection, and query execution&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements.txt&lt;/code&gt; with pinned dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env.example&lt;/code&gt; with required environment variables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt; with setup instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;streamlit run app.py&lt;/code&gt; and you have a local data exploration tool connected to your lakehouse. Since both OpenWork and the app run on your machine, your data never leaves your infrastructure.&lt;/p&gt;
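&lt;p&gt;The &lt;code&gt;.env.example&lt;/code&gt; pattern works because the app loads those variables at startup. Generated apps typically use the python-dotenv package for this; a minimal standard-library version shows the idea:&lt;/p&gt;

```python
import os

def load_env(path=".env"):
    # Parse KEY=VALUE lines, skipping blanks and comments, and export them.
    loaded = {}
    try:
        lines = open(path).read().splitlines()
    except FileNotFoundError:
        return loaded
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip()
        os.environ.setdefault(key.strip(), value.strip())
    return loaded
```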
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate your ETL workflows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Python script using dremioframe that reads raw CSV data from S3, creates a bronze table in Dremio, builds silver views with data quality rules (null checks, type validation, deduplication), and creates a gold view with business logic aggregations. Include error handling, logging, and a dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork uses the Dremio skill knowledge to write pipeline code that follows your team&apos;s Medallion Architecture conventions. The script includes structured logging, retry logic, and a summary report at the end.&lt;/p&gt;
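&lt;p&gt;The retry logic is worth sketching, since transient network failures are the most common pipeline interruption. A standard-library version with exponential backoff (the injectable &lt;code&gt;sleep&lt;/code&gt; parameter is a testability convenience, not part of any Dremio API):&lt;/p&gt;

```python
import time

def with_retry(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    # Run fn, retrying on exception with delays of 1 s, 2 s, 4 s, ...
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```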
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Serve lakehouse data to other applications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that connects to Dremio using dremioframe. Create endpoints for device metrics, alert summaries, and historical trends. Include request validation, response caching with a 5-minute TTL, and auto-generated API docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenWork generates the complete server with proper error handling and connection management. Run &lt;code&gt;uvicorn main:app --reload&lt;/code&gt; for a local API connected to your lakehouse.&lt;/p&gt;
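&lt;p&gt;One detail every endpoint over large lakehouse tables needs is pagination, so a single request never pulls an unbounded result set. A small helper for building the paginated SQL (the clamping limits are arbitrary defaults, not Dremio constraints):&lt;/p&gt;

```python
def paginate_sql(base_sql, page=1, page_size=100):
    # Append LIMIT/OFFSET for a 1-based page number, clamping inputs.
    page = max(1, int(page))
    page_size = max(1, min(int(page_size), 1000))
    offset = (page - 1) * page_size
    return base_sql + " LIMIT " + str(page_size) + " OFFSET " + str(offset)
```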
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog&lt;/td&gt;
&lt;td&gt;Natural-language exploration, building apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, cross-tool portable&lt;/td&gt;
&lt;td&gt;Multi-tool teams&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Broad Dremio knowledge&lt;/td&gt;
&lt;td&gt;Quick start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Config&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored agents, schemas, patterns&lt;/td&gt;
&lt;td&gt;Advanced multi-agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;OpenWork&apos;s advantage is the local-first model. Your agent, your API keys, and your data connections all run on your machine. Combined with Dremio&apos;s open lakehouse formats, you get a fully controlled analytics stack.&lt;/p&gt;
&lt;p&gt;Start with the MCP server for immediate access to your data. Layer in AGENTS.md for conventions and custom agents for specialized Dremio workflows. If your team already uses OpenCode or Codex, your existing AGENTS.md and MCP configuration work in OpenWork immediately.&lt;/p&gt;
&lt;p&gt;The local-first model means you can evaluate OpenWork with Dremio without any organizational approval process. Install it on your machine, connect it to your Dremio Cloud project, and start querying. If it works for you, share the &lt;code&gt;AGENTS.md&lt;/code&gt; and &lt;code&gt;opencode.json&lt;/code&gt; files with your team so they can replicate the same setup on their machines.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to your &lt;code&gt;opencode.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and copy the AGENTS.md.&lt;/li&gt;
&lt;li&gt;Ask OpenWork to explore your catalog and build a local dashboard from your data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse provides what OpenWork needs: the semantic layer for business context, query federation for universal data access, and Reflections for interactive speed. Both platforms embrace open standards and local-first operation, making them a natural fit for teams that prioritize data sovereignty and transparency.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with OpenCode: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-opencode/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-opencode/</guid><description>
OpenCode is an open-source, terminal-based AI coding agent released under the MIT license. It provides a TUI with split panes, uses the Language Serv...</description><pubDate>Thu, 05 Mar 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenCode is an open-source, terminal-based AI coding agent released under the MIT license. It provides a TUI with split panes, uses the Language Server Protocol (LSP) for deep codebase understanding, and maintains persistent project context through file-based memory. Dremio is a unified lakehouse platform built on open standards like Apache Iceberg, Apache Arrow, and Apache Polaris.&lt;/p&gt;
&lt;p&gt;The open-source philosophy aligns. Dremio stores data in open formats with no vendor lock-in. OpenCode gives you full control over your AI coding agent with no proprietary restrictions. Connecting them means your open-source agent can query an open lakehouse, validate SQL against real schemas, and generate scripts using your team&apos;s actual conventions.&lt;/p&gt;
&lt;p&gt;OpenCode uses the same &lt;code&gt;AGENTS.md&lt;/code&gt; standard as OpenAI Codex, so the Dremio context files you write work across both tools. It also supports custom agents with dedicated prompts and model configurations, which opens up a Dremio-specific agent pattern that other tools do not offer. You can create a dedicated data analyst subagent that uses a reasoning model for SQL generation while your primary agent uses a faster model for application code.&lt;/p&gt;
&lt;p&gt;OpenCode&apos;s LSP integration gives it another advantage. The agent analyzes imports, dependencies, and file structure at the language level. When you combine this with Dremio&apos;s MCP server, the agent understands both your code structure and your data structure simultaneously.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/10/opencode-dremio-architecture.png&quot; alt=&quot;OpenCode TUI connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up OpenCode&lt;/h2&gt;
&lt;p&gt;If you do not already have OpenCode installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install Go&lt;/strong&gt; (version 1.23 or later) from &lt;a href=&quot;https://go.dev/dl/&quot;&gt;go.dev&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install OpenCode&lt;/strong&gt;:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;go install github.com/opencode-ai/opencode@latest
&lt;/code&gt;&lt;/pre&gt;
Or use Homebrew: &lt;code&gt;brew install opencode&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure your AI model&lt;/strong&gt; by setting the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;, &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;, or other model provider key in your environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Launch OpenCode&lt;/strong&gt; by running &lt;code&gt;opencode&lt;/code&gt; in your terminal from any project directory.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;OpenCode provides a TUI with split panes, LSP-powered code understanding, and a multi-agent architecture that lets you define specialized subagents for different tasks. It is open-source under the MIT license.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project includes a built-in MCP server. OpenCode supports MCP natively through its &lt;code&gt;opencode.json&lt;/code&gt; configuration.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For OpenCode, you configure the MCP connection through &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Go to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt; and copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and name it (e.g., &amp;quot;OpenCode MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URI for your setup.&lt;/li&gt;
&lt;li&gt;Copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure OpenCode&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Add the Dremio MCP server to your &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For global configuration, place it in &lt;code&gt;~/.config/opencode/opencode.json&lt;/code&gt;. For project-specific config, place it at the project root.&lt;/p&gt;
&lt;p&gt;After configuring, OpenCode can call Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; lists tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows data lineage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns JSON results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then configure OpenCode to run the local server in &lt;code&gt;opencode.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server supports &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; (query and explore), &lt;code&gt;FOR_SELF&lt;/code&gt; (system introspection), and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; (metrics). Most coding workflows use &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt;, the default mode. It gives the agent full access to explore your catalog, read schemas, pull wiki descriptions, and run SQL queries.&lt;/p&gt;
&lt;p&gt;If your team also handles Dremio administration, &lt;code&gt;FOR_SELF&lt;/code&gt; mode lets the agent analyze job history, resource utilization, and query performance. This is useful for platform engineering tasks where you need the agent to diagnose slow queries or suggest Reflection configurations. &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; connects to your monitoring stack for correlating Dremio metrics with broader system observability.&lt;/p&gt;
&lt;p&gt;For Dremio Cloud users, the hosted MCP server is the simpler option. No local installation, OAuth-based auth, and your existing access controls apply automatically. The self-hosted server gives more control and works with on-premise Dremio Software deployments.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use AGENTS.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;OpenCode shares the &lt;code&gt;AGENTS.md&lt;/code&gt; standard with OpenAI Codex. It auto-scans for this file at project start and uses it to guide agent behavior.&lt;/p&gt;
&lt;h3&gt;AGENTS.md Placement&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Project root:&lt;/strong&gt; &lt;code&gt;AGENTS.md&lt;/code&gt; applies to the current project.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Global:&lt;/strong&gt; &lt;code&gt;~/.config/opencode/AGENTS.md&lt;/code&gt; applies across all projects.&lt;/li&gt;
&lt;li&gt;Project-level files override global defaults.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Writing a Dremio-Focused AGENTS.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Agent Configuration

## Dremio Lakehouse

This project uses Dremio Cloud as its lakehouse.

### SQL Conventions
- Use `CREATE FOLDER IF NOT EXISTS` for namespace creation
- Open Catalog tables: `folder.subfolder.table_name` (no catalog prefix)
- External sources: `source_name.schema.table_name`
- Cast DATE to TIMESTAMP for join consistency
- Use TIMESTAMPDIFF for duration calculations

### Credentials
- PAT: env var `DREMIO_PAT`
- Endpoint: env var `DREMIO_URI`
- Never hardcode credentials

### References
- SQL syntax: https://docs.dremio.com/current/reference/sql/
- REST API: https://docs.dremio.com/current/reference/api/
- Local SQL reference: ./docs/dremio-sql-reference.md

### Terminology
- &amp;quot;Agentic Lakehouse&amp;quot; not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; built on Apache Polaris
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run &lt;code&gt;/init&lt;/code&gt; inside OpenCode to generate a starter &lt;code&gt;AGENTS.md&lt;/code&gt; from a project scan, then add the Dremio sections above.&lt;/p&gt;
&lt;h3&gt;Custom Agents for Dremio-Specific Workflows&lt;/h3&gt;
&lt;p&gt;OpenCode supports defining custom agents in &lt;code&gt;.opencode/agents/&lt;/code&gt;. This is a capability that most other tools lack. You can create a dedicated Dremio agent with its own system prompt, model choice, and tool permissions.&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;.opencode/agents/dremio-analyst.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Dremio data analyst agent
mode: subagent
---

You are a data analyst working with Dremio Cloud. Your job is to:
1. Explore available tables using the MCP connection
2. Write SQL queries that follow Dremio conventions
3. Use TIMESTAMPDIFF, not DATEDIFF
4. Use CREATE FOLDER IF NOT EXISTS, not CREATE SCHEMA
5. Always validate function names against the SQL reference before using them
6. Never hardcode credentials; use environment variables
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This agent runs as a subagent that the primary agent can invoke for Dremio-specific tasks. You can configure it with a different model (for example, a reasoning model optimized for SQL generation) and restrict its tool access to only the Dremio MCP server.&lt;/p&gt;
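&lt;p&gt;A sketch of what that might look like in the agent&apos;s frontmatter. The &lt;code&gt;model&lt;/code&gt; and &lt;code&gt;tools&lt;/code&gt; keys reflect OpenCode&apos;s agent configuration format and the model ID is a placeholder, so verify the exact field names against the current OpenCode documentation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Dremio data analyst agent
mode: subagent
model: provider/model-id   # placeholder: a reasoning-capable model for SQL
tools:
  write: false             # no file edits from this subagent
  bash: false              # no shell access; the Dremio MCP tools remain available
---
&lt;/code&gt;&lt;/pre&gt;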
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/10/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users, and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a complete agent skill with &lt;code&gt;SKILL.md&lt;/code&gt;, knowledge files, and an &lt;code&gt;AGENTS.md&lt;/code&gt; in the &lt;code&gt;rules/&lt;/code&gt; directory.&lt;/p&gt;
&lt;p&gt;For OpenCode, copy the AGENTS.md from the skill to your project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cp dremio-agent-skill/dremio-skill/rules/AGENTS.md ./AGENTS.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or run the full installer for broader integration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The skill includes knowledge files covering Dremio CLI, Python SDK (dremioframe), SQL syntax, and REST API endpoints.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; protocol file and hierarchical documentation sitemaps.&lt;/p&gt;
&lt;p&gt;Clone it and tell OpenCode to read the protocol:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Instruct OpenCode: &amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory. Use the sitemaps to verify SQL syntax before generating code.&amp;quot;&lt;/p&gt;
&lt;p&gt;This is especially powerful with OpenCode&apos;s LSP-based context engine. The agent can cross-reference the Dremio sitemaps with your actual project imports and file structure, ensuring that the SQL it generates fits both the Dremio dialect and your project conventions.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build a Custom Dremio Agent&lt;/h2&gt;
&lt;p&gt;OpenCode&apos;s custom agent system is its differentiator for Dremio integration. While other tools limit you to context files, OpenCode lets you define a purpose-built Dremio agent.&lt;/p&gt;
&lt;h3&gt;Multi-Agent Architecture&lt;/h3&gt;
&lt;p&gt;Create a primary coding agent plus a Dremio-focused subagent:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.opencode/agents/
  dremio-analyst.md       # Subagent for SQL and data queries
  dremio-pipeline.md      # Subagent for ETL/pipeline scripts
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each agent gets its own system prompt, model configuration, and tool permissions. The primary agent delegates Dremio tasks to the appropriate subagent, which has the full Dremio context loaded while keeping the primary agent&apos;s context window focused on application code.&lt;/p&gt;
&lt;p&gt;This separation matters for large projects. A data pipeline subagent can be configured with a reasoning-capable model that excels at complex SQL generation, while your primary coding agent uses a faster model for application logic. The Dremio subagent&apos;s tool permissions can be restricted to only the Dremio MCP server, preventing it from accidentally modifying application files.&lt;/p&gt;
&lt;h3&gt;Knowledge Files&lt;/h3&gt;
&lt;p&gt;Pair your custom agents with reference documentation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docs/
  dremio-sql-reference.md
  team-schemas.md
  dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference these in both your &lt;code&gt;AGENTS.md&lt;/code&gt; and your custom agent prompts. OpenCode&apos;s file-based memory system ensures the agent retains context from these references across interactions. Export your actual table schemas from Dremio&apos;s catalog and save them as markdown. Include dremioframe code snippets for common operations like querying, creating views, and managing branches. Add REST API call patterns for your CI/CD pipelines.&lt;/p&gt;
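&lt;p&gt;The schema export can be scripted rather than done by hand. The sketch below assumes a Dremio Software deployment and uses the v3 catalog REST API with only the Python standard library; the table path is hypothetical, and the shape of the &lt;code&gt;fields&lt;/code&gt; payload should be checked against your Dremio version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json
import os
import urllib.request

def schema_to_markdown(table_path, fields):
    # Render one table&apos;s columns as a markdown section for docs/team-schemas.md
    lines = [f&amp;quot;## {table_path}&amp;quot;, &amp;quot;&amp;quot;, &amp;quot;| Column | Type |&amp;quot;, &amp;quot;| --- | --- |&amp;quot;]
    for field in fields:
        lines.append(f&amp;quot;| {field[&apos;name&apos;]} | {field[&apos;type&apos;][&apos;name&apos;]} |&amp;quot;)
    return &amp;quot;\n&amp;quot;.join(lines)

def fetch_dataset(base_uri, pat, path_segments):
    # Look up a dataset by path via the v3 catalog API; auth uses a PAT
    url = base_uri + &amp;quot;/api/v3/catalog/by-path/&amp;quot; + &amp;quot;/&amp;quot;.join(path_segments)
    req = urllib.request.Request(url, headers={&amp;quot;Authorization&amp;quot;: f&amp;quot;Bearer {pat}&amp;quot;})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (hypothetical path):
#   dataset = fetch_dataset(os.environ[&amp;quot;DREMIO_URI&amp;quot;], os.environ[&amp;quot;DREMIO_PAT&amp;quot;],
#                           [&amp;quot;etl_pipeline&amp;quot;, &amp;quot;gold&amp;quot;, &amp;quot;daily_metrics&amp;quot;])
#   print(schema_to_markdown(&amp;quot;etl_pipeline.gold.daily_metrics&amp;quot;, dataset[&amp;quot;fields&amp;quot;]))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run it for each gold-layer view and append the output to &lt;code&gt;docs/team-schemas.md&lt;/code&gt;.&lt;/p&gt;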
&lt;h2&gt;Using Dremio with OpenCode: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, OpenCode&apos;s multi-agent architecture enables sophisticated data workflows. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Type a question in OpenCode&apos;s TUI and get answers from your lakehouse:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Which product categories have the highest return rates? Cross-reference with customer satisfaction scores and identify correlations.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenCode routes this to the Dremio subagent, which discovers the relevant tables via MCP, writes a multi-table join with aggregations, runs it against Dremio, and returns analysis with the underlying SQL. The primary agent stays focused on your code context while the Dremio subagent handles the data work.&lt;/p&gt;
&lt;p&gt;Dig deeper with follow-up analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the categories with highest returns, pull the top reasons from the returns table. Group by product SKU and show which specific items are driving the category-level numbers.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Dremio subagent already knows the schema context from the previous query and generates the follow-up efficiently.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask OpenCode to create a visualization:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query Dremio&apos;s gold-layer views for inventory levels, reorder rates, and supplier lead times. Build a local HTML dashboard with ECharts showing stock trends, a forecasting chart, and supplier performance scorecards. Include a responsive layout and dark theme.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenCode&apos;s multi-agent system handles this: the Dremio subagent writes and executes the SQL queries, while the primary agent generates the HTML/CSS/JavaScript:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Dremio subagent discovers inventory views and pulls data&lt;/li&gt;
&lt;li&gt;Primary agent generates the HTML structure with ECharts&lt;/li&gt;
&lt;li&gt;Data is embedded as JSON in the generated file&lt;/li&gt;
&lt;li&gt;Interactive filters for warehouse, category, and date range&lt;/li&gt;
&lt;li&gt;Responsive layout that works on desktop and tablet&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open it in a browser for an interactive supply chain dashboard running from a local file.&lt;/p&gt;
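&lt;p&gt;Step 3 above, embedding the query results as JSON, can be sketched with the standard library alone. The template is deliberately minimal; the real generated file would also load ECharts and define chart options:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json
from string import Template

# Minimal page shell; ECharts setup and chart configs would go alongside DATA
PAGE = Template(&amp;quot;&amp;quot;&amp;quot;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;&amp;lt;body&amp;gt;
&amp;lt;script&amp;gt;const DATA = $payload;&amp;lt;/script&amp;gt;
&amp;lt;div id=&amp;quot;chart&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;&amp;quot;&amp;quot;&amp;quot;)

def render_dashboard(rows):
    # Embed query results as a JSON literal so the page works from a local file
    return PAGE.substitute(payload=json.dumps(rows))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the data is inlined at generation time, the resulting file needs no server and no live Dremio connection to display.&lt;/p&gt;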
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build a full application leveraging the multi-agent architecture:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Python Dash app that uses dremioframe to connect to Dremio. Include a catalog browser, table schema viewer with column statistics, and a multi-tab interface for SQL queries, data profiling, and anomaly detection. Add a connection settings page.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenCode delegates the work: the Dremio subagent writes the dremioframe connection code and SQL queries, while the primary agent builds the Dash UI components and layout:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multi-tab interface with catalog browser, schema viewer, SQL editor, and profiler&lt;/li&gt;
&lt;li&gt;Column statistics calculated from Dremio metadata&lt;/li&gt;
&lt;li&gt;Anomaly detection using basic IQR analysis on numeric columns&lt;/li&gt;
&lt;li&gt;Connection settings stored in &lt;code&gt;.env&lt;/code&gt; with a settings page for updates&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;python app.py&lt;/code&gt; and your team has a local data platform connected to the lakehouse.&lt;/p&gt;
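&lt;p&gt;The &amp;quot;basic IQR analysis&amp;quot; mentioned above can be illustrated with a small pure-Python helper; this is a sketch, and the generated app would apply it per numeric column on data pulled from Dremio:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from statistics import quantiles

def iqr_outliers(values, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the standard Tukey fence
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v &lt; lo or v &gt; hi]

# iqr_outliers([10, 11, 12, 11, 10, 12, 11, 200]) flags only the 200 reading
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With &lt;code&gt;k=1.5&lt;/code&gt; this is the conventional outlier fence; the app can surface the flagged rows next to each column&apos;s statistics.&lt;/p&gt;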
&lt;h3&gt;Generate Pipeline Scripts with Agent Collaboration&lt;/h3&gt;
&lt;p&gt;Automate data engineering with Dremio-aware code:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Using the Dremio skill, create a Python ETL pipeline that processes IoT sensor data. Create bronze tables for raw readings, silver views that apply calibration offsets and flag anomalies (readings outside 3 standard deviations), and gold views that aggregate by device and time window. Include retry logic and structured logging.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Dremio subagent writes the pipeline code using correct Dremio SQL conventions and bronze-silver-gold patterns, while the primary agent handles file management, error handling, and test generation. The result is production-quality code with proper separation of concerns.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Serve lakehouse data to downstream applications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI service with endpoints for IoT device metrics, alert summaries, and historical trends. Connect to Dremio using dremioframe. Add WebSocket support for real-time data streaming and Pydantic response models.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenCode generates the full API server with the Dremio subagent handling query logic and the primary agent building the FastAPI framework.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog&lt;/td&gt;
&lt;td&gt;Data analysis, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, portable config&lt;/td&gt;
&lt;td&gt;Cross-tool consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Broad Dremio knowledge&lt;/td&gt;
&lt;td&gt;Quick start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Agent&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Dedicated Dremio subagent with own model/prompt&lt;/td&gt;
&lt;td&gt;Advanced multi-agent workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;OpenCode&apos;s custom agent system makes the fourth approach more powerful than in other tools. A dedicated Dremio subagent with its own reasoning model and restricted tool access produces higher-quality SQL than a general-purpose agent trying to handle both application code and data queries in the same context.&lt;/p&gt;
&lt;p&gt;Combine the MCP server for live data access, a custom Dremio agent for SQL generation, and an &lt;code&gt;AGENTS.md&lt;/code&gt; for project-wide conventions. This three-layer stack gives you the strongest Dremio integration available in any open-source coding tool.&lt;/p&gt;
&lt;p&gt;If you are coming from Claude Code or Codex and want an open-source alternative, start with the AGENTS.md approach since your existing file works directly in OpenCode. Add the MCP connection for live data, then explore custom agents to see if the multi-agent architecture improves your workflow.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to your &lt;code&gt;opencode.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and copy the AGENTS.md.&lt;/li&gt;
&lt;li&gt;Start OpenCode and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse provides what OpenCode&apos;s agents need for accurate analytics: the semantic layer delivers business context, query federation delivers universal data access, and Reflections deliver interactive speed. Both platforms are built on open standards, and connecting them gives you an open-source analytics stack from agent to lakehouse.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or take the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with OpenAI Codex CLI: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-openai-codex/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-openai-codex/</guid><description>
OpenAI Codex CLI is a terminal-based coding agent built in Rust. It reads your codebase, writes files, executes commands, and supports MCP for connec...</description><pubDate>Thu, 05 Mar 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;OpenAI Codex CLI is a terminal-based coding agent built in Rust. It reads your codebase, writes files, executes commands, and supports MCP for connecting to external data services. Dremio is a unified lakehouse platform that provides the business context, universal data access, and query speed that coding agents need to produce accurate, working analytics code.&lt;/p&gt;
&lt;p&gt;Codex uses &lt;code&gt;AGENTS.md&lt;/code&gt; as its primary context file. This is an open standard designed to work across multiple AI tools, so the Dremio configuration you write for Codex also works with other AGENTS.md-compatible tools. That portability matters if your team uses different agents.&lt;/p&gt;
&lt;p&gt;Without a Dremio connection, Codex treats your lakehouse like any generic database. It may guess at table names, hallucinate SQL functions, or ignore your team&apos;s naming conventions. With a proper connection, Codex knows your schema, your business logic encoded in virtual views, and the right Dremio SQL dialect.&lt;/p&gt;
&lt;p&gt;Because AGENTS.md is tool-agnostic, you write it once for Codex, and team members using OpenCode, OpenWork, or any other AGENTS.md-compatible tool get the same context without maintaining separate files.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable. Start with the one that matches your current needs, and layer in the others as your Dremio usage grows.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/09/codex-dremio-mcp-architecture.png&quot; alt=&quot;OpenAI Codex CLI connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up OpenAI Codex CLI&lt;/h2&gt;
&lt;p&gt;If you do not already have Codex CLI installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install Node.js&lt;/strong&gt; (version 22 or later) from &lt;a href=&quot;https://nodejs.org/&quot;&gt;nodejs.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install Codex&lt;/strong&gt; globally via npm:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;npm install -g @openai/codex
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Launch Codex&lt;/strong&gt; by running &lt;code&gt;codex&lt;/code&gt; in your terminal from any project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authenticate&lt;/strong&gt; with your OpenAI API key. Codex uses the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Codex runs in your terminal and reads your project files for context. It supports three autonomy modes: &lt;code&gt;suggest&lt;/code&gt; (proposes changes), &lt;code&gt;auto-edit&lt;/code&gt; (applies file edits), and &lt;code&gt;full-auto&lt;/code&gt; (runs commands without confirmation).&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Codex supports MCP natively, making this the fastest way to give the agent direct access to your data.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup including &lt;code&gt;/dremio-setup&lt;/code&gt; for step-by-step configuration. For Codex, you configure the MCP connection through your project settings:&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL from the project overview page.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s hosted MCP server uses OAuth for authentication. Your existing access controls apply to every query Codex runs.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and name it (e.g., &amp;quot;Codex MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URI specific to your Codex client setup.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Codex&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Codex reads MCP configuration from its settings. Add the Dremio server to your MCP configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After configuring, Codex can call Dremio&apos;s MCP tools directly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; lists available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream data dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test it by asking Codex: &amp;quot;What tables are available in Dremio?&amp;quot; The agent will call the appropriate MCP tool and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then configure Codex to run the local server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for data exploration (default), &lt;code&gt;FOR_SELF&lt;/code&gt; for system introspection, and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for metrics correlation.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; mode is what you want for most coding workflows. It enables the agent to explore your catalog, read table schemas, pull wiki descriptions, and run SQL queries. The &lt;code&gt;FOR_SELF&lt;/code&gt; mode is useful for DevOps and platform engineering tasks where you need the agent to analyze Dremio&apos;s own performance metrics, job history, and resource utilization. The &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; mode connects to your monitoring stack for correlating Dremio-specific metrics with broader system observability.&lt;/p&gt;
&lt;p&gt;For Dremio Cloud users, the hosted MCP server is the simpler choice. It requires no local installation, handles authentication through OAuth, and inherits your existing access controls. The self-hosted option gives you more control and works with Dremio Software deployments that are not in the cloud.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use AGENTS.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;AGENTS.md&lt;/code&gt; is an open standard for providing AI coding agents with project context. Codex auto-scans for this file at the start of every task. It defines your project structure, coding conventions, and tool-specific instructions.&lt;/p&gt;
&lt;h3&gt;How AGENTS.md Works in Codex&lt;/h3&gt;
&lt;p&gt;Codex supports layered guidance:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Global defaults:&lt;/strong&gt; &lt;code&gt;~/.codex/AGENTS.md&lt;/code&gt; applies to every project.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project-level:&lt;/strong&gt; &lt;code&gt;AGENTS.md&lt;/code&gt; at the repo root overrides global defaults.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nested overrides:&lt;/strong&gt; &lt;code&gt;AGENTS.override.md&lt;/code&gt; in subdirectories provides directory-specific rules that take precedence over broader ones.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This layering is useful for monorepos where different subdirectories interact with Dremio differently.&lt;/p&gt;
&lt;h3&gt;Writing a Dremio-Focused AGENTS.md&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;AGENTS.md&lt;/code&gt; in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project Agent Configuration

## Dremio Lakehouse

This project uses Dremio Cloud as its lakehouse platform.

### SQL Conventions
- Use `CREATE FOLDER IF NOT EXISTS` for namespace creation
- Tables in the Open Catalog: `folder.subfolder.table_name` (no catalog prefix)
- External sources: `source_name.schema.table_name`
- Cast DATE columns to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

### Credentials
- Personal Access Token: use env var `DREMIO_PAT`
- Cloud endpoint: use env var `DREMIO_URI`
- Never hardcode credentials in scripts

### Documentation References
- Dremio SQL reference: https://docs.dremio.com/current/reference/sql/
- REST API: https://docs.dremio.com/current/reference/api/
- For detailed SQL validation, read ./docs/dremio-sql-reference.md

### Terminology
- Use &amp;quot;Agentic Lakehouse&amp;quot; not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; is built on Apache Polaris
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also run &lt;code&gt;/init&lt;/code&gt; inside Codex to let it scan your project and scaffold an initial &lt;code&gt;AGENTS.md&lt;/code&gt;. Then edit it to add the Dremio-specific sections shown above.&lt;/p&gt;
&lt;h3&gt;Nested Overrides for Multi-Schema Projects&lt;/h3&gt;
&lt;p&gt;If different directories in your project target different Dremio namespaces, use &lt;code&gt;AGENTS.override.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# data-pipeline/AGENTS.override.md

## Dremio Namespace Override
All tables in this directory use the `etl_pipeline` top-level namespace.
Bronze views: etl_pipeline.bronze.*
Silver views: etl_pipeline.silver.*
Gold views: etl_pipeline.gold.*
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This override applies only when Codex is working on files within the &lt;code&gt;data-pipeline/&lt;/code&gt; directory.&lt;/p&gt;
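&lt;p&gt;In a monorepo, the layered files might look like this (directory names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;AGENTS.md                    # project-wide Dremio conventions
data-pipeline/
  AGENTS.override.md         # etl_pipeline namespace rules
analytics-api/
  AGENTS.override.md         # read-only access to gold views
&lt;/code&gt;&lt;/pre&gt;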
&lt;h3&gt;Portability Across Tools&lt;/h3&gt;
&lt;p&gt;One key advantage of AGENTS.md over tool-specific formats: the same file works with OpenCode, OpenWork, and any future tool that adopts the standard. Write it once for Codex and your team members using other AGENTS.md-compatible tools get the same Dremio context without extra setup.&lt;/p&gt;
&lt;p&gt;This portability is especially valuable for teams that are still evaluating which AI coding tool to standardize on. Rather than committing to CLAUDE.md (Claude-only) or SKILL.md (Antigravity-optimized), AGENTS.md gives you a tool-agnostic foundation that carries your Dremio conventions forward regardless of which agent your team picks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/09/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;p&gt;Two community-supported open-source repositories provide ready-made Dremio context for coding agents.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users, and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill: Full Agent Skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a comprehensive skill directory that teaches AI assistants how to interact with Dremio. It includes knowledge files for the CLI, Python SDK (dremioframe), SQL syntax, and REST API.&lt;/p&gt;
&lt;p&gt;For Codex, the skill&apos;s &lt;code&gt;rules/&lt;/code&gt; directory includes an &lt;code&gt;AGENTS.md&lt;/code&gt; file you can copy to your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cp dremio-agent-skill/dremio-skill/rules/AGENTS.md ./AGENTS.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives Codex the Dremio conventions and references without running the full skill installer. For broader integration, run the installer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose &lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; to copy the skill directory into your project, or &lt;strong&gt;Global Install (Symlink)&lt;/strong&gt; for system-wide access.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md: Documentation Protocol (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; master protocol file and a browsable sitemap of the Dremio documentation.&lt;/p&gt;
&lt;p&gt;Clone it alongside your project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then tell Codex to read the protocol file: &amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory. Use the sitemaps in dremio_sitemaps/ to verify Dremio syntax before generating SQL.&amp;quot;&lt;/p&gt;
&lt;p&gt;This is especially useful for SQL validation. The agent navigates the sitemaps to find correct function signatures and reserved words instead of relying on training data.&lt;/p&gt;
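&lt;p&gt;A rough sketch of that lookup, assuming a standard sitemap XML layout (the URLs and the &lt;code&gt;find_doc_pages&lt;/code&gt; helper are illustrative, not part of the repository):&lt;/p&gt;

```python
# Sketch: how an agent might scan a docs sitemap for pages about a SQL function.
# Assumes the standard sitemaps.org XML namespace; URLs are illustrative.
import xml.etree.ElementTree as ET

def find_doc_pages(sitemap_xml: str, keyword: str):
    """Return sitemap page URLs whose path mentions the keyword."""
    root = ET.fromstring(sitemap_xml)
    tag = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"
    urls = [loc.text for loc in root.iter(tag)]
    return [u for u in urls if keyword.lower() in u.lower()]
```

&lt;p&gt;An agent following the protocol would fetch the matching pages and read them before emitting SQL.&lt;/p&gt;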
&lt;h2&gt;Approach 4: Build Your Own Dremio Agent Configuration&lt;/h2&gt;
&lt;p&gt;If the pre-built options do not match your workflow, create a custom configuration.&lt;/p&gt;
&lt;h3&gt;Custom AGENTS.md with Knowledge Files&lt;/h3&gt;
&lt;p&gt;Create a directory structure that pairs your &lt;code&gt;AGENTS.md&lt;/code&gt; with reference documents:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;project-root/
  AGENTS.md
  docs/
    dremio-sql-reference.md
    team-schemas.md
    dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your &lt;code&gt;AGENTS.md&lt;/code&gt;, reference these files so Codex reads them when needed:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Reference Documentation
- For SQL syntax rules, read docs/dremio-sql-reference.md
- For team table schemas, read docs/team-schemas.md
- For Python SDK patterns, read docs/dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Populate the knowledge files with your actual table schemas exported from Dremio, team-specific SQL patterns, and dremioframe code snippets for common operations.&lt;/p&gt;
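&lt;p&gt;Hand-maintained knowledge files drift as schemas change, so a small generator helps keep them current. A minimal sketch, assuming you have already exported schema metadata as a table-to-columns mapping (the function name and table names are illustrative):&lt;/p&gt;

```python
# Sketch: render docs/team-schemas.md from exported schema metadata.
# Input shape assumed: {table_name: {column_name: data_type}}.
def render_schema_doc(tables: dict) -> str:
    """Render a markdown knowledge file from schema metadata."""
    lines = ["# Team Table Schemas", ""]
    for table, columns in sorted(tables.items()):
        lines.append(f"## {table}")
        for col, dtype in columns.items():
            lines.append(f"- {col}: {dtype}")
        lines.append("")  # blank line between tables
    return "\n".join(lines)
```

&lt;p&gt;Re-run the generator whenever schemas change and commit the output alongside your &lt;code&gt;AGENTS.md&lt;/code&gt;.&lt;/p&gt;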
&lt;h3&gt;Directory-Level Overrides&lt;/h3&gt;
&lt;p&gt;For monorepos, use &lt;code&gt;AGENTS.override.md&lt;/code&gt; in each subdirectory to provide namespace-specific context. The parent &lt;code&gt;AGENTS.md&lt;/code&gt; sets the Dremio conventions; the overrides specify which schemas and tables are relevant to each sub-project.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Codex: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Codex becomes a data engineering assistant in your terminal. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Type a question directly in Codex and get answers from production data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 5 underperforming regions last quarter? Compare to the same quarter last year and suggest which metrics to investigate.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex discovers your tables via MCP, writes a multi-step SQL analysis, runs it against Dremio, and returns a structured answer. You get insights from production data without opening the Dremio UI.&lt;/p&gt;
&lt;p&gt;Follow up with deeper investigation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the worst-performing region, break down the decline by product category. Is it a demand issue or a fulfillment issue? Show return rates and delivery times alongside revenue.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex maintains session context and uses the AGENTS.md conventions to write correct Dremio SQL. The layered guidance system means your global Dremio rules apply automatically.&lt;/p&gt;
&lt;p&gt;This pattern is especially powerful for engineers who live in the terminal. You can explore data, validate hypotheses, and generate insights without switching to a browser-based BI tool.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask Codex to create a complete visualization:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query Dremio&apos;s gold-layer financial views for revenue, expenses, and margins by department. Build a local HTML dashboard with D3.js charts showing trends, a summary table, and conditional formatting for over/under budget departments. Add a dark theme and filter controls.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer financial views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each metric&lt;/li&gt;
&lt;li&gt;Generate an HTML file with D3.js interactive visualizations&lt;/li&gt;
&lt;li&gt;Add conditional formatting (green/red) for budget variance&lt;/li&gt;
&lt;li&gt;Include filter dropdowns for department and date range&lt;/li&gt;
&lt;li&gt;Save the complete file to your project&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open it in a browser for an interactive financial dashboard. No server required. Re-run the prompt weekly with fresh data from Dremio.&lt;/p&gt;
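&lt;p&gt;The conditional-formatting step is the one piece of real logic in that dashboard. A minimal sketch of the rule (the colors, threshold, and function name are illustrative choices, not Codex output):&lt;/p&gt;

```python
# Sketch of the over/under-budget formatting rule the prompt describes:
# departments over budget render red, under budget render green.
def variance_style(actual: float, budget: float) -> dict:
    """Classify a department's spend against budget for display."""
    variance = actual - budget
    color = "#c0392b" if variance > 0 else "#27ae60"  # red over, green under
    label = "over budget" if variance > 0 else "within budget"
    return {"variance": round(variance, 2), "color": color, "label": label}
```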
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build an interactive tool for your team:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Flask app with a REST API that proxies queries to Dremio through dremioframe. Add a React frontend with a table browser, column statistics view, and a SQL sandbox where I can run ad-hoc queries. Include authentication with API keys.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex scaffolds the full-stack app with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flask backend with dremioframe connection pooling&lt;/li&gt;
&lt;li&gt;React frontend with schema browser and SQL editor&lt;/li&gt;
&lt;li&gt;API key middleware for access control&lt;/li&gt;
&lt;li&gt;&lt;code&gt;docker-compose.yml&lt;/code&gt; for easy deployment&lt;/li&gt;
&lt;li&gt;Proper project structure with &lt;code&gt;requirements.txt&lt;/code&gt; and &lt;code&gt;package.json&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This pattern lets you create internal data tools quickly without a formal development cycle.&lt;/p&gt;
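&lt;p&gt;The API-key middleware is the part worth getting right. A minimal sketch of the check using constant-time comparison (the header name and key store are assumptions; a real service would load keys from a secrets manager):&lt;/p&gt;

```python
# Sketch of the API-key check the scaffolded backend would apply before
# proxying a query to Dremio; hmac.compare_digest avoids timing leaks.
import hmac

VALID_KEYS = {"team-analytics": "s3cr3t-key"}  # illustrative; load from secrets in practice

def authorize(headers: dict) -> bool:
    """Return True if the request carries a known API key."""
    presented = headers.get("X-API-Key", "")
    return any(hmac.compare_digest(presented, key) for key in VALID_KEYS.values())
```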
&lt;h3&gt;Generate Data Pipeline Code&lt;/h3&gt;
&lt;p&gt;Automate your ETL workflows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Python pipeline using dremioframe that incrementally processes new customer records. Create bronze views for raw data with TIMESTAMP casts, silver views with deduplication and email validation, and gold views with customer segmentation logic using CASE WHEN expressions. Add logging, error handling, and a summary report at the end.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex follows the Dremio conventions from your AGENTS.md and produces production-ready pipeline code. The AGENTS.md cross-tool portability means the same conventions apply whether you run this from Codex, OpenCode, or OpenWork.&lt;/p&gt;
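&lt;p&gt;For a sense of what &quot;follows the conventions&quot; means concretely, here is a hand-written sketch of the kind of statements such a pipeline emits. The folder names, columns, and segmentation thresholds are illustrative, not generated output:&lt;/p&gt;

```python
# Sketch: the bronze TIMESTAMP-cast convention and a gold-layer
# CASE WHEN segmentation expression, as AGENTS.md rules would require.
def bronze_view(source: str) -> str:
    """CREATE VIEW for the bronze layer with the TIMESTAMP cast convention."""
    return (
        "CREATE OR REPLACE VIEW bronze.customers AS\n"
        "SELECT id, email, CAST(created_at AS TIMESTAMP) AS created_at\n"
        f"FROM {source}"
    )

def gold_segment_expr(column: str) -> str:
    """CASE WHEN segmentation expression for the gold layer."""
    return (
        f"CASE WHEN {column} > 1000 THEN 'enterprise' "
        f"WHEN {column} > 100 THEN 'growth' "
        "ELSE 'self-serve' END AS segment"
    )
```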
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Serve lakehouse data to other applications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a FastAPI service that connects to Dremio and serves customer analytics. Add endpoints for cohort analysis, retention metrics, and revenue forecasting. Include request validation, response caching, and health checks.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Codex generates a complete API server ready for &lt;code&gt;uvicorn main:app --reload&lt;/code&gt;.&lt;/p&gt;
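&lt;p&gt;The response-caching requirement can be sketched with &lt;code&gt;functools.lru_cache&lt;/code&gt;; a production service would add a TTL and invalidation, and &lt;code&gt;run_dremio_query&lt;/code&gt; here is a stand-in for the actual Dremio call, not a real API:&lt;/p&gt;

```python
# Sketch: memoize query results per (endpoint, params) so repeated
# requests skip the Dremio round trip.
from functools import lru_cache

CALLS = {"count": 0}

def run_dremio_query(endpoint, params):
    """Stand-in for the real Dremio round trip (counts invocations here)."""
    CALLS["count"] += 1
    return []

@lru_cache(maxsize=128)
def cached_metric(endpoint: str, params: tuple):
    # params must be hashable (a tuple of key/value pairs) for lru_cache
    return run_dremio_query(endpoint, params)
```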
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AGENTS.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, doc references, portable config&lt;/td&gt;
&lt;td&gt;Teams needing cross-tool consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Config&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored to your schemas, patterns, and monorepo layout&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These approaches stack. Start with the MCP server for live data access, add an &lt;code&gt;AGENTS.md&lt;/code&gt; for Dremio conventions, and supplement with knowledge files as your team identifies recurring patterns. The layered guidance system in Codex (global, project, nested overrides) makes it easy to manage Dremio context at every level of your project hierarchy.&lt;/p&gt;
&lt;p&gt;If your team uses multiple AI coding tools, invest in the AGENTS.md approach first. It gives you a single Dremio configuration that works across tools, and you can layer in MCP for live data access from whichever agent you are using at the time.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits included).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to Codex&apos;s MCP configuration.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and copy the &lt;code&gt;AGENTS.md&lt;/code&gt; to your project root.&lt;/li&gt;
&lt;li&gt;Start Codex and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse delivers the three things Codex needs to write accurate analytics code: the semantic layer provides business context, query federation provides universal data access, and Reflections provide interactive speed. The MCP server bridges them, and &lt;code&gt;AGENTS.md&lt;/code&gt; teaches the agent your team&apos;s conventions.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or take the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with JetBrains AI Assistant: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-jetbrains-ai/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-jetbrains-ai/</guid><description>
JetBrains AI Assistant is built into IntelliJ IDEA, PyCharm, DataGrip, and every JetBrains IDE. It provides AI chat, inline code generation, multi-fi...</description><pubDate>Thu, 05 Mar 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;JetBrains AI Assistant is built into IntelliJ IDEA, PyCharm, DataGrip, and every JetBrains IDE. It provides AI chat, inline code generation, multi-file refactoring, and agentic background workers that can autonomously execute multi-step tasks. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives the AI Assistant the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. JetBrains IDEs are especially strong for data engineering: DataGrip provides native database tooling, IntelliJ supports full-stack development, and PyCharm is the standard for Python data work. Adding Dremio context to the AI Assistant turns these IDEs into data-aware development environments.&lt;/p&gt;
&lt;p&gt;A unique feature of the JetBrains ecosystem is its dual MCP role: the AI Assistant acts as an MCP client (connecting to external servers like Dremio), and the IDE itself can also act as an MCP server (exposing IDE tools to other AI clients).&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/08/jetbrains-dremio-architecture.png&quot; alt=&quot;JetBrains IntelliJ AI Assistant connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up JetBrains AI Assistant&lt;/h2&gt;
&lt;p&gt;If you do not already have JetBrains AI Assistant:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install a JetBrains IDE&lt;/strong&gt; — IntelliJ IDEA, PyCharm, DataGrip, or any other JetBrains IDE from &lt;a href=&quot;https://www.jetbrains.com/&quot;&gt;jetbrains.com&lt;/a&gt;. Community editions are free; Ultimate editions require a subscription.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Activate AI Assistant&lt;/strong&gt; — AI Assistant is included with JetBrains IDE subscriptions (2025.1+). Go to &lt;strong&gt;Settings &amp;gt; Plugins&lt;/strong&gt; and ensure &amp;quot;AI Assistant&amp;quot; is enabled.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with your JetBrains account to activate the AI quota.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open the AI Chat&lt;/strong&gt; by clicking the AI Assistant icon in the right sidebar or pressing &lt;code&gt;Alt+Enter&lt;/code&gt; on a code selection.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;JetBrains AI Assistant supports multiple LLM providers. You can use JetBrains-hosted models, connect your own API keys for Anthropic or OpenAI, or run local models via OpenAI-compatible servers for privacy-sensitive environments.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project ships with a built-in MCP server. JetBrains AI Assistant supports MCP as a client starting with version 2025.1.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For JetBrains, you configure the MCP connection through the IDE settings.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;JetBrains MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure JetBrains MCP Connection&lt;/h3&gt;
&lt;p&gt;Go to &lt;strong&gt;Settings &amp;gt; Tools &amp;gt; AI Assistant &amp;gt; Model Context Protocol (MCP)&lt;/strong&gt;. Click &lt;strong&gt;Add&lt;/strong&gt; and select the transport type:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Streamable HTTP&lt;/strong&gt;: For Dremio Cloud&apos;s hosted MCP server. Enter the MCP URL directly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;STDIO&lt;/strong&gt;: For the self-hosted dremio-mcp server. Enter the command and arguments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For HTTP configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Name: Dremio
Type: Streamable HTTP
URL: https://YOUR_PROJECT_MCP_URL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After adding the server, the AI Assistant has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column definitions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls catalog descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows data lineage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test by asking the AI chat: &amp;quot;What tables are available in Dremio?&amp;quot;&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, configure the dremio-mcp server as STDIO transport:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Name: Dremio
Type: STDIO
Command: uv
Arguments: run --directory /path/to/dremio-mcp dremio-mcp-server run
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use Project Rules for Dremio Context&lt;/h2&gt;
&lt;p&gt;JetBrains AI Assistant supports project-specific rules through markdown files in &lt;code&gt;.aiassistant/rules/&lt;/code&gt;. These files provide persistent AI instructions scoped to your project.&lt;/p&gt;
&lt;h3&gt;Create Project Rules&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;.aiassistant/rules/dremio.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio SQL Conventions

This project uses Dremio Cloud as its lakehouse platform.

## SQL Rules
- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials
- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Cloud endpoint: environment variable DREMIO_URI

## Terminology
- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also set rules via the IDE: &lt;strong&gt;Settings &amp;gt; Tools &amp;gt; AI Assistant &amp;gt; Project Rules&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Custom Prompts&lt;/h3&gt;
&lt;p&gt;Create reusable prompts in the Prompt Library (&lt;strong&gt;Settings &amp;gt; Tools &amp;gt; AI Assistant &amp;gt; Prompt Library&lt;/strong&gt;). For example, create a &amp;quot;Dremio SQL Review&amp;quot; prompt that validates SQL against Dremio conventions before execution. These prompts are available from the AI Actions menu and can be invoked on selected code.&lt;/p&gt;
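&lt;p&gt;As a starting point, a &quot;Dremio SQL Review&quot; prompt might read as follows (the wording is illustrative; adapt it to your own rules):&lt;/p&gt;

```markdown
Review the selected SQL against our Dremio conventions:
1. Flag CREATE SCHEMA or CREATE NAMESPACE (use CREATE FOLDER IF NOT EXISTS).
2. Check that DATE columns are cast to TIMESTAMP before joins.
3. Verify duration math uses TIMESTAMPDIFF.
Report each violation with a corrected statement.
```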
&lt;h3&gt;DataGrip Integration&lt;/h3&gt;
&lt;p&gt;If you use DataGrip or the Database plugin in IntelliJ, you can connect directly to Dremio as a JDBC data source. The AI Assistant then has access to your live schema through the IDE&apos;s built-in database tools, complementing the MCP-based approach.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/08/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users, and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides knowledge files and rules:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy the knowledge files into your project&apos;s &lt;code&gt;.aiassistant/rules/&lt;/code&gt; directory and reference them from your project rules.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a protocol file and sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in your project rules:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL validation, read DREMIO_AGENT.md in ./dremio-agent-md/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own Project Rules&lt;/h2&gt;
&lt;p&gt;Create a comprehensive rules setup in &lt;code&gt;.aiassistant/rules/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.aiassistant/rules/
  dremio-sql.md           # SQL conventions
  dremio-python.md        # dremioframe patterns
  dremio-schemas.md       # Team table schemas
  dremio-api.md           # REST API patterns
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Export your actual schemas from Dremio and keep them as a rule file. The AI Assistant reads all files in the &lt;code&gt;rules/&lt;/code&gt; directory and applies them to relevant interactions.&lt;/p&gt;
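&lt;p&gt;One way to keep &lt;code&gt;dremio-schemas.md&lt;/code&gt; current is to regenerate it from a catalog listing. A sketch, with the response shape loosely based on Dremio&apos;s catalog REST API (treat the field names as assumptions and verify against the official docs):&lt;/p&gt;

```python
# Sketch: turn a Dremio catalog listing (JSON) into a rules file for
# .aiassistant/rules/. Field names ("data", "path", "type") are assumptions.
import json

def catalog_to_rules(catalog_json: str) -> str:
    """Render dataset paths from a catalog listing as a markdown rule file."""
    entries = json.loads(catalog_json).get("data", [])
    lines = ["# Dremio Datasets", ""]
    for entry in entries:
        if entry.get("type") == "DATASET":
            lines.append("- " + ".".join(entry.get("path", [])))
    return "\n".join(lines)
```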
&lt;h2&gt;Using Dremio with JetBrains AI: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, the AI Assistant becomes a data-aware coding partner across all JetBrains IDEs.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;In the AI Chat panel, ask questions about your lakehouse:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 accounts by contract value last quarter? Break down by industry vertical and show renewal rates.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The AI uses MCP to discover tables, writes the SQL, and returns results. In DataGrip, you can execute the generated SQL directly in the query console for additional exploration.&lt;/p&gt;
&lt;p&gt;Follow up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For accounts with renewal rates below 70%, pull their support ticket history and calculate average resolution time. Cross-reference with product usage metrics.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The AI maintains conversation context and generates multi-table joins.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask the AI to generate a dashboard project:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query Dremio gold-layer views for revenue, customer metrics, and churn data. Create an HTML dashboard with ECharts. Include date filters, dark theme, and regional drill-down. Generate separate files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The AI generates the complete dashboard. In IntelliJ or WebStorm, you can preview the HTML directly in the IDE&apos;s built-in browser.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Generate a data tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app connected to Dremio via dremioframe. Include schema browsing, SQL query editor with syntax highlighting, data preview with pagination, and CSV download. Generate requirements.txt.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In PyCharm, the AI generates the app and you can run it directly from the IDE with integrated debugging.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Use the AI for data engineering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a Medallion Architecture pipeline using dremioframe. Bronze: ingest raw data. Silver: deduplicate, validate, standardize timestamps. Gold: business metrics and KPIs. Include logging and error handling.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The AI generates the pipeline code following your project rules. PyCharm&apos;s debugger lets you step through the pipeline against live Dremio data.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Scaffold backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI app that serves Dremio analytics through REST endpoints. Add customer segments, revenue by region, and product trends. Include Pydantic models and OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;IntelliJ&apos;s HTTP client lets you test the endpoints directly from the IDE.&lt;/p&gt;
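&lt;p&gt;For example, a request file for the HTTP client might look like this (the port, path, and header are illustrative; &lt;code&gt;{{api_key}}&lt;/code&gt; resolves from an HTTP client environment file):&lt;/p&gt;

```http
### Cohort analysis (endpoint path is illustrative)
GET http://localhost:8000/analytics/cohorts?start=2026-01-01
X-API-Key: {{api_key}}
```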
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Project Rules&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, persistent AI context&lt;/td&gt;
&lt;td&gt;Teams with specific standards per IDE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Rules&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, custom prompts, team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with DataGrip/PyCharm workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for live data access. Add project rules for conventions. Use DataGrip&apos;s native Dremio connection for schema exploration alongside MCP.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in &lt;strong&gt;Settings &amp;gt; Tools &amp;gt; AI Assistant &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;.aiassistant/rules/dremio.md&lt;/code&gt; with your SQL conventions.&lt;/li&gt;
&lt;li&gt;Open AI Chat and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives the JetBrains AI Assistant accurate data context, and the IDE&apos;s native database tooling provides complementary schema exploration and SQL execution.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Google Antigravity: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-google-antigravity/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-google-antigravity/</guid><description>
Google Antigravity is an agent-first IDE built by Google DeepMind. Its autonomous agents plan multi-step tasks, write code, browse documentation, and...</description><pubDate>Thu, 05 Mar 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Google Antigravity is an agent-first IDE built by Google DeepMind. Its autonomous agents plan multi-step tasks, write code, browse documentation, and iterate without constant hand-holding. Dremio is a unified lakehouse platform that provides the business context, universal data access, and interactive query speed that AI agents need to produce accurate analytics.&lt;/p&gt;
&lt;p&gt;Connecting the two gives your Antigravity agents something most coding agents lack: direct access to your data catalog, table schemas, business logic encoded in views, and the correct SQL dialect for Dremio&apos;s query engine. Without it, the agent guesses at table names and hallucinates SQL functions. With it, the agent writes queries that actually run.&lt;/p&gt;
&lt;p&gt;Antigravity&apos;s skill system is a particularly strong fit for Dremio integration. Skills load on demand based on semantic matching, so Dremio knowledge enters the context only when the agent needs it. This keeps the context window efficient for tasks that have nothing to do with data, while still providing deep Dremio expertise when you shift to analytics work.&lt;/p&gt;
&lt;p&gt;This post walks through four integration approaches. Each one adds a different kind of context, and they combine well. You can start with the simplest option and layer in more approaches as your team&apos;s Dremio usage grows.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/07/antigravity-dremio-mcp-architecture.png&quot; alt=&quot;Google Antigravity IDE connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Google Antigravity&lt;/h2&gt;
&lt;p&gt;If you do not already have Antigravity installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Antigravity&lt;/strong&gt; from the &lt;a href=&quot;https://deepmind.google/tools/&quot;&gt;Google DeepMind tools page&lt;/a&gt; or your organization&apos;s approved software catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by following the platform-specific instructions (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by launching Antigravity and pointing it to your project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Configure your AI model&lt;/strong&gt; by adding your API key or connecting your Google Cloud account in the IDE settings.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Antigravity&apos;s agent-first design means it can plan multi-step tasks, execute shell commands, browse documentation, and iterate autonomously. Its skill system and rules engine give you fine-grained control over how agents behave.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard for AI tools to call external services. Dremio Cloud includes a built-in MCP server in every project, and Antigravity supports MCP natively through its IDE settings.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. Since Antigravity uses its own MCP configuration, you will configure the connection through the IDE settings instead.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Go to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. The MCP server URL is displayed on the project overview page. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;p&gt;The hosted MCP server uses OAuth to authenticate connections. Your existing Dremio access controls apply to every query your Antigravity agent runs.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name like &amp;quot;Antigravity MCP&amp;quot;.&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URI for your Antigravity setup.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Antigravity&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;In Antigravity, open the MCP settings panel from the IDE preferences. Add a new MCP server with your Dremio project URL and the OAuth client credentials. The agent will now have access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream data dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by asking your Antigravity agent: &amp;quot;What tables are available in Dremio?&amp;quot; The agent will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server. Clone the repo, configure it with your Dremio instance URL and a Personal Access Token (PAT), then point Antigravity&apos;s MCP settings to the local server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In Antigravity&apos;s MCP settings, configure the server to run via the local command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
  &amp;quot;args&amp;quot;: [
    &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
    &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for data exploration and SQL generation (default), &lt;code&gt;FOR_SELF&lt;/code&gt; for system performance analysis, and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for correlating Dremio metrics with your monitoring stack.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use SKILL.md and Agent Rules for Dremio Context&lt;/h2&gt;
&lt;p&gt;Antigravity&apos;s defining feature is its skill system. Skills are reusable knowledge packages that agents discover and load on demand. A skill is a directory containing a &lt;code&gt;SKILL.md&lt;/code&gt; file with YAML frontmatter for discovery and markdown instructions for the agent.&lt;/p&gt;
&lt;p&gt;The key difference from context files in other tools: Antigravity skills are loaded only when relevant. The agent reads the skill&apos;s name and description from the YAML frontmatter, semantically matches them against your prompt, and activates the skill only when it is needed. This avoids wasting context tokens on instructions the agent does not need for the current task.&lt;/p&gt;
&lt;p&gt;This architecture is called progressive disclosure. A tool like Claude Code loads &lt;code&gt;CLAUDE.md&lt;/code&gt; into every session whether you need it or not. Antigravity loads skills selectively. For teams that use Dremio for some projects and not others, this means zero overhead on non-Dremio work.&lt;/p&gt;
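&lt;p&gt;To see the mechanism in miniature, here is a toy sketch of discovery-by-description. Antigravity&apos;s real matcher is semantic rather than keyword-based, and the skill registry below is invented for illustration, but the shape is the same: only skills whose metadata matches the prompt get loaded.&lt;/p&gt;

```python
# Toy model of progressive disclosure: skills are indexed by their
# frontmatter metadata, and only skills whose description overlaps the
# prompt are loaded into context. (Antigravity's real matching is
# semantic; this keyword overlap is just an illustration, and the
# registry below is invented.)

SKILLS = {
    'dremio': {
        'description': 'SQL syntax, REST API patterns, and credential handling for Dremio Cloud',
        'body': '... full SKILL.md instructions ...',
    },
    'terraform': {
        'description': 'module layout and state management conventions for Terraform projects',
        'body': '... full SKILL.md instructions ...',
    },
}

def significant_words(text):
    # Ignore short filler words like 'for' and 'and' when matching.
    return {w.strip('.,').lower() for w in text.split() if len(w.strip('.,')) > 3}

def skills_for_prompt(prompt):
    """Return names of skills whose descriptions overlap the prompt."""
    prompt_words = significant_words(prompt)
    return [
        name for name, skill in SKILLS.items()
        if prompt_words.intersection(significant_words(skill['description']))
    ]

active = skills_for_prompt('generate dremio sql for daily revenue')  # ['dremio']
idle = skills_for_prompt('refactor this react component')            # []
```

&lt;p&gt;The payoff is exactly the zero-overhead property described above: a prompt with no Dremio signal loads no Dremio instructions into the context window.&lt;/p&gt;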
&lt;h3&gt;How SKILL.md Works&lt;/h3&gt;
&lt;p&gt;A &lt;code&gt;SKILL.md&lt;/code&gt; file has two parts:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
name: Dremio Conventions
description: SQL syntax, REST API patterns, and credential handling for Dremio Cloud
---

# Dremio Conventions

## SQL Rules
- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE)
- Tables in the Open Catalog use folder.subfolder.table_name without a catalog prefix
- External sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins

## Credentials
- Never hardcode PATs. Use environment variable DREMIO_PAT
- Dremio Cloud endpoint: environment variable DREMIO_URI

## Reference
- For SQL syntax validation, read knowledge/sql-reference.md
- For REST API endpoints, read knowledge/rest-api.md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Place this in &lt;code&gt;.agent/skills/dremio/SKILL.md&lt;/code&gt; for workspace scope or &lt;code&gt;~/.agent/skills/dremio/SKILL.md&lt;/code&gt; for global scope.&lt;/p&gt;
&lt;h3&gt;Agent Rules for Always-On Guidance&lt;/h3&gt;
&lt;p&gt;Skills activate on demand. For instructions that should apply to every session regardless of the prompt, use Antigravity&apos;s rules system. Place markdown files in &lt;code&gt;.agent/rules/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# .agent/rules/dremio-sql.md

When writing Dremio SQL:
- Never use CREATE SCHEMA or CREATE NAMESPACE. Dremio uses CREATE FOLDER IF NOT EXISTS.
- Always validate function names against the Dremio SQL reference before including them.
- Use TIMESTAMPDIFF for duration calculations, not DATEDIFF.
- Dremio is not a data warehouse. It is an Agentic Lakehouse platform.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rules load at session start, similar to &lt;code&gt;CLAUDE.md&lt;/code&gt; in Claude Code. Use rules for hard constraints (like SQL dialect rules) and skills for reference knowledge (like API documentation).&lt;/p&gt;
&lt;h3&gt;Workflows for Repetitive Dremio Tasks&lt;/h3&gt;
&lt;p&gt;Antigravity also supports workflows in &lt;code&gt;.agent/workflows/&lt;/code&gt;. These are saved prompts the agent follows step by step. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# .agent/workflows/dremio-data-model.md
---
description: Create a bronze-silver-gold data model in Dremio
---

1. Read the Dremio skill for SQL conventions
2. Create folders for bronze, silver, and gold layers
3. Create bronze views with column renames and TIMESTAMP casts
4. Create silver views joining bronze views with business logic
5. Create gold views with CASE WHEN classifications
6. Enable AI-generated wikis on gold views
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/07/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;p&gt;Two community-supported open-source repositories provide ready-made Dremio context. Antigravity has first-class support for the skill-based approach.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio offers an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude Code plugin&lt;/a&gt; for Claude-based tools, and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill: Native Antigravity Skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository is designed for tools like Antigravity. It contains a complete &lt;code&gt;dremio-skill/&lt;/code&gt; directory with &lt;code&gt;SKILL.md&lt;/code&gt;, comprehensive &lt;code&gt;knowledge/&lt;/code&gt; files (CLI, Python SDK, SQL, REST API), and configuration files for other tools.&lt;/p&gt;
&lt;p&gt;Install it globally:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose &lt;strong&gt;Global Install (Symlink)&lt;/strong&gt; when prompted. This creates a symlink from the repo&apos;s &lt;code&gt;dremio-skill/&lt;/code&gt; directory to &lt;code&gt;~/.agent/skills/&lt;/code&gt;, making the skill available in every Antigravity workspace. When you pull updates to the repo, the skill updates automatically.&lt;/p&gt;
&lt;p&gt;After installation, start a new Antigravity session and ask it to scan for available skills. The agent will discover the Dremio skill by its name and description, and load it whenever you ask Dremio-related questions.&lt;/p&gt;
&lt;p&gt;For team projects, choose &lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; instead. This copies the skill into your project and sets up &lt;code&gt;.agent&lt;/code&gt; symlinks so every team member who clones the repo gets the same context.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md: Documentation Protocol (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; protocol file and browsable sitemaps of the Dremio documentation.&lt;/p&gt;
&lt;p&gt;Clone it and tell your Antigravity agent to read it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then instruct the agent: &amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory. Use the sitemaps in dremio_sitemaps/ to verify Dremio syntax before generating any SQL.&amp;quot;&lt;/p&gt;
&lt;p&gt;This approach is useful when you need the agent to cross-reference specific documentation pages rather than rely on pre-packaged knowledge files.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build Your Own Dremio Skill&lt;/h2&gt;
&lt;p&gt;If the pre-built skill does not fit your workflow, build a custom one. Antigravity&apos;s skill system makes this straightforward.&lt;/p&gt;
&lt;h3&gt;Create the Skill Structure&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;.agent/skills/my-dremio/
  SKILL.md
  knowledge/
    sql-conventions.md
    team-schemas.md
    dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Write the SKILL.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
name: Team Dremio Skill
description: SQL conventions, table schemas, and dremioframe patterns for our analytics lakehouse
---

# Team Dremio Skill

## SQL Standards
- All tables are under the analytics namespace
- Bronze: analytics.bronze.*, Silver: analytics.silver.*, Gold: analytics.gold.*
- Always use TIMESTAMP, never DATE
- Validate function names against knowledge/sql-conventions.md

## Authentication
- Use env var DREMIO_PAT for tokens
- Cloud endpoint: env var DREMIO_URI

## Common Tasks
- For bulk data operations, use dremioframe patterns in knowledge/dremioframe-patterns.md
- For table schemas, check knowledge/team-schemas.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Populate Knowledge Files&lt;/h3&gt;
&lt;p&gt;Export your actual table schemas from Dremio and save them as markdown in the &lt;code&gt;knowledge/&lt;/code&gt; directory. Include dremioframe code snippets your team uses frequently, REST API call patterns for your CI/CD pipeline, and SQL examples that follow your naming conventions.&lt;/p&gt;
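&lt;p&gt;For example, a &lt;code&gt;knowledge/team-schemas.md&lt;/code&gt; entry might look like this (the table and columns here are illustrative, not part of any standard layout):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## analytics.gold.customer_ltv

| Column | Type | Notes |
| --- | --- | --- |
| customer_id | VARCHAR | Joins to analytics.silver.customers |
| ltv_usd | DOUBLE | Lifetime value in USD |
| cohort_month | TIMESTAMP | First-purchase month (TIMESTAMP per team standard) |
&lt;/code&gt;&lt;/pre&gt;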
&lt;p&gt;The advantage of a custom skill over a generic rules file: skills activate based on semantic matching. When you ask about a completely unrelated topic, the Dremio skill stays out of the context window. When you ask about data pipelines or SQL, the agent pulls it in automatically.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Antigravity: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Antigravity&apos;s agents can execute complete data projects autonomously. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Ask your Antigravity agent questions about your lakehouse in plain English:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What is the average order value by product category for the last 6 months? Show me which categories are trending up.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent uses MCP to discover relevant tables, writes and runs the SQL against Dremio, and returns formatted results with analysis. No hand-written SQL required.&lt;/p&gt;
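&lt;p&gt;Under the hood, the SQL the agent runs for a question like this might resemble the following. The table and column names are hypothetical; your catalog metadata and skill conventions determine the real ones.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  product_category,
  AVG(order_total) AS avg_order_value
FROM analytics.gold.orders
WHERE order_ts &amp;gt;= TIMESTAMPADD(MONTH, -6, CURRENT_TIMESTAMP)
GROUP BY product_category
ORDER BY avg_order_value DESC
&lt;/code&gt;&lt;/pre&gt;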
&lt;p&gt;Take it further with multi-step analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the categories trending up, pull the top 5 products in each and compare their margins. Are we making more revenue but at lower margins?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Antigravity&apos;s skill system loads the Dremio conventions automatically when it detects a data-related question, so the SQL it generates follows your team&apos;s standards without you needing to remind it.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Give the agent a broader task:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer customer analytics views in Dremio. Build a local HTML dashboard with Plotly.js charts showing customer lifetime value distribution, churn rates by cohort, and retention curves. Include date range filters and a dark theme.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Antigravity will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Activate the Dremio skill to understand your SQL conventions&lt;/li&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute the SQL queries&lt;/li&gt;
&lt;li&gt;Generate an HTML file with Plotly.js interactive charts&lt;/li&gt;
&lt;li&gt;Add filter controls and a responsive layout&lt;/li&gt;
&lt;li&gt;Save it to your workspace&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open the HTML file in a browser for a complete dashboard running from a local file. The Plotly.js charts support zoom, pan, hover tooltips, and export to PNG.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Ask for an interactive tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a Streamlit app that connects to Dremio using dremioframe. Add a sidebar for browsing schemas and tables, a detail view showing table schemas and wiki descriptions, a SQL query editor with syntax highlighting, and a results panel with pagination and CSV download.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Antigravity writes the full Python application with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dremio catalog browser using the MCP connection for live schema data&lt;/li&gt;
&lt;li&gt;SQL editor with autocomplete based on discovered table names&lt;/li&gt;
&lt;li&gt;Paginated results display with export options&lt;/li&gt;
&lt;li&gt;Connection management using environment variables&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;streamlit run app.py&lt;/code&gt; and your team has a local data explorer without waiting for a BI tool deployment.&lt;/p&gt;
&lt;h3&gt;Automate Data Workflows with Antigravity Workflows&lt;/h3&gt;
&lt;p&gt;Use Antigravity&apos;s workflow system to create repeatable Dremio operations:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Using the Dremio skill, write a Python script that creates a bronze-silver-gold view hierarchy for our new user events table. Follow the Medallion Architecture patterns. Bronze should rename columns to snake_case and cast dates. Silver should deduplicate and validate required fields. Gold should aggregate daily active users and session duration by segment.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent references the Dremio skill for conventions and produces structured SQL and Python code. Save the prompt as an Antigravity workflow in &lt;code&gt;.agent/workflows/new-data-model.md&lt;/code&gt; so any team member can run it for new tables.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a Flask API that queries Dremio&apos;s gold-layer views. Create endpoints for customer segments, revenue trends, and product performance. Include caching with a 5-minute TTL and rate limiting. Generate OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Antigravity generates the full application with proper error handling, connection pooling via dremioframe, and production-ready configuration.&lt;/p&gt;
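&lt;p&gt;To make the caching requirement concrete, here is a minimal sketch of one such endpoint. This is an assumption-heavy illustration: the route path and payload are invented, and the query function is a stub standing in for a real Dremio call (a production version would execute SQL against a gold-layer view through dremioframe or the Dremio REST API rather than return canned rows).&lt;/p&gt;

```python
import time
from flask import Flask, jsonify

app = Flask(__name__)

TTL_SECONDS = 300  # 5-minute TTL, matching the prompt above
_cache = {}        # {cache_key: {'ts': epoch_seconds, 'data': rows}}

def fetch_revenue_trends():
    # Stub standing in for a real Dremio query against a gold-layer view.
    return [
        {'month': '2026-01', 'revenue': 125000},
        {'month': '2026-02', 'revenue': 131000},
    ]

def cached(key, loader):
    """Return cached rows for key, refreshing them when the TTL expires."""
    entry = _cache.get(key)
    now = time.time()
    if entry is None or now - entry['ts'] > TTL_SECONDS:
        entry = {'ts': now, 'data': loader()}
        _cache[key] = entry
    return entry['data']

@app.route('/api/revenue-trends')
def revenue_trends():
    return jsonify(cached('revenue-trends', fetch_revenue_trends))
```

&lt;p&gt;Rate limiting and OpenAPI docs would layer on top of this skeleton (for example via a package like flask-limiter and a spec generator), which is exactly the kind of boilerplate the agent fills in for you.&lt;/p&gt;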
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SKILL.md + Rules&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, on-demand doc references&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skill&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Skill&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored to your schemas, patterns, workflows&lt;/td&gt;
&lt;td&gt;Mature teams with specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Combine them for the strongest setup. Use the MCP server for live data, a pre-built skill for general Dremio knowledge, rules for hard SQL constraints, and a custom skill for your team&apos;s specific schemas and patterns.&lt;/p&gt;
&lt;p&gt;If you are evaluating Dremio for the first time, start with the MCP server. It takes five minutes and gives you immediate querying capabilities. As you develop team conventions, add rules files for the constraints that should apply universally. Once you have a stable set of patterns, package them into a custom skill that your entire team can install.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits included).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint under &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in Antigravity&apos;s MCP settings panel.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and run &lt;code&gt;./install.sh&lt;/code&gt; with global symlink mode.&lt;/li&gt;
&lt;li&gt;Start a new Antigravity session and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse provides the three things Antigravity agents need for accurate analytics: the semantic layer delivers business context, query federation delivers universal data access, and Reflections deliver interactive speed. The MCP server connects them, and skills teach the agent your team&apos;s conventions.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or take the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with GitHub Copilot: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-github-copilot/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-github-copilot/</guid><description>
GitHub Copilot is the most widely adopted AI coding assistant, integrated into VS Code, JetBrains IDEs, and the GitHub platform. Its agent mode allow...</description><pubDate>Thu, 05 Mar 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;GitHub Copilot is the most widely adopted AI coding assistant, integrated into VS Code, JetBrains IDEs, and the GitHub platform. Its agent mode allows Copilot to plan and execute multi-step coding tasks, run terminal commands, and interact with external tools through MCP. The Copilot CLI extends agentic development to the terminal. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Copilot&apos;s agent mode the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. This is significant because of Copilot&apos;s massive user base: if you already use Copilot for code completion and chat, adding Dremio context turns it into a data-aware development partner without switching tools.&lt;/p&gt;
&lt;p&gt;Copilot&apos;s &lt;code&gt;copilot-instructions.md&lt;/code&gt; file and &lt;code&gt;.vscode/mcp.json&lt;/code&gt; configuration make it straightforward to integrate project-specific Dremio conventions and live data access into your workflow.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/06/copilot-dremio-architecture.png&quot; alt=&quot;GitHub Copilot agent mode in VS Code connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up GitHub Copilot&lt;/h2&gt;
&lt;p&gt;If you do not already have GitHub Copilot:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Sign up for GitHub Copilot&lt;/strong&gt; at &lt;a href=&quot;https://github.com/features/copilot&quot;&gt;github.com/features/copilot&lt;/a&gt;. Individual ($10/month), Business ($19/user/month), and Enterprise ($39/user/month) plans are available. A free tier with limited completions is also available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install VS Code&lt;/strong&gt; from &lt;a href=&quot;https://code.visualstudio.com/&quot;&gt;code.visualstudio.com&lt;/a&gt; if not already installed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install the GitHub Copilot extension&lt;/strong&gt; from the VS Code marketplace (search &amp;quot;GitHub Copilot&amp;quot;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with your GitHub account when prompted.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable agent mode&lt;/strong&gt; by clicking the Copilot chat icon and selecting &amp;quot;Agent&amp;quot; from the mode dropdown (available in VS Code 1.99+).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For terminal usage, install the &lt;strong&gt;Copilot CLI&lt;/strong&gt; (&lt;code&gt;gh copilot&lt;/code&gt;) through the GitHub CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;gh extension install github/gh-copilot
&lt;/code&gt;&lt;/pre&gt;
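&lt;p&gt;Once installed, you can ask for command suggestions or explanations directly from the terminal. The prompts below are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Ask Copilot to suggest a shell command
gh copilot suggest &amp;quot;list the 10 largest files in this repository&amp;quot;

# Ask Copilot to explain an unfamiliar command
gh copilot explain &amp;quot;git rebase -i HEAD~3&amp;quot;
&lt;/code&gt;&lt;/pre&gt;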
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project ships with a built-in MCP server. Copilot agent mode supports MCP natively through workspace configuration files.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Copilot, you configure the MCP connection through &lt;code&gt;.vscode/mcp.json&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Copilot MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Copilot&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;.vscode/mcp.json&lt;/code&gt; in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;servers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;type&amp;quot;: &amp;quot;http&amp;quot;,
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also configure MCP servers in your VS Code user settings:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcp&amp;quot;: {
    &amp;quot;servers&amp;quot;: {
      &amp;quot;dremio&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;http&amp;quot;,
        &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reload VS Code. Copilot agent mode now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by opening Copilot Chat in agent mode and asking: &amp;quot;What tables are available in Dremio?&amp;quot; The agent will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;servers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;type&amp;quot;: &amp;quot;stdio&amp;quot;,
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Enterprise Policy Controls&lt;/h3&gt;
&lt;p&gt;For organizations, GitHub administrators can manage MCP server access through organization policies. This lets teams standardize on approved Dremio MCP connections while preventing unauthorized data access.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use copilot-instructions.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;Copilot reads custom instructions from &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; in your repository. This file is loaded into every Copilot interaction, providing persistent project context.&lt;/p&gt;
&lt;h3&gt;Repository-Level Instructions&lt;/h3&gt;
&lt;p&gt;Create &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio SQL Conventions

This project uses Dremio Cloud as its lakehouse platform.

## SQL Rules
- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials
- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Cloud endpoint: environment variable DREMIO_URI

## Terminology
- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Pattern-Specific Instructions&lt;/h3&gt;
&lt;p&gt;Copilot also supports &lt;code&gt;.instructions&lt;/code&gt; files with YAML glob patterns for targeted application:&lt;/p&gt;
&lt;p&gt;Create &lt;code&gt;.github/instructions/dremio-sql.instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
applyTo: &amp;quot;**/*.sql&amp;quot;
---

When writing SQL for Dremio:
- Validate function names against the Dremio SQL reference
- Use TIMESTAMPDIFF for duration calculations
- Cast DATE columns to TIMESTAMP before joins
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create &lt;code&gt;.github/instructions/dremio-python.instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
applyTo: &amp;quot;**/*.py&amp;quot;
---

When writing Python code that uses dremioframe:
- Import as: from dremioframe import DremioConnection
- Use environment variables for credentials
- Always close connections in a finally block
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This scoping is similar to Cursor&apos;s &lt;code&gt;.cursor/rules/*.mdc&lt;/code&gt; pattern matching.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/06/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users, and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides knowledge files and a &lt;code&gt;.cursorrules&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference the knowledge files from your &lt;code&gt;copilot-instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL conventions, read the knowledge files in dremio-skill/knowledge/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a protocol file and documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio documentation, read DREMIO_AGENT.md in ./dremio-agent-md/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own copilot-instructions.md&lt;/h2&gt;
&lt;p&gt;Create a comprehensive instruction file with your team&apos;s Dremio environment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Team Dremio Context

## Environment
- Lakehouse: Dremio Cloud (analytics project)
- Catalog: Apache Polaris-based Open Catalog
- Architecture: Medallion (bronze → silver → gold)

## Table Schemas
For exact column definitions, read ./docs/table-schemas.md

## SQL Standards
- Bronze: raw.*, Silver: cleaned.*, Gold: analytics.*
- Always use TIMESTAMP, never DATE
- Validate function names against ./docs/dremio-sql-reference.md

## Python SDK
- Use dremioframe for all Dremio connections
- Patterns: read ./docs/dremioframe-patterns.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Using Dremio with GitHub Copilot: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Copilot agent mode can execute complete data projects. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;In agent mode, ask Copilot questions about your data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 customers by lifetime value? Show their order frequency and most recent purchase date.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot agent mode uses MCP to discover tables, writes the SQL, runs it, and returns formatted results. Because it operates within VS Code, you can immediately use the results in your code.&lt;/p&gt;
&lt;p&gt;Follow up with analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For customers with declining order frequency, correlate with support ticket volume. Are our high-value customers churning?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot maintains context across the conversation and generates cross-table queries automatically.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask Copilot in agent mode:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our Dremio gold-layer views for revenue metrics, then create an HTML dashboard with Chart.js. Include monthly trends, regional breakdown, and top product charts. Add date filters and a dark theme. Save as separate HTML, CSS, and JS files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot agent mode will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Call MCP to discover views and schemas&lt;/li&gt;
&lt;li&gt;Execute queries and collect results&lt;/li&gt;
&lt;li&gt;Generate &lt;code&gt;index.html&lt;/code&gt;, &lt;code&gt;styles.css&lt;/code&gt;, and &lt;code&gt;app.js&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Wire everything together with Chart.js&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open &lt;code&gt;index.html&lt;/code&gt; in a browser for a working dashboard. Since this all happens in VS Code, you can iterate on the design with inline edits.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build interactive tools:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app connected to Dremio via dremioframe. Include schema browsing, data preview with pagination, SQL query editor, and CSV export. Generate all files and a README.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot generates the full application. Run &lt;code&gt;streamlit run app.py&lt;/code&gt; for a local data explorer.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Use inline completions for data engineering:&lt;/p&gt;
&lt;p&gt;Write a comment: &lt;code&gt;# Medallion pipeline for product_events: bronze ingestion, silver cleaning, gold aggregation&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Copilot generates the complete pipeline following your &lt;code&gt;copilot-instructions.md&lt;/code&gt; conventions. Agent mode can also run the generated code against your Dremio instance to validate it.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI app that serves Dremio gold-layer data through REST endpoints. Add customer analytics, revenue by region, and product performance. Include Pydantic models, caching, and OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Copilot generates the complete API. Run &lt;code&gt;uvicorn main:app --reload&lt;/code&gt; for a local server.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;copilot-instructions.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, pattern-specific rules&lt;/td&gt;
&lt;td&gt;Teams with repository-wide standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Instructions&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, patterns, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for immediate value. Add &lt;code&gt;copilot-instructions.md&lt;/code&gt; for conventions. Use &lt;code&gt;.instructions&lt;/code&gt; files for pattern-specific rules.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;.vscode/mcp.json&lt;/code&gt; with your Dremio MCP server.&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; with Dremio conventions.&lt;/li&gt;
&lt;li&gt;Open Copilot in agent mode and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Copilot accurate data context. Combined with Copilot&apos;s massive user base and VS Code integration, this is the lowest-friction path to AI-powered data development for most teams.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Gemini CLI: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-gemini-cli/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-gemini-cli/</guid><description>
Gemini CLI is Google&apos;s open-source terminal-based AI agent. It runs directly in your terminal, powered by Gemini models with a 1-million token contex...</description><pubDate>Thu, 05 Mar 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Gemini CLI is Google&apos;s open-source terminal-based AI agent. It runs directly in your terminal, powered by Gemini models with a 1-million token context window. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Gemini CLI the data context it needs to write accurate Dremio SQL, generate pipeline scripts, and build applications against your lakehouse. The 1-million token context window is a significant advantage: Gemini CLI can hold your entire project, documentation, and Dremio schema context simultaneously without the context limitations that constrain other agents.&lt;/p&gt;
&lt;p&gt;Gemini CLI&apos;s &lt;code&gt;GEMINI.md&lt;/code&gt; context file system is similar to &lt;code&gt;CLAUDE.md&lt;/code&gt; in Claude Code. It loads project-specific instructions at session start and supports hierarchical scoping from global defaults to project-specific overrides. The tool also supports MCP natively, Google Search grounding for real-time documentation lookups, and built-in file and shell tools.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/05/gemini-cli-dremio-architecture.png&quot; alt=&quot;Gemini CLI terminal agent connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Gemini CLI&lt;/h2&gt;
&lt;p&gt;If you do not already have Gemini CLI installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install Node.js&lt;/strong&gt; (version 20 or later) from &lt;a href=&quot;https://nodejs.org/&quot;&gt;nodejs.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install Gemini CLI&lt;/strong&gt; globally via npm:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;npm install -g @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;
Or install from source via the &lt;a href=&quot;https://github.com/google-gemini/gemini-cli&quot;&gt;GitHub repository&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authenticate&lt;/strong&gt; by running &lt;code&gt;gemini&lt;/code&gt; in your terminal. On first launch, it will prompt you to sign in with your Google account. Gemini CLI is free to use with a Google account (rate-limited) or with a Gemini API key for higher throughput.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify the installation&lt;/strong&gt; by asking a question non-interactively: &lt;code&gt;gemini -p &amp;quot;What is Apache Iceberg?&amp;quot;&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Gemini CLI runs in your terminal and reads your project files for context. It can execute shell commands, edit files, browse the web via Google Search grounding, and interact with MCP servers.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Gemini CLI supports MCP natively through its &lt;code&gt;settings.json&lt;/code&gt; configuration.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Gemini CLI, you configure the MCP connection through &lt;code&gt;settings.json&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. The MCP server URL is listed on the project overview page. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Gemini CLI MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URI for your setup.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Gemini CLI&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;Gemini CLI reads MCP server definitions from &lt;code&gt;settings.json&lt;/code&gt;. You can configure this at two levels:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;User-level:&lt;/strong&gt; &lt;code&gt;~/.gemini/settings.json&lt;/code&gt; (applies to all projects)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project-level:&lt;/strong&gt; &lt;code&gt;.gemini/settings.json&lt;/code&gt; (applies to the current project only)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Create or edit the settings file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;httpUrl&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also add MCP servers using the CLI command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;gemini mcp add --transport http dremio &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart Gemini CLI. The agent now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by asking: &amp;quot;What tables are available in Dremio?&amp;quot; Gemini CLI will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your &lt;code&gt;settings.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for data exploration (default), &lt;code&gt;FOR_SELF&lt;/code&gt; for system analysis, and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for correlating metrics with monitoring.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use GEMINI.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;Gemini CLI auto-loads &lt;code&gt;GEMINI.md&lt;/code&gt; from your project root at the start of every session. It works similarly to &lt;code&gt;CLAUDE.md&lt;/code&gt; in Claude Code, providing persistent instructions that survive across conversations.&lt;/p&gt;
&lt;h3&gt;Hierarchical Context Loading&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;GEMINI.md&lt;/code&gt; supports hierarchical scoping:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Global:&lt;/strong&gt; &lt;code&gt;~/.gemini/GEMINI.md&lt;/code&gt; applies to every project.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project:&lt;/strong&gt; &lt;code&gt;GEMINI.md&lt;/code&gt; in the project root applies to that specific repo.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Subdirectory:&lt;/strong&gt; &lt;code&gt;GEMINI.md&lt;/code&gt; files in subdirectories provide additional context when working in those folders.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Project-level files override global ones. Subdirectory files add to the project context rather than replacing it.&lt;/p&gt;
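&lt;p&gt;The loading order can be pictured as a simple file-discovery walk. The sketch below is a toy illustration of the hierarchy described above, not Gemini CLI&apos;s actual implementation:&lt;/p&gt;

```python
from pathlib import Path

# Toy illustration of hierarchical GEMINI.md discovery: global first,
# then the project root, then each subdirectory down to the working dir.
def context_files(home, project_root, cwd):
    candidates = [Path(home) / '.gemini' / 'GEMINI.md']    # global defaults
    candidates.append(Path(project_root) / 'GEMINI.md')    # project override
    current = Path(project_root)
    for part in Path(cwd).relative_to(project_root).parts:
        current = current / part
        candidates.append(current / 'GEMINI.md')           # adds, not replaces
    return [p for p in candidates if p.exists()]
```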
&lt;h3&gt;Writing a Dremio-Focused GEMINI.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project Context

This project uses Dremio Cloud as its lakehouse platform.

## Dremio SQL Conventions
- Use `CREATE FOLDER IF NOT EXISTS` (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use `folder.subfolder.table_name` without a catalog prefix
- External federated sources use `source_name.schema.table_name`
- Cast DATE columns to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials
- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Dremio Cloud endpoint is in environment variable: DREMIO_URI

## API Reference
- REST API docs: https://docs.dremio.com/current/reference/api/
- SQL reference: https://docs.dremio.com/current/reference/sql/
- For detailed SQL validation, read ./dremio-docs/sql-reference.md

## Terminology
- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; is built on Apache Polaris
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Protocol Blocks for Gated Instructions&lt;/h3&gt;
&lt;p&gt;A useful convention in &lt;code&gt;GEMINI.md&lt;/code&gt; is to wrap conditional instructions in &lt;code&gt;&amp;lt;PROTOCOL&amp;gt;&lt;/code&gt; blocks so they only activate when specific conditions are met. This prevents context bloat:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;&amp;lt;PROTOCOL&amp;gt;
When the user asks about Dremio SQL or data pipelines:
1. Read ./dremio-docs/sql-reference.md for syntax validation
2. Use Dremio SQL conventions defined above
3. Always verify function names exist in the reference before using them
&amp;lt;/PROTOCOL&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Protocol blocks are a form of delayed instructions. Gemini CLI reads the protocol definition but only executes the instructions when the triggering condition is met. This is more efficient than loading all reference files at session start.&lt;/p&gt;
&lt;h3&gt;Google Search Grounding&lt;/h3&gt;
&lt;p&gt;Gemini CLI has built-in Google Search grounding, meaning it can look up real-time Dremio documentation during a session. You can instruct it in &lt;code&gt;GEMINI.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Documentation Strategy
- Before writing any Dremio SQL, use Google Search to verify the syntax
  against the latest Dremio documentation at docs.dremio.com
- If a function name is uncertain, search for it before including it
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a unique advantage over other agents. Instead of relying solely on pre-loaded context or training data, Gemini CLI can verify syntax against live documentation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/05/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a complete skill directory with &lt;code&gt;SKILL.md&lt;/code&gt;, knowledge files, and configuration files for multiple tools.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Gemini CLI, tell the agent to read the skill at session start:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Read dremio-skill/SKILL.md and use the knowledge files in dremio-skill/knowledge/ for Dremio conventions.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The skill includes knowledge files covering Dremio CLI, Python SDK (dremioframe), SQL syntax, and REST API endpoints.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a master protocol file and browsable documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in your &lt;code&gt;GEMINI.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Dremio Documentation
- Read DREMIO_AGENT.md in ./dremio-agent-md/ for the Dremio protocol
- Use sitemaps in dremio_sitemaps/ to verify SQL syntax
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pairs well with Gemini CLI&apos;s Google Search grounding. The sitemaps provide structured offline references, while Search grounding provides real-time verification.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build Your Own GEMINI.md Context&lt;/h2&gt;
&lt;p&gt;If the pre-built options do not fit your workflow, build a custom &lt;code&gt;GEMINI.md&lt;/code&gt; tailored to your team&apos;s Dremio environment.&lt;/p&gt;
&lt;h3&gt;Create Project Context Files&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;.gemini/
  GEMINI.md              # Points to the reference files below
project-docs/
  dremio-conventions.md  # Team SQL rules
  table-schemas.md       # Exported schemas from Dremio
  common-queries.md      # Frequently used query patterns
  dremioframe-patterns.md # Python SDK code snippets
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Write a Comprehensive GEMINI.md&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Team Dremio Context

## SQL Standards
- All tables are under the analytics namespace
- Bronze: analytics.bronze.*, Silver: analytics.silver.*, Gold: analytics.gold.*
- Always use TIMESTAMP, never DATE
- Validate function names against project-docs/dremio-conventions.md

## Authentication
- Use env var DREMIO_PAT for tokens
- Cloud endpoint: env var DREMIO_URI

## Reference Files
- SQL conventions: project-docs/dremio-conventions.md
- Table schemas (updated weekly): project-docs/table-schemas.md
- Common queries: project-docs/common-queries.md
- Python SDK patterns: project-docs/dremioframe-patterns.md

&amp;lt;PROTOCOL&amp;gt;
When writing Dremio SQL:
1. Read project-docs/table-schemas.md to verify table and column names
2. Read project-docs/dremio-conventions.md to validate function names
3. Use Google Search to verify any Dremio function not in the reference
&amp;lt;/PROTOCOL&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The 1-million token context window means Gemini CLI can hold your entire schema reference, convention guide, and query library simultaneously without truncation.&lt;/p&gt;
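&lt;p&gt;A quick back-of-envelope check makes this concrete. The sketch below uses the common heuristic of roughly four characters per token, which is an approximation rather than Gemini&apos;s actual tokenizer, and the file sizes are hypothetical:&lt;/p&gt;

```python
# Rough context-budget check using the common heuristic of about
# 4 characters per token (an approximation, not Gemini's tokenizer).
def estimate_tokens(file_sizes_bytes, chars_per_token=4):
    return sum(file_sizes_bytes) // chars_per_token

WINDOW = 1_000_000  # advertised 1-million token window

# Hypothetical file sizes in bytes: schema export, convention guide, query library
tokens = estimate_tokens([250_000, 1_200_000, 80_000])
fits = min(tokens, WINDOW) == tokens  # True when the estimate is within budget
print(tokens, fits)  # 382500 True
```

&lt;p&gt;Even with more than a megabyte of reference material, the estimate lands well under the window, with room left for conversation history and query results.&lt;/p&gt;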
&lt;h2&gt;Using Dremio with Gemini CLI: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Gemini CLI becomes a powerful data engineering partner in your terminal. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Ask Gemini CLI questions about your lakehouse in plain English:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 customers by revenue last quarter? Show month-over-month trends and flag any with declining order frequency.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI uses the MCP connection to discover your tables, writes the SQL, runs it against Dremio, and returns formatted results with analysis. The 1-million token context window means it can hold large result sets and build on them across a session.&lt;/p&gt;
&lt;p&gt;Follow up with multi-step analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the customers with declining frequency, pull their support ticket history and calculate the correlation between ticket volume and order decline.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI maintains the full conversation context, including previous query results, and generates the follow-up query with cross-table joins. If it is unsure about a table name or column, it can use Google Search grounding to verify against live Dremio documentation.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask Gemini CLI to create a complete dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio and build a local HTML dashboard with Chart.js. Include monthly revenue trends, top products by region, and customer acquisition metrics. Make it filterable by date range and add a dark theme with print-to-PDF.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each visualization&lt;/li&gt;
&lt;li&gt;Generate an HTML file with embedded CSS, JavaScript, and Chart.js&lt;/li&gt;
&lt;li&gt;Add interactive filter controls and export buttons&lt;/li&gt;
&lt;li&gt;Save it to your project directory&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open the HTML file in a browser for a complete dashboard running from a local file. No server or deployment needed.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build an interactive tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Python Streamlit app that connects to Dremio using dremioframe. Include a schema browser sidebar with table counts, a data preview with pagination, a SQL query editor with syntax highlighting and execution, and CSV download. Generate requirements.txt and README.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI writes the full application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app.py&lt;/code&gt; with Streamlit layout, dremioframe connection, and query execution&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements.txt&lt;/code&gt; with pinned dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env.example&lt;/code&gt; with required environment variables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt; with setup and run instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;streamlit run app.py&lt;/code&gt; and your team has a local data explorer connected to the lakehouse.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering workflows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a dremioframe script that implements a Medallion Architecture pipeline for our new product_events table. Bronze: ingest raw data with column renames and TIMESTAMP casts. Silver: deduplicate on event_id, validate required fields, apply business rules. Gold: aggregate daily active products, event counts by type, and conversion funnels. Include error handling, structured logging, and a dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI uses the GEMINI.md conventions and Dremio skill knowledge to produce production-quality pipeline code. Its Google Search grounding means it can verify Dremio function syntax in real time if the reference files do not cover a specific function.&lt;/p&gt;
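&lt;p&gt;The generated script typically follows a layered skeleton like the one below. The SQL strings are placeholders and the &lt;code&gt;execute&lt;/code&gt; callback stands in for a dremioframe call, so treat this as a shape sketch rather than actual generated output:&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger('product_events_pipeline')

# Placeholder SQL for each medallion layer; real statements would come
# from your schema references and conventions.
STEPS = [
    ('bronze', 'CREATE TABLE IF NOT EXISTS bronze.product_events AS SELECT /* raw ingest */ 1'),
    ('silver', 'CREATE TABLE IF NOT EXISTS silver.product_events AS SELECT /* dedupe, validate */ 1'),
    ('gold', 'CREATE TABLE IF NOT EXISTS gold.daily_product_metrics AS SELECT /* aggregate */ 1'),
]

def run_pipeline(execute, dry_run=False):
    # execute is any callable that runs a SQL statement against Dremio
    completed = []
    for layer, sql in STEPS:
        if dry_run:
            log.info('[dry-run] %s: %s', layer, sql)
        else:
            try:
                execute(sql)
            except Exception:
                log.exception('pipeline failed at the %s layer', layer)
                raise
        completed.append(layer)
    return completed
```

&lt;p&gt;Keeping each layer as a separate step with structured logging and a dry-run flag makes the pipeline easy to test before it touches production tables.&lt;/p&gt;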
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that connects to Dremio using dremioframe. Create endpoints for customer segments, revenue by geography, and product performance trends. Include Pydantic response models, request validation, caching with TTL, and auto-generated OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Gemini CLI generates the complete API server with proper error handling and connection management. Deploy it locally with &lt;code&gt;uvicorn main:app --reload&lt;/code&gt; or containerize for production.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GEMINI.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, protocol blocks, Search grounding&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards or project rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Context&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, patterns, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Combine them for the strongest setup. The MCP server gives live data access; GEMINI.md enforces conventions with protocol blocks; pre-built skills provide broad Dremio knowledge; and custom context files capture your team&apos;s schemas and patterns.&lt;/p&gt;
&lt;p&gt;Start with the MCP server for immediate value. Add a &lt;code&gt;GEMINI.md&lt;/code&gt; with your SQL conventions. Use Google Search grounding as a safety net for syntax verification. As your team develops patterns, build out the context files with schemas and query libraries.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to &lt;code&gt;~/.gemini/settings.json&lt;/code&gt; or &lt;code&gt;.gemini/settings.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and tell Gemini CLI to read the skill.&lt;/li&gt;
&lt;li&gt;Start a session and ask Gemini CLI to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Gemini CLI accurate data context: the semantic layer provides business meaning, query federation provides universal access, and Reflections provide interactive speed. Gemini CLI&apos;s massive context window holds it all, and Google Search grounding provides real-time verification as a safety net.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Cursor: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-cursor/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-cursor/</guid><description>
Cursor is an AI-native code editor built as a fork of VS Code. It integrates AI directly into the editing experience with features like Chat, Compose...</description><pubDate>Thu, 05 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Cursor is an AI-native code editor built as a fork of VS Code. It integrates AI directly into the editing experience with features like Chat, Composer (multi-file editing), and inline code generation. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Cursor&apos;s AI the context it needs to write accurate Dremio SQL, generate data pipeline code, and build applications against your lakehouse. Without this connection, Cursor treats Dremio like a generic database and guesses at function names and table paths. With it, the AI knows your schemas, your business logic encoded in views, and the correct Dremio SQL dialect.&lt;/p&gt;
&lt;p&gt;Cursor&apos;s rules system is especially well-suited for Dremio integration. Rules files in &lt;code&gt;.cursor/rules/&lt;/code&gt; let you define granular, pattern-matched instructions that activate only when relevant. You can set Dremio SQL conventions to apply only when editing &lt;code&gt;.sql&lt;/code&gt; files, and dremioframe patterns to apply only in Python files that import the SDK.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/04/cursor-dremio-architecture.png&quot; alt=&quot;Cursor AI code editor connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Cursor&lt;/h2&gt;
&lt;p&gt;If you do not already have Cursor installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Cursor&lt;/strong&gt; from &lt;a href=&quot;https://www.cursor.com/&quot;&gt;cursor.com&lt;/a&gt; (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install it&lt;/strong&gt; by running the installer. Because Cursor is a VS Code fork with the same extension ecosystem, it can replace VS Code or run alongside it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with a Cursor account. The free tier includes limited AI requests; Pro ($20/month) provides unlimited access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by selecting File &amp;gt; Open Folder and pointing to your project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify AI access&lt;/strong&gt; by pressing &lt;code&gt;Cmd+K&lt;/code&gt; (macOS) or &lt;code&gt;Ctrl+K&lt;/code&gt; (Windows/Linux) to open the inline AI prompt. Type a question to confirm the AI is responding.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Cursor supports all VS Code extensions, themes, and keybindings. If you are migrating from VS Code, your existing setup transfers automatically.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Cursor supports MCP natively through its settings panel.&lt;/p&gt;
&lt;p&gt;For Claude-based tools like Claude Code, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Cursor, you configure the MCP connection through Cursor&apos;s built-in MCP settings.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. The MCP server URL is listed on the project overview page. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s hosted MCP server uses OAuth for authentication. Your existing access controls apply to every query the AI runs.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Cursor MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URIs for Claude:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://claude.com/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Cursor&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;In Cursor, go to &lt;strong&gt;Settings &amp;gt; MCP&lt;/strong&gt;. Click &lt;strong&gt;Add new MCP server&lt;/strong&gt; and configure it with your Dremio project&apos;s MCP URL. You can also add the MCP server by creating a &lt;code&gt;.cursor/mcp.json&lt;/code&gt; file in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restart Cursor. The AI now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns an index of available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream data dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test the connection by opening Cursor Chat (&lt;code&gt;Cmd+L&lt;/code&gt;) and asking: &amp;quot;What tables are available in Dremio?&amp;quot; The AI will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server. Clone the repo, configure it, then add it to Cursor&apos;s MCP settings:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;.cursor/mcp.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes: &lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for data exploration (the default), &lt;code&gt;FOR_SELF&lt;/code&gt; for system analysis, and &lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for correlating Dremio metrics with Prometheus monitoring data.&lt;/p&gt;
&lt;h2&gt;Approach 2: Use Cursor Rules for Dremio Context&lt;/h2&gt;
&lt;p&gt;Cursor&apos;s rules system is one of its strongest differentiators. Rules are markdown files in &lt;code&gt;.cursor/rules/&lt;/code&gt; that provide persistent AI instructions. Unlike a single monolithic context file, Cursor rules support pattern matching, so you can scope instructions to specific file types or directories.&lt;/p&gt;
&lt;h3&gt;Project-Wide Rules with .cursorrules&lt;/h3&gt;
&lt;p&gt;The simplest approach is a &lt;code&gt;.cursorrules&lt;/code&gt; file in your project root. This loads into every AI interaction:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio SQL Conventions
- Use CREATE FOLDER IF NOT EXISTS (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the Open Catalog use folder.subfolder.table_name without a catalog prefix
- External federated sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

# Credentials
- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Dremio Cloud endpoint: environment variable DREMIO_URI

# Terminology
- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
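&lt;p&gt;Applied together, those conventions produce SQL like the following sketch. The folder, source, and column names here are illustrative, not from a real catalog:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Folder creation per the convention above (names are hypothetical)
CREATE FOLDER IF NOT EXISTS analytics.gold;

-- Open Catalog path on the left, federated source path on the right
SELECT o.customer_id,
       TIMESTAMPDIFF(DAY, CAST(o.order_date AS TIMESTAMP), CURRENT_TIMESTAMP) AS days_since_order
FROM analytics.silver.orders AS o
JOIN crm_source.public.customers AS c
  ON c.customer_id = o.customer_id;
&lt;/code&gt;&lt;/pre&gt;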
&lt;h3&gt;Pattern-Matched Rules with .cursor/rules/&lt;/h3&gt;
&lt;p&gt;For more granular control, create rule files in &lt;code&gt;.cursor/rules/&lt;/code&gt; with the &lt;code&gt;.mdc&lt;/code&gt; (Markdown Cursor) extension. Each file begins with YAML frontmatter that tells Cursor when to activate the rule:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Dremio SQL conventions for query files
globs: [&amp;quot;**/*.sql&amp;quot;, &amp;quot;**/queries/**&amp;quot;]
alwaysApply: false
---

# Dremio SQL Rules

When writing or modifying SQL files for Dremio:
- Use CREATE FOLDER IF NOT EXISTS, never CREATE SCHEMA
- Validate function names against the Dremio SQL reference
- Use TIMESTAMPDIFF for duration calculations, not DATEDIFF
- Cast DATE columns to TIMESTAMP before joins
- Reference tables as folder.subfolder.table_name
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a separate rule for Python SDK usage:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: dremioframe Python SDK patterns
globs: [&amp;quot;**/*.py&amp;quot;]
alwaysApply: false
---

# dremioframe Conventions

When writing Python code that uses dremioframe:
- Import as: from dremioframe import DremioConnection
- Use environment variables for credentials: DREMIO_PAT, DREMIO_URI
- Always close connections in a finally block or use context managers
- For bulk operations, use df.to_dremio() with batch_size parameter
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;globs&lt;/code&gt; field ensures these rules only activate when editing matching files. The &lt;code&gt;alwaysApply: false&lt;/code&gt; setting means the AI loads them on demand rather than consuming context tokens on every interaction.&lt;/p&gt;
&lt;h3&gt;Referencing External Documentation&lt;/h3&gt;
&lt;p&gt;Keep rules files concise by pointing to reference documents:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Dremio documentation references
globs: [&amp;quot;**/*.sql&amp;quot;, &amp;quot;**/*.py&amp;quot;]
alwaysApply: false
---

# Dremio Reference Docs
- For SQL syntax details, read `./docs/dremio-sql-reference.md`
- For Python SDK usage, read `./docs/dremioframe-guide.md`
- For REST API endpoints, read `./docs/dremio-rest-api.md`
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cursor loads the referenced files only when the AI needs them, keeping the context window efficient.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/04/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users, and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a comprehensive skill directory with knowledge files and a &lt;code&gt;.cursorrules&lt;/code&gt; file specifically designed for Cursor.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Choose &lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; to copy the &lt;code&gt;.cursorrules&lt;/code&gt; file and knowledge directory into your project. The &lt;code&gt;.cursorrules&lt;/code&gt; file provides Dremio conventions, and the &lt;code&gt;knowledge/&lt;/code&gt; directory contains detailed references for CLI, Python SDK, SQL syntax, and REST API.&lt;/p&gt;
&lt;p&gt;After installation, Cursor automatically picks up the &lt;code&gt;.cursorrules&lt;/code&gt; file and uses it for all AI interactions in the project.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides a &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; protocol file and browsable sitemaps of the Dremio documentation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in your &lt;code&gt;.cursorrules&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For Dremio SQL validation, read DREMIO_AGENT.md in the dremio-agent-md directory.
Use the sitemaps in dremio_sitemaps/ to verify syntax before generating SQL.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Your Own Cursor Rules&lt;/h2&gt;
&lt;p&gt;If the pre-built options do not fit your workflow, create a custom rules setup tailored to your team&apos;s Dremio environment.&lt;/p&gt;
&lt;h3&gt;Create Rule Files&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;.cursor/rules/
  dremio-sql.mdc          # SQL conventions
  dremio-python.mdc       # dremioframe patterns
  dremio-schemas.mdc      # Team-specific table schemas
  dremio-api.mdc          # REST API patterns
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Populate with Team Context&lt;/h3&gt;
&lt;p&gt;Export your actual table schemas from Dremio and save them as a rule:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
description: Team Dremio table schemas
globs: [&amp;quot;**/*.sql&amp;quot;, &amp;quot;**/*.py&amp;quot;]
alwaysApply: false
---

# Team Table Schemas

## analytics.gold.customer_metrics
- customer_id: VARCHAR (primary key)
- lifetime_value: DECIMAL(10,2)
- segment: VARCHAR (values: &apos;enterprise&apos;, &apos;mid-market&apos;, &apos;smb&apos;)
- last_order_date: TIMESTAMP
- churn_risk_score: FLOAT

## analytics.gold.revenue_daily
- date_key: TIMESTAMP
- product_category: VARCHAR
- region: VARCHAR
- revenue: DECIMAL(12,2)
- orders: INT
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This gives Cursor exact schema knowledge for your project, so the AI generates SQL with correct column names and types instead of guessing.&lt;/p&gt;
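&lt;p&gt;With exact schemas in the rule, a prompt like &amp;quot;average lifetime value by segment&amp;quot; can succeed on the first pass. A hand-written example of the kind of query the AI should produce, using only the columns declared above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Uses only columns declared in the schema rule above
SELECT segment,
       AVG(lifetime_value)   AS avg_ltv,
       AVG(churn_risk_score) AS avg_churn_risk,
       COUNT(*)              AS customers
FROM analytics.gold.customer_metrics
GROUP BY segment
ORDER BY avg_ltv DESC;
&lt;/code&gt;&lt;/pre&gt;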
&lt;h3&gt;Add Notepads for Reference Knowledge&lt;/h3&gt;
&lt;p&gt;Cursor also supports &lt;strong&gt;Notepads&lt;/strong&gt; for longer reference documents. Create a notepad in &lt;code&gt;.cursor/notepads/dremio-reference.md&lt;/code&gt; with comprehensive documentation. Notepads are available as &lt;code&gt;@notepad&lt;/code&gt; references in Chat and Composer but do not auto-load, keeping your context efficient.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Cursor: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Cursor becomes a powerful data development environment. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Open Cursor Chat (&lt;code&gt;Cmd+L&lt;/code&gt;) and ask questions in plain English:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Break it down by region and show the trend.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor uses the MCP connection to discover your tables, writes the SQL in the chat, and can run it against Dremio to return results. You get answers without switching to the Dremio UI.&lt;/p&gt;
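<p>&lt;p&gt;Behind the scenes, the AI translates that question into Dremio SQL roughly like the following sketch; the table and column names here are hypothetical:&lt;/p&gt;</p>
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- &amp;quot;Top 10 products by revenue last quarter, by region&amp;quot; (illustrative schema)
SELECT product_category,
       region,
       SUM(revenue) AS total_revenue
FROM analytics.gold.revenue_daily
WHERE date_key &amp;gt;= TIMESTAMP &apos;2025-10-01 00:00:00&apos;
  AND date_key &amp;lt; TIMESTAMP &apos;2026-01-01 00:00:00&apos;
GROUP BY product_category, region
ORDER BY total_revenue DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;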
&lt;p&gt;Follow up with deeper analysis:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Which of those top products has declining margins? Pull cost and revenue data for the last 6 months and show the margin trend.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor maintains context across the chat session, building on previous results. This turns the editor into a conversational data analysis tool.&lt;/p&gt;
&lt;p&gt;For teams with non-SQL users, Cursor Chat provides a natural language interface to the lakehouse directly inside the development environment.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Use Cursor Composer (&lt;code&gt;Cmd+I&lt;/code&gt;) for multi-file generation:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio and build a local HTML dashboard with Chart.js. Include monthly revenue trends, top products by region, and customer acquisition metrics. Make it filterable by date range. Put the HTML, CSS, and JavaScript in separate files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor Composer will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;index.html&lt;/code&gt; with the dashboard layout&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;styles.css&lt;/code&gt; with the dark theme and responsive design&lt;/li&gt;
&lt;li&gt;Create &lt;code&gt;app.js&lt;/code&gt; with Chart.js configurations and data fetching&lt;/li&gt;
&lt;li&gt;Embed query results as JSON data files&lt;/li&gt;
&lt;li&gt;Add interactive filter controls&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open &lt;code&gt;index.html&lt;/code&gt; in a browser for a complete dashboard running from local files. Cursor Composer excels at multi-file generation, making it ideal for this kind of project scaffolding.&lt;/p&gt;
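&lt;p&gt;The embedded query results come from SQL like this monthly-trend sketch; the gold view and column names are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Monthly revenue trend feeding the trend chart (hypothetical gold view)
SELECT DATE_TRUNC(&apos;MONTH&apos;, date_key) AS month,
       SUM(revenue) AS revenue,
       SUM(orders)  AS orders
FROM analytics.gold.revenue_daily
GROUP BY DATE_TRUNC(&apos;MONTH&apos;, date_key)
ORDER BY month;
&lt;/code&gt;&lt;/pre&gt;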
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build an interactive tool using Composer:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Streamlit app that connects to Dremio using dremioframe. Include a schema browser sidebar, a data preview tab with pagination, a SQL query editor with syntax highlighting, and CSV download buttons. Generate requirements.txt and a README.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor generates the full application across multiple files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app.py&lt;/code&gt; with Streamlit layout and dremioframe integration&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements.txt&lt;/code&gt; with pinned dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env.example&lt;/code&gt; with required environment variables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt; with setup and run instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;pip install -r requirements.txt &amp;amp;&amp;amp; streamlit run app.py&lt;/code&gt; for a local data explorer connected to your lakehouse.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering with inline AI:&lt;/p&gt;
&lt;p&gt;Highlight a comment in your Python file like &lt;code&gt;# Create bronze-silver-gold pipeline for user_events table&lt;/code&gt; and press &lt;code&gt;Cmd+K&lt;/code&gt;. Cursor generates the complete pipeline code inline:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bronze: raw data ingestion with column renames and TIMESTAMP casts&lt;/li&gt;
&lt;li&gt;Silver: deduplication, null checks, and type validation&lt;/li&gt;
&lt;li&gt;Gold: business logic aggregations with CASE WHEN classifications&lt;/li&gt;
&lt;li&gt;Error handling with retry logic&lt;/li&gt;
&lt;li&gt;Structured logging for monitoring&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The inline generation respects your &lt;code&gt;.cursor/rules/&lt;/code&gt; Dremio conventions, so the SQL follows your team&apos;s standards automatically.&lt;/p&gt;
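&lt;p&gt;A condensed sketch of the silver step such a prompt produces, assuming a hypothetical &lt;code&gt;user_events&lt;/code&gt; layout:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Silver: deduplicate and normalize types from bronze (illustrative names)
CREATE TABLE events.silver.user_events AS
SELECT DISTINCT
       event_id,
       user_id,
       CAST(event_date AS TIMESTAMP) AS event_ts,
       event_type
FROM events.bronze.user_events
WHERE event_id IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;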
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Use Composer to scaffold a REST API:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that connects to Dremio using dremioframe. Create endpoints for customer segments, revenue analytics, and product performance. Include Pydantic models, request validation, response caching, and auto-generated OpenAPI docs.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Cursor generates the complete API across multiple files with proper project structure, ready for &lt;code&gt;uvicorn main:app --reload&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor Rules&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, pattern-matched context&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards per file type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Rules&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, patterns, and team conventions&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Combine them for the strongest setup. The MCP server gives live data access; Cursor rules enforce conventions scoped to relevant file types; pre-built skills provide broad Dremio knowledge; and custom rules capture your team&apos;s specific schemas and patterns.&lt;/p&gt;
&lt;p&gt;Start with the MCP server for immediate value. Add a &lt;code&gt;.cursorrules&lt;/code&gt; file for project-wide conventions. As your team develops specific patterns, create &lt;code&gt;.cursor/rules/*.mdc&lt;/code&gt; files with pattern matching for granular control.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in Cursor&apos;s &lt;strong&gt;Settings &amp;gt; MCP&lt;/strong&gt; or create &lt;code&gt;.cursor/mcp.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and run &lt;code&gt;./install.sh&lt;/code&gt; with local project install.&lt;/li&gt;
&lt;li&gt;Open Cursor Chat and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Cursor&apos;s AI accurate data context: the semantic layer provides business meaning, query federation provides universal access, and Reflections provide interactive speed. Cursor&apos;s rules system scopes that context intelligently, activating Dremio knowledge only when relevant.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Claude CoWork: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-claude-cowork/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-claude-cowork/</guid><description>
Claude CoWork is Anthropic&apos;s desktop agentic assistant. Unlike Claude Code (a terminal coding agent), CoWork operates as a general-purpose autonomous...</description><pubDate>Thu, 05 Mar 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude CoWork is Anthropic&apos;s desktop agentic assistant. Unlike Claude Code (a terminal coding agent), CoWork operates as a general-purpose autonomous agent that reads and writes files, browses the web, manages tasks, and generates complete project artifacts. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections.&lt;/p&gt;
&lt;p&gt;CoWork&apos;s strength is autonomous project execution. Give it a goal and grant it folder access, and it works through the steps independently. For data teams, this means CoWork can query your Dremio lakehouse, analyze the results, build a local dashboard, and write a summary report without you watching over every step.&lt;/p&gt;
&lt;p&gt;CoWork&apos;s context mechanism differs from that of code editors. There is no &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;AGENTS.md&lt;/code&gt; file. CoWork uses folder instructions and global instructions configured through the Claude Desktop app. This makes the integration approach different, but the end result is the same: an agent that understands your Dremio environment.&lt;/p&gt;
&lt;p&gt;CoWork also has a unique advantage for Dremio users who are not developers. Because CoWork is a desktop assistant rather than a coding tool, analysts and business users can use it to ask natural language questions about their lakehouse data. The MCP connection handles the SQL generation and execution behind the scenes.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, from the quickest MCP connection to building a full Dremio knowledge folder.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/03/cowork-dremio-architecture.png&quot; alt=&quot;Claude CoWork desktop assistant connecting to Dremio Agentic Lakehouse&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Claude CoWork&lt;/h2&gt;
&lt;p&gt;If you do not already have CoWork set up:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Claude Desktop&lt;/strong&gt; from &lt;a href=&quot;https://claude.ai/download&quot;&gt;claude.ai/download&lt;/a&gt; (available for macOS and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with your Anthropic account (Pro, Team, or Enterprise subscription required for CoWork features).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable CoWork&lt;/strong&gt; in the Claude Desktop app under &lt;strong&gt;Settings &amp;gt; Features &amp;gt; CoWork&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Grant folder access&lt;/strong&gt; by clicking &lt;strong&gt;Add Folder&lt;/strong&gt; and selecting your project directory.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;CoWork operates as a desktop assistant, not a terminal tool. You interact with it through the Claude Desktop interface, describe tasks in natural language, and it autonomously reads files, writes code, browses the web, and generates project artifacts.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project includes a built-in MCP server. CoWork supports MCP through Claude Desktop&apos;s connector system.&lt;/p&gt;
&lt;p&gt;Dremio also provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin for Claude&lt;/a&gt; that streamlines setup. If you use Claude Code alongside CoWork, you can install the plugin directly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/plugin marketplace add dremio/claude-plugins
/plugin install dremio@dremio-plugins
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file with your credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DREMIO_PAT=&amp;lt;your_personal_access_token&amp;gt;
DREMIO_PROJECT_ID=&amp;lt;your_project_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then add the Dremio MCP server through the &lt;a href=&quot;https://claude.ai&quot;&gt;Claude web interface&lt;/a&gt; under &lt;strong&gt;Customize &amp;gt; Connectors &amp;gt; Add custom connector&lt;/strong&gt;. CoWork automatically inherits MCP connections configured through the Claude web interface. Run &lt;code&gt;/dremio-setup&lt;/code&gt; in Claude Code for step-by-step guidance.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and name it (e.g., &amp;quot;Claude CoWork&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URIs for Claude:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://claude.com/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure the MCP Connector&lt;/h3&gt;
&lt;p&gt;In Claude Desktop, open &lt;strong&gt;Settings &amp;gt; Connectors&lt;/strong&gt;. Add a custom MCP connector with your Dremio project&apos;s MCP URL and the OAuth client ID. CoWork will now have access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; lists available tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names and types.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions from the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test it by telling CoWork: &amp;quot;Connect to Dremio and list the available tables in my project.&amp;quot; The agent will use the MCP tools to browse your catalog.&lt;/p&gt;
&lt;h3&gt;Self-Hosted MCP&lt;/h3&gt;
&lt;p&gt;For Dremio Software, configure the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; server in Claude Desktop&apos;s &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use Folder Instructions for Dremio Context&lt;/h2&gt;
&lt;p&gt;CoWork uses a folder-based context model. When you grant CoWork access to a folder, you can set instructions that apply whenever the agent works within that folder.&lt;/p&gt;
&lt;h3&gt;Setting Global Dremio Instructions&lt;/h3&gt;
&lt;p&gt;In Claude Desktop, go to &lt;strong&gt;Settings &amp;gt; CoWork &amp;gt; Global Instructions&lt;/strong&gt;. Add Dremio conventions that apply to every task:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;When working with Dremio:
- Use CREATE FOLDER IF NOT EXISTS, not CREATE NAMESPACE
- Tables in the Open Catalog use folder.subfolder.table_name without a catalog prefix
- External sources use source_name.schema.table_name
- Cast DATE to TIMESTAMP for consistent joins
- Never hardcode Personal Access Tokens; use environment variables
- Dremio is an Agentic Lakehouse, not a data warehouse
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Setting Folder-Specific Instructions&lt;/h3&gt;
&lt;p&gt;When you grant CoWork access to a project folder, add instructions specific to that project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;This folder contains a Dremio analytics project.
- Read dremio-docs/sql-reference.md before writing any SQL
- All tables are under the analytics namespace
- Bronze: analytics.bronze.*, Silver: analytics.silver.*, Gold: analytics.gold.*
- Use environment variable DREMIO_PAT for authentication
- Use environment variable DREMIO_URI for the Dremio endpoint
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Folder instructions load whenever CoWork operates in that directory, giving it project-specific context on top of the global Dremio defaults.&lt;/p&gt;
&lt;h3&gt;Agentic Memories&lt;/h3&gt;
&lt;p&gt;CoWork creates &amp;quot;agentic memories&amp;quot; as it works. After a few sessions with Dremio, CoWork builds persistent knowledge about your table schemas, common query patterns, and the SQL conventions it should follow. These memories survive across sessions, so the agent improves over time.&lt;/p&gt;
&lt;p&gt;For example, after CoWork runs its first few Dremio queries in a project, it remembers which tables exist, which columns tend to be useful, and which SQL patterns work best. The next time you ask a question, CoWork draws on this accumulated knowledge to write better queries faster.&lt;/p&gt;
&lt;p&gt;These memories play the same role as CLAUDE.md or AGENTS.md, but they are generated automatically rather than written by hand. For teams that do not want to maintain context files manually, agentic memories provide a self-improving alternative.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/03/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Load Pre-Built Dremio Docs into CoWork&lt;/h2&gt;
&lt;p&gt;Two community-supported open-source repositories provide Dremio context that CoWork can read directly.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; The &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;dremio/claude-plugins&lt;/a&gt; plugin and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; are officially maintained by Dremio. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-md: Best Fit for CoWork (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository is the best fit for CoWork. It contains &lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; (a master protocol file) and &lt;code&gt;dremio_sitemaps/&lt;/code&gt; (hierarchical documentation indices).&lt;/p&gt;
&lt;p&gt;Clone it and grant CoWork access to the folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set the folder instructions to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Before answering any Dremio questions, read DREMIO_AGENT.md in this folder.
Use the sitemaps in dremio_sitemaps/ to verify SQL syntax and find documentation.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;CoWork will read the protocol file, learn the SQL conventions, and use the sitemaps to validate any Dremio queries it generates.&lt;/p&gt;
&lt;h3&gt;dremio-agent-skill: Knowledge Files (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides a broader set of knowledge files covering CLI, Python SDK, SQL, and REST API:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Grant CoWork access to this folder and set instructions to: &amp;quot;Read dremio-skill/SKILL.md for Dremio capabilities. Reference the knowledge/ directory for SQL syntax, REST API, and Python SDK documentation.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Approach 4: Build a Custom Dremio Knowledge Folder&lt;/h2&gt;
&lt;p&gt;Create a dedicated folder with everything CoWork needs for your Dremio project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dremio-context/
  README.md                # Overview and instructions
  sql-conventions.md       # Team SQL rules
  table-schemas.md         # Exported schemas from Dremio
  common-queries.md        # Frequently used query patterns
  dremioframe-examples.md  # Python SDK code snippets
  rest-api-patterns.md     # API call examples
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Write a &lt;code&gt;README.md&lt;/code&gt; that tells CoWork how to use the folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio Project Context

Read this folder to understand our Dremio setup before working on data tasks.

## Quick Reference
- SQL conventions: sql-conventions.md
- Table schemas: table-schemas.md (updated weekly)
- Common queries: common-queries.md
- Python SDK: dremioframe-examples.md
- REST API: rest-api-patterns.md

## Rules
- Always use CREATE FOLDER IF NOT EXISTS
- Use TIMESTAMPDIFF for duration calculations
- Credentials are in environment variables, never hardcoded
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Grant CoWork access to this folder and set folder instructions to: &amp;quot;Before any Dremio task, read README.md in the dremio-context folder.&amp;quot;&lt;/p&gt;
&lt;p&gt;Export your actual table schemas from Dremio regularly and update &lt;code&gt;table-schemas.md&lt;/code&gt;. Include the queries your team runs most often in &lt;code&gt;common-queries.md&lt;/code&gt;. This grows into a living knowledge base that CoWork uses to generate increasingly accurate output.&lt;/p&gt;
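&lt;p&gt;The schema export itself can be scripted. The sketch below is illustrative, not an official tool: it assumes the Dremio Software REST SQL API (&lt;code&gt;POST /api/v3/sql&lt;/code&gt;, then polling the job endpoints) and credentials in the &lt;code&gt;DREMIO_URI&lt;/code&gt; and &lt;code&gt;DREMIO_PAT&lt;/code&gt; environment variables; Dremio Cloud uses different base URLs, and the helper names are made up for this example.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Sketch: regenerate table-schemas.md from INFORMATION_SCHEMA.
# Assumes the Dremio Software REST API and DREMIO_URI / DREMIO_PAT env vars.
import json
import os
import time
import urllib.request

def _call(path, payload=None):
    req = urllib.request.Request(
        os.environ[&apos;DREMIO_URI&apos;] + path,
        data=json.dumps(payload).encode() if payload else None,
        headers={&apos;Authorization&apos;: &apos;Bearer &apos; + os.environ[&apos;DREMIO_PAT&apos;],
                 &apos;Content-Type&apos;: &apos;application/json&apos;})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def fetch_columns(prefix):
    # Submit the SQL as a job, wait for it to finish, then read the results.
    sql = (&amp;quot;SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE &amp;quot;
           &amp;quot;FROM INFORMATION_SCHEMA.COLUMNS &amp;quot;
           &amp;quot;WHERE TABLE_SCHEMA LIKE &apos;&amp;quot; + prefix + &amp;quot;%&apos;&amp;quot;)
    job = _call(&apos;/api/v3/sql&apos;, {&apos;sql&apos;: sql})
    while _call(&apos;/api/v3/job/&apos; + job[&apos;id&apos;])[&apos;jobState&apos;] != &apos;COMPLETED&apos;:
        time.sleep(1)
    return _call(&apos;/api/v3/job/&apos; + job[&apos;id&apos;] + &apos;/results&apos;)[&apos;rows&apos;]

def schemas_to_markdown(rows):
    # Group columns by table, then emit one markdown section per table.
    tables = {}
    for row in rows:
        key = row[&apos;TABLE_SCHEMA&apos;] + &apos;.&apos; + row[&apos;TABLE_NAME&apos;]
        tables.setdefault(key, []).append((row[&apos;COLUMN_NAME&apos;], row[&apos;DATA_TYPE&apos;]))
    lines = [&apos;# Table Schemas&apos;, &apos;&apos;]
    for table, cols in sorted(tables.items()):
        lines += [&apos;## &apos; + table, &apos;| Column | Type |&apos;, &apos;|---|---|&apos;]
        lines += [&apos;| %s | %s |&apos; % col for col in cols]
        lines.append(&apos;&apos;)
    return &apos;\n&apos;.join(lines)

if __name__ == &apos;__main__&apos;:
    with open(&apos;dremio-context/table-schemas.md&apos;, &apos;w&apos;) as f:
        f.write(schemas_to_markdown(fetch_columns(&apos;analytics&apos;)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run a script like this on a schedule (cron or CI) so the file CoWork reads never drifts far from the live catalog.&lt;/p&gt;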
&lt;h2&gt;Using Dremio with CoWork: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, CoWork can execute complete data projects autonomously. Here are detailed examples.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;Ask CoWork plain questions about your data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Break it down by region.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork uses the MCP connection to discover relevant tables, writes the SQL, runs the query against Dremio, and returns formatted results with analysis. No SQL knowledge required on your part.&lt;/p&gt;
&lt;p&gt;Take it further with follow-up questions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Which of those top products had the highest return rates? Pull the return reasons and show the most common issues.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork remembers the previous results and builds on them. Its agentic memory system stores what it learns about your tables, so subsequent questions in the same project get faster, more accurate answers.&lt;/p&gt;
&lt;p&gt;This pattern is especially valuable for non-technical users. Business analysts, product managers, and executives can use CoWork to query the lakehouse without learning SQL or navigating the Dremio UI.&lt;/p&gt;
&lt;h3&gt;Build Locally Running Dashboards&lt;/h3&gt;
&lt;p&gt;Tell CoWork to build a complete dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio, then build me a local HTML dashboard with charts showing monthly revenue trends, top customers, and regional breakdowns. Use Chart.js for the visualizations. Add date filters and a dark theme.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries for each visualization&lt;/li&gt;
&lt;li&gt;Generate an HTML file with embedded CSS, JavaScript, and Chart.js&lt;/li&gt;
&lt;li&gt;Add interactive filter controls for date range and region&lt;/li&gt;
&lt;li&gt;Save it to your project folder&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open the HTML file in a browser, and you have a working dashboard running entirely from a local file. Share it with stakeholders by dropping it in Slack or email. No server or deployment needed.&lt;/p&gt;
&lt;p&gt;For recurring dashboards, tell CoWork to regenerate it weekly. Its agentic memory remembers the queries and file structure from the previous run.&lt;/p&gt;
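&lt;p&gt;To make that output concrete, here is a minimal sketch of the kind of single-file Chart.js page CoWork generates. The labels and revenue numbers are hardcoded placeholders; in practice the agent fills them in from live Dremio query results.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Sketch: a self-contained Chart.js dashboard written to one HTML file.
# Labels and revenue values are placeholders for real query results.
import json

def render_dashboard(labels, revenue):
    data = json.dumps({&apos;labels&apos;: labels, &apos;datasets&apos;: [
        {&apos;label&apos;: &apos;Monthly Revenue&apos;, &apos;data&apos;: revenue}]})
    return (&apos;&amp;lt;!DOCTYPE html&amp;gt;&amp;lt;html&amp;gt;&amp;lt;head&amp;gt;&apos;
            &apos;&amp;lt;script src=&amp;quot;https://cdn.jsdelivr.net/npm/chart.js&amp;quot;&amp;gt;&amp;lt;/script&amp;gt;&apos;
            &apos;&amp;lt;/head&amp;gt;&amp;lt;body style=&amp;quot;background:#1e1e1e;color:#eee&amp;quot;&amp;gt;&apos;
            &apos;&amp;lt;h1&amp;gt;Monthly Revenue&amp;lt;/h1&amp;gt;&amp;lt;canvas id=&amp;quot;rev&amp;quot;&amp;gt;&amp;lt;/canvas&amp;gt;&apos;
            &apos;&amp;lt;script&amp;gt;new Chart(document.getElementById(&amp;quot;rev&amp;quot;), &apos;
            &apos;{type: &amp;quot;line&amp;quot;, data: &apos; + data + &apos;});&amp;lt;/script&amp;gt;&apos;
            &apos;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;&apos;)

with open(&apos;dashboard.html&apos;, &apos;w&apos;) as f:
    f.write(render_dashboard([&apos;Jan&apos;, &apos;Feb&apos;, &apos;Mar&apos;], [120000, 135000, 128000]))
&lt;/code&gt;&lt;/pre&gt;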
&lt;h3&gt;Create Data Exploration Apps&lt;/h3&gt;
&lt;p&gt;Ask CoWork to build a more interactive tool:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Python Flask app that connects to Dremio using dremioframe. It should let me type a table name and see the schema, preview 100 rows, and run custom SQL queries. Include a clean UI with syntax highlighting and CSV download.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork writes the Python code, creates the HTML templates, and generates a &lt;code&gt;requirements.txt&lt;/code&gt;. Run &lt;code&gt;pip install -r requirements.txt &amp;amp;&amp;amp; python app.py&lt;/code&gt; and you have a local data exploration app connected to your lakehouse.&lt;/p&gt;
&lt;p&gt;This is especially useful for teams that need quick internal tools without going through a formal development cycle.&lt;/p&gt;
&lt;h3&gt;Generate Automated Reports&lt;/h3&gt;
&lt;p&gt;Schedule CoWork to generate recurring analytical reports:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query this week&apos;s data quality metrics from Dremio&apos;s gold layer, compare them to last week, and write a markdown report with tables and recommendations. Include row count trends, null percentages by column, and any columns that exceeded the 5% null threshold. Save it to the reports/ folder.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork runs the queries, computes the comparisons, generates a formatted report with tables, and writes recommendations based on the data. The report is ready to share with stakeholders without any manual analysis.&lt;/p&gt;
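&lt;p&gt;The week-over-week comparison at the heart of such a report is easy to sketch. The function below is illustrative: null percentages arrive as plain dictionaries with placeholder values, where in practice CoWork computes them with SQL against the gold layer.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Sketch: week-over-week null-rate comparison for a data quality report.
# Inputs map column name to null percentage; the values are placeholders.
def quality_report(this_week, last_week, threshold=5.0):
    lines = [&apos;# Weekly Data Quality Report&apos;, &apos;&apos;,
             &apos;| Column | Null % last week | Null % this week | Status |&apos;,
             &apos;|---|---|---|---|&apos;]
    for col in sorted(this_week):
        now, prev = this_week[col], last_week.get(col, 0.0)
        status = &apos;ALERT&apos; if now &amp;gt; threshold else &apos;ok&apos;
        lines.append(&apos;| %s | %.1f | %.1f | %s |&apos; % (col, prev, now, status))
    return &apos;\n&apos;.join(lines)

print(quality_report({&apos;email&apos;: 7.2, &apos;region&apos;: 0.4},
                     {&apos;email&apos;: 3.1, &apos;region&apos;: 0.5}))
&lt;/code&gt;&lt;/pre&gt;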
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create backend services that serve lakehouse data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that queries Dremio&apos;s gold-layer views and serves customer analytics through REST endpoints. Add endpoints for customer segments, revenue by geography, and cohort retention. Include request validation and JSON response formatting.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;CoWork generates the full application with proper error handling and dremioframe connection management. Deploy it locally or containerize it for production.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Connector&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog access&lt;/td&gt;
&lt;td&gt;Natural language data exploration, ad-hoc analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Folder Instructions&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, project context&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Docs&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge&lt;/td&gt;
&lt;td&gt;Quick setup with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Knowledge Folder&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored schemas, queries, and patterns&lt;/td&gt;
&lt;td&gt;Mature teams with specific data models&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP connector. It gives CoWork live data access in five minutes, and you can immediately start asking natural language questions. Add folder instructions and knowledge files as you develop team-specific conventions.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; ($400 in compute credits).&lt;/li&gt;
&lt;li&gt;Set up OAuth and configure the MCP connector in Claude Desktop.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; and grant CoWork folder access.&lt;/li&gt;
&lt;li&gt;Ask CoWork to explore your Dremio catalog.&lt;/li&gt;
&lt;li&gt;Try: &amp;quot;Query my sales data in Dremio and build a local dashboard with Chart.js.&amp;quot;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives CoWork the data foundation it needs: the semantic layer provides business context, query federation provides universal data access, and Reflections provide interactive speed. CoWork&apos;s autonomous execution model turns that data access into complete deliverables, from dashboards to reports to data apps.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, see the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or take the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Claude Code: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-claude-code/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-claude-code/</guid><description>
Claude Code is Anthropic&apos;s terminal-based coding agent. It reads your files, writes code, runs commands, and maintains context across a session. Drem...</description><pubDate>Thu, 05 Mar 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Claude Code is Anthropic&apos;s terminal-based coding agent. It reads your files, writes code, runs commands, and maintains context across a session. Dremio is a unified lakehouse platform that gives AI agents three things they need to answer business questions accurately: deep business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them means your coding agent can query live data, validate SQL against real schemas, and generate scripts that actually work against your lakehouse. Without this connection, Claude Code treats Dremio like any other database and often hallucinates function names or syntax. With it, the agent knows your table schemas, your business logic encoded in views, and the correct Dremio SQL dialect.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/02/claude-code-dremio-mcp-architecture.png&quot; alt=&quot;Claude Code connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Claude Code&lt;/h2&gt;
&lt;p&gt;If you do not already have Claude Code installed, here is how to get started:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install Node.js&lt;/strong&gt; (version 18 or later) from &lt;a href=&quot;https://nodejs.org/&quot;&gt;nodejs.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install Claude Code&lt;/strong&gt; globally via npm:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;npm install -g @anthropic-ai/claude-code
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Launch Claude Code&lt;/strong&gt; by running &lt;code&gt;claude&lt;/code&gt; in your terminal from any project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authenticate&lt;/strong&gt; with your Anthropic API key or Claude Pro/Team subscription on first launch.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Claude Code runs in your terminal and reads your project files for context. It can execute shell commands, edit files, and interact with MCP servers. No IDE or editor is required.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that lets AI tools call external services. Every Dremio Cloud project ships with a built-in MCP server. Claude Code supports MCP natively. Connecting them takes about five minutes.&lt;/p&gt;
&lt;p&gt;The fastest path is the &lt;strong&gt;official Dremio plugin for Claude Code&lt;/strong&gt; from the &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;dremio/claude-plugins&lt;/a&gt; repository. This is maintained by Dremio and provides guided setup.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and open your project. Navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;; the MCP server URL is listed there. Copy it.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s hosted MCP server uses OAuth for authentication. This means Claude Code connects with your identity and your existing access controls apply to every query the agent runs.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enter an application name (e.g., &amp;quot;Claude Code MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the redirect URIs for Claude:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://claude.com/api/mcp/auth_callback&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save the application and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Claude Code&apos;s MCP Client&lt;/h3&gt;
&lt;p&gt;Claude Code reads MCP server definitions from a &lt;code&gt;.mcp.json&lt;/code&gt; file. Create one in your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;,
      &amp;quot;auth&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;oauth&amp;quot;,
        &amp;quot;clientId&amp;quot;: &amp;quot;YOUR_CLIENT_ID&amp;quot;
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For a global configuration that applies across all your projects, place the file at &lt;code&gt;~/.mcp.json&lt;/code&gt; instead.&lt;/p&gt;
&lt;p&gt;Restart Claude Code. The agent now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns an index of available tables with descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column names, types, and metadata for any table or view.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls wiki descriptions and labels you have set in the Dremio catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows upstream dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results as JSON.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can verify the connection by asking Claude Code: &amp;quot;What tables are available in Dremio?&amp;quot; The agent will call &lt;code&gt;GetUsefulSystemTableNames&lt;/code&gt; and return your catalog contents.&lt;/p&gt;
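&lt;p&gt;A typical exchange chains these tools together. The sketch below stubs all three calls with canned data purely to show the discovery-then-query sequence; the real implementations are served by Dremio&apos;s MCP server, and the table and column names here are invented.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Stubbed sketch of the agent&apos;s tool sequence: discover, inspect, query.
def get_useful_system_table_names():
    return [{&apos;name&apos;: &apos;sales.orders&apos;, &apos;description&apos;: &apos;Order-level facts&apos;}]

def get_schema_of_table(name):
    return {&apos;sales.orders&apos;: [(&apos;order_id&apos;, &apos;BIGINT&apos;), (&apos;amount&apos;, &apos;DOUBLE&apos;)]}[name]

def run_sql_query(sql):
    return [{&apos;order_id&apos;: 1, &apos;amount&apos;: 99.5}]

tables = get_useful_system_table_names()          # 1. what tables exist?
schema = get_schema_of_table(tables[0][&apos;name&apos;])   # 2. what shape are they?
cols = &apos;, &apos;.join(name for name, _ in schema)
rows = run_sql_query(&apos;SELECT %s FROM %s LIMIT 10&apos;  # 3. run schema-validated SQL
                     % (cols, tables[0][&apos;name&apos;]))
print(rows)
&lt;/code&gt;&lt;/pre&gt;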
&lt;h3&gt;Official Dremio Plugin for Claude Code&lt;/h3&gt;
&lt;p&gt;Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude Code plugin&lt;/a&gt; that streamlines setup. Install it from the plugin marketplace:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/plugin marketplace add dremio/claude-plugins
/plugin install dremio@dremio-plugins
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a &lt;code&gt;.env&lt;/code&gt; file in your project directory with your credentials:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DREMIO_PAT=&amp;lt;your_personal_access_token&amp;gt;
DREMIO_PROJECT_ID=&amp;lt;your_project_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then add the Dremio MCP server through the &lt;a href=&quot;https://claude.ai&quot;&gt;Claude web interface&lt;/a&gt; under &lt;strong&gt;Customize &amp;gt; Connectors &amp;gt; Add custom connector&lt;/strong&gt;. Claude Code automatically inherits the connection.&lt;/p&gt;
&lt;p&gt;Run &lt;code&gt;/dremio-setup&lt;/code&gt; in Claude Code for step-by-step guidance. The plugin walks you through OAuth configuration, including setting the redirect URI to &lt;code&gt;http://localhost/callback,https://claude.ai/api/mcp/auth_callback&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This is the recommended starting point for Claude Code users because it is officially maintained by Dremio and handles the configuration details for you.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;If you run Dremio Software instead of Dremio Cloud, use the open-source &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;dremio-mcp&lt;/a&gt; repository:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/dremio/dremio-mcp
cd dremio-mcp
uv run dremio-mcp-server config create dremioai \
  --uri https://your-dremio-instance.com \
  --pat YOUR_PERSONAL_ACCESS_TOKEN
uv run dremio-mcp-server config create claude
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The second command writes the MCP server entry directly into Claude&apos;s desktop config. For Claude Code (terminal), add the server to your &lt;code&gt;.mcp.json&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The self-hosted server supports three modes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;FOR_DATA_PATTERNS&lt;/code&gt; for exploring and querying data (default)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FOR_SELF&lt;/code&gt; for system introspection and performance analysis&lt;/li&gt;
&lt;li&gt;&lt;code&gt;FOR_PROMETHEUS&lt;/code&gt; for correlating Dremio metrics with Prometheus&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Approach 2: Use CLAUDE.md for Dremio Context&lt;/h2&gt;
&lt;p&gt;MCP gives Claude Code live access to your data. But sometimes you need the agent to follow specific conventions, use the right SQL dialect, or know where to find documentation. The MCP connection tells Claude Code what data exists. Context files tell it how your team works with that data.&lt;/p&gt;
&lt;h3&gt;What CLAUDE.md Does&lt;/h3&gt;
&lt;p&gt;Claude Code auto-loads &lt;code&gt;CLAUDE.md&lt;/code&gt; from your project root at the start of every session. It acts as a set of persistent instructions that carry over between conversations, so you do not need to re-explain your project every time you start a new session.&lt;/p&gt;
&lt;p&gt;The file supports three placement levels. A global &lt;code&gt;~/.claude/CLAUDE.md&lt;/code&gt; applies to every project you open. A project-root &lt;code&gt;CLAUDE.md&lt;/code&gt; applies to that specific repo. And &lt;code&gt;.claude/rules/*.md&lt;/code&gt; files let you split rules into focused modules that load with the same priority as the project-root file. Project-level files override global ones, so you can set organizational defaults and override them per-repo.&lt;/p&gt;
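&lt;p&gt;The three levels can be pictured as a layout like this (the paths and file names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;~/.claude/CLAUDE.md            # global defaults for every project
my-repo/
  CLAUDE.md                    # project rules for this repo
  .claude/rules/
    dremio-conventions.md      # auto-loaded alongside CLAUDE.md
    project-style.md           # focused rule modules
&lt;/code&gt;&lt;/pre&gt;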
&lt;h3&gt;Writing a Dremio-Focused CLAUDE.md&lt;/h3&gt;
&lt;p&gt;Here is an example &lt;code&gt;CLAUDE.md&lt;/code&gt; that teaches Claude Code how to work with Dremio:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Project Context

This project uses Dremio Cloud as its lakehouse platform.

## Dremio SQL Conventions
- Use `CREATE FOLDER IF NOT EXISTS` (not CREATE NAMESPACE or CREATE SCHEMA)
- Tables in the built-in Open Catalog use `folder.subfolder.table_name` without a catalog prefix
- External federated sources use `source_name.schema.table_name`
- Cast DATE columns to TIMESTAMP for consistent joins
- Use TIMESTAMPDIFF for duration calculations

## Credentials
- Never hardcode Personal Access Tokens. Use environment variable: DREMIO_PAT
- Dremio Cloud endpoint is in environment variable: DREMIO_URI

## API Reference
- REST API docs: https://docs.dremio.com/current/reference/api/
- SQL reference: https://docs.dremio.com/current/reference/sql/
- For detailed SQL validation, read ./dremio-docs/sql-reference.md

## Terminology
- Call it &amp;quot;Agentic Lakehouse&amp;quot;, not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; are pre-computed optimizations, not &amp;quot;materialized views&amp;quot;
- &amp;quot;Open Catalog&amp;quot; is built on Apache Polaris
- The AI Agent is a co-pilot, not a chatbot
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Progressive Disclosure with Supplemental Files&lt;/h3&gt;
&lt;p&gt;Keep &lt;code&gt;CLAUDE.md&lt;/code&gt; under 300 lines. For detailed references, store them in separate files and tell Claude Code where to find them:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;## Documentation References
- For Dremio SQL syntax details, read `./docs/dremio-sql-reference.md`
- For Python SDK (dremioframe) usage, read `./docs/dremioframe-guide.md`
- For REST API endpoints, read `./docs/dremio-rest-api.md`
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Claude Code only loads these files when it needs them, keeping your context window efficient. You can also instruct the agent explicitly: &amp;quot;Before writing any Dremio SQL, read &lt;code&gt;./docs/dremio-sql-reference.md&lt;/code&gt; to verify syntax.&amp;quot;&lt;/p&gt;
&lt;p&gt;You can also place rule files in &lt;code&gt;.claude/rules/&lt;/code&gt; and they will be auto-loaded with the same priority as &lt;code&gt;CLAUDE.md&lt;/code&gt;. This is useful for separating concerns. For example, &lt;code&gt;.claude/rules/dremio-conventions.md&lt;/code&gt; for SQL rules and &lt;code&gt;.claude/rules/project-style.md&lt;/code&gt; for code style.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/02/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;p&gt;Beyond the official plugin, two community-supported open-source repositories provide ready-made Dremio context for coding agents. Both work with Claude Code.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; The &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;dremio/claude-plugins&lt;/a&gt; plugin is officially maintained by Dremio. The repositories below are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product. Libraries like dremioframe (the Dremio Python SDK referenced in the skill) are also community-supported.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill: Full Agent Skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository contains a complete skill directory that teaches AI assistants how to interact with Dremio.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is included:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;dremio-skill/
  SKILL.md          # Entry point defining capabilities
  knowledge/        # Comprehensive docs for:
    cli/            #   Dremio CLI administration
    python/         #   dremioframe Python SDK
    sql/            #   SQL syntax, Iceberg DML, metadata
    rest-api/       #   REST API endpoints
  rules/
    .cursorrules    # Config for Cursor/VS Code
    AGENTS.md       # Config for OpenCode/Codex
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Run the interactive installer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The installer asks you to choose:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Global Install (Symlink)&lt;/strong&gt; symlinks the skill to &lt;code&gt;~/.claude/skills/&lt;/code&gt; so every Claude Code session discovers it automatically. Updates to the cloned repo are reflected immediately.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Local Project Install (Copy)&lt;/strong&gt; copies the skill into your project directory and sets up &lt;code&gt;.claude&lt;/code&gt; symlinks so Claude Code auto-detects it. The skill travels with your repo, so every team member gets the same context.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After installation, start Claude Code and try: &amp;quot;Using the Dremio skill, write a dremioframe script to query my customer table.&amp;quot;&lt;/p&gt;
&lt;h3&gt;dremio-agent-md: Documentation Protocol (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository takes a different approach. Instead of a skill with structured knowledge files, it provides a master protocol file and a browsable sitemap of the entire Dremio documentation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is included:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DREMIO_AGENT.md&lt;/code&gt; defines how the agent should validate SQL, handle security (credentials via &lt;code&gt;.env&lt;/code&gt;), and navigate the documentation.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dremio_sitemaps/&lt;/code&gt; contains hierarchical markdown indices of official Dremio docs for both Cloud and Software versions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Usage with Claude Code:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Clone the repo into your project or a reference directory:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then tell Claude Code at the start of your session:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Read DREMIO_AGENT.md in the dremio-agent-md directory to understand Dremio protocols. Use the sitemaps in dremio_sitemaps/ to verify any Dremio features or SQL syntax before generating code.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The agent navigates the sitemaps to find the correct documentation page for whatever feature you are working with, much as a developer looks up the right function signature before writing a query.&lt;/p&gt;
&lt;p&gt;This approach is especially useful when you need Claude Code to validate SQL against the official docs rather than rely on its training data.&lt;/p&gt;
&lt;h2&gt;Approach 4: Build Your Own Dremio Skill&lt;/h2&gt;
&lt;p&gt;If the pre-built options do not cover your specific workflow, build a custom skill. A skill is just a directory with a &lt;code&gt;SKILL.md&lt;/code&gt; file and optional supporting docs.&lt;/p&gt;
&lt;h3&gt;Create the Skill Directory&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;my-dremio-skill/
  SKILL.md
  knowledge/
    sql-conventions.md
    rest-api-endpoints.md
    project-schemas.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Write SKILL.md&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;SKILL.md&lt;/code&gt; file needs YAML frontmatter for discovery and markdown instructions for the agent:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;---
name: My Dremio Skill
description: Custom conventions and API patterns for our team&apos;s Dremio Cloud project
---

# My Dremio Skill

## When to Use
Use this skill when working with Dremio queries, dremioframe scripts,
or any code that interacts with our lakehouse.

## SQL Rules
- All tables live under the `analytics` namespace
- Use `analytics.bronze.*` for raw views, `analytics.silver.*` for joins,
  `analytics.gold.*` for final datasets
- Always use TIMESTAMP, never DATE
- Validate function names against `knowledge/sql-conventions.md`

## Authentication
- Use environment variable DREMIO_PAT for Personal Access Tokens
- Cloud endpoint: Use environment variable DREMIO_URI

## Reference Files
- SQL conventions: knowledge/sql-conventions.md
- REST API: knowledge/rest-api-endpoints.md
- Project schemas: knowledge/project-schemas.md
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Install the Skill&lt;/h3&gt;
&lt;p&gt;For Claude Code, place the skill in one of these locations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Global:&lt;/strong&gt; &lt;code&gt;~/.claude/skills/my-dremio-skill/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project-local:&lt;/strong&gt; &lt;code&gt;.claude/skills/my-dremio-skill/&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Claude Code discovers skills by reading their &lt;code&gt;SKILL.md&lt;/code&gt; files. When a user prompt matches the skill description, the agent loads the full instructions automatically.&lt;/p&gt;
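&lt;p&gt;As a rough illustration of what this discovery step involves, the sketch below parses the frontmatter of a &lt;code&gt;SKILL.md&lt;/code&gt; file. It is illustrative only: the parsing logic and the example skill text are hypothetical, not Claude Code&apos;s actual implementation.&lt;/p&gt;

```python
# Hypothetical sketch of indexing a SKILL.md file for discovery.
# Claude Code's real matching logic is internal to the tool.

def parse_skill_frontmatter(text):
    """Extract key/value pairs from a SKILL.md YAML frontmatter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

skill_md = """---
name: My Dremio Skill
description: Custom conventions for our Dremio Cloud project
---

# My Dremio Skill
"""

print(parse_skill_frontmatter(skill_md)["name"])  # My Dremio Skill
```

&lt;p&gt;The point is the token economy: only the small frontmatter needs to be read up front, while the skill body and knowledge files load later, on demand.&lt;/p&gt;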
&lt;h3&gt;Add Knowledge Files&lt;/h3&gt;
&lt;p&gt;Populate the &lt;code&gt;knowledge/&lt;/code&gt; directory with the specific references your team needs. You might include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your project&apos;s table schemas exported from Dremio&lt;/li&gt;
&lt;li&gt;SQL patterns that are specific to your data model&lt;/li&gt;
&lt;li&gt;dremioframe code snippets for common operations&lt;/li&gt;
&lt;li&gt;REST API call examples with your specific endpoints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The advantage of a custom skill over a generic &lt;code&gt;CLAUDE.md&lt;/code&gt; is on-demand loading. Because skills are matched semantically against the prompt and loaded only when relevant, they do not consume context tokens until they are needed.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Claude Code: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Claude Code becomes a data engineering partner. Here are detailed examples you can try immediately.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;This is the simplest and most powerful use case. Ask Claude Code questions in plain English and get answers from production data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 customers by revenue last quarter? Show month-over-month trends.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code uses the MCP connection to discover your tables, writes the SQL, runs it against Dremio, and returns formatted results with analysis. You get answers from production data in seconds without writing a single query yourself.&lt;/p&gt;
&lt;p&gt;You can go deeper with follow-up questions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Which of those top 10 customers had declining order frequency? Pull their last 6 months of order data and calculate the trend.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Because Claude Code maintains context across the session, it remembers the previous query results and builds on them. The MCP connection gives it live access to run the follow-up query without you needing to re-explain the schema.&lt;/p&gt;
&lt;p&gt;This pattern turns Claude Code into a conversational analytics tool. Business analysts who are comfortable with English but not SQL can use it to explore data, test hypotheses, and generate insights directly from the lakehouse.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Ask Claude Code to create a complete, self-contained dashboard:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Query our gold-layer sales views in Dremio and build a local HTML dashboard with Chart.js. Include monthly revenue trends, top products by region, and customer acquisition metrics. Make it filterable by date range and downloadable as PDF.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code will:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Use MCP to discover your gold-layer views and their schemas&lt;/li&gt;
&lt;li&gt;Write and execute SQL queries to pull the relevant data&lt;/li&gt;
&lt;li&gt;Generate an HTML file with embedded CSS, JavaScript, and Chart.js configurations&lt;/li&gt;
&lt;li&gt;Embed the query results directly into the JavaScript as data arrays&lt;/li&gt;
&lt;li&gt;Add interactive filters and a print-to-PDF button&lt;/li&gt;
&lt;li&gt;Save everything to your project directory&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Open the HTML file in a browser and you have an interactive dashboard running from a local file. No server, no deployment, no infrastructure. Share it with your team by dropping it in Slack or email.&lt;/p&gt;
&lt;p&gt;For recurring dashboards, save the prompt in a script and re-run it weekly to regenerate the dashboard with fresh data from Dremio.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Build a more interactive tool for ongoing data exploration:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a Python Streamlit app that uses dremioframe to connect to Dremio. Include a schema browser sidebar, a data preview tab with pagination, and a SQL query editor with results. Add download buttons for CSV export and a query history panel.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code writes the full Python application:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;app.py&lt;/code&gt; with Streamlit layout, dremioframe connection, and query execution&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements.txt&lt;/code&gt; with pinned dependencies&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.env.example&lt;/code&gt; showing required environment variables&lt;/li&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt; with setup instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Run &lt;code&gt;pip install -r requirements.txt &amp;amp;&amp;amp; streamlit run app.py&lt;/code&gt; and you have a local data exploration tool connected to your lakehouse. Your team can use it for ad-hoc analysis without needing direct access to the Dremio UI.&lt;/p&gt;
&lt;p&gt;This pattern works well for creating internal tools quickly. Instead of waiting for a formal BI tool deployment, you can have a working data explorer in minutes.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Automate data engineering workflows:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Write a dremioframe script that reads new CSV files from the staging folder, creates bronze views in Dremio, builds silver views with data quality validations (null checks, type casting, deduplication), and creates gold views with business logic aggregations. Include error handling, logging, and a dry-run mode.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code uses the Dremio skill to write production-quality pipeline code that follows Medallion Architecture conventions. The script includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bronze layer: raw data ingestion with column renames and TIMESTAMP casts&lt;/li&gt;
&lt;li&gt;Silver layer: data quality rules, deduplication, and join logic&lt;/li&gt;
&lt;li&gt;Gold layer: business metric aggregations and CASE WHEN classifications&lt;/li&gt;
&lt;li&gt;Error handling with retry logic for transient Dremio connection issues&lt;/li&gt;
&lt;li&gt;Structured logging for pipeline monitoring&lt;/li&gt;
&lt;/ul&gt;
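&lt;p&gt;To make the layering concrete, here is a minimal sketch of the kind of SQL such a script could emit. The folder, table, and column names are hypothetical, and a real pipeline would execute these statements through dremioframe or the REST API rather than just printing them.&lt;/p&gt;

```python
# Hypothetical sketch: generate Medallion-layer view statements.
# Names (analytics.bronze, orders, order_id, updated_at) are examples only.

def bronze_view_sql(source, table):
    # Bronze: a thin view over the raw source.
    return (
        f"CREATE OR REPLACE VIEW analytics.bronze.{table} AS\n"
        f"SELECT * FROM {source}"
    )

def silver_dedup_sql(table, key, order_col):
    # Silver: deduplicate on a business key, keeping the latest record.
    return (
        f"CREATE OR REPLACE VIEW analytics.silver.{table} AS\n"
        f"SELECT * FROM (\n"
        f"  SELECT *, ROW_NUMBER() OVER (\n"
        f"    PARTITION BY {key} ORDER BY {order_col} DESC\n"
        f"  ) AS rn\n"
        f"  FROM analytics.bronze.{table}\n"
        f") WHERE rn = 1"
    )

print(silver_dedup_sql("orders", "order_id", "updated_at"))
```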
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Create a REST API that serves lakehouse data to other applications:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Build a FastAPI application that connects to Dremio using dremioframe. Create endpoints for: GET /api/customers (paginated), GET /api/customers/{id}/orders, GET /api/analytics/revenue?period=monthly. Add request validation, error handling, and OpenAPI documentation.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Claude Code generates a complete API server with typed request/response models, query parameterization to prevent SQL injection, and auto-generated Swagger docs. Deploy it locally or containerize it for production use.&lt;/p&gt;
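&lt;p&gt;The parameterization pattern is worth spelling out. The sketch below, with hypothetical table and column names, shows one way to validate user input before it reaches the SQL string: whitelist sort columns and coerce pagination values to clamped integers.&lt;/p&gt;

```python
# Hypothetical sketch of injection-safe query building for a paginated
# endpoint. Table and column names are examples only.

ALLOWED_SORT_COLUMNS = {"customer_id", "name", "created_at"}

def customers_query(limit, offset, sort="customer_id"):
    if sort not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort}")
    limit = max(1, min(int(limit), 500))   # clamp page size
    offset = max(0, int(offset))           # no negative offsets
    return (
        f"SELECT * FROM analytics.gold.customers "
        f"ORDER BY {sort} LIMIT {limit} OFFSET {offset}"
    )

print(customers_query(50, 0))
```

&lt;p&gt;Anything outside the whitelist raises an error instead of being interpolated, so a value like &lt;code&gt;name; DROP TABLE x&lt;/code&gt; never reaches Dremio.&lt;/p&gt;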
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time data access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLAUDE.md&lt;/td&gt;
&lt;td&gt;10 minutes&lt;/td&gt;
&lt;td&gt;Convention enforcement, doc references, credential rules&lt;/td&gt;
&lt;td&gt;Teams with specific SQL standards or project conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Getting started quickly with broad Dremio coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Skill&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Tailored to your exact schemas, patterns, and workflows&lt;/td&gt;
&lt;td&gt;Mature teams with project-specific conventions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These approaches are not mutually exclusive. A common setup combines the MCP server for live data access with a custom &lt;code&gt;CLAUDE.md&lt;/code&gt; for project conventions. Or start with the pre-built &lt;code&gt;dremio-agent-skill&lt;/code&gt; and add a &lt;code&gt;CLAUDE.md&lt;/code&gt; for your team-specific overrides.&lt;/p&gt;
&lt;p&gt;The strongest configuration uses all four layers: MCP for live connectivity, CLAUDE.md for project rules, a pre-built skill for general Dremio knowledge, and custom knowledge files for your specific schemas and patterns.&lt;/p&gt;
&lt;p&gt;If you are evaluating Dremio for the first time, start with the MCP server alone. It takes five minutes and gives you immediate value. As your usage matures and you need the agent to follow team conventions or validate against specific documentation, layer in the context files and skills.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it to Claude Code&apos;s &lt;code&gt;.mcp.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Clone &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; and run &lt;code&gt;./install.sh&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Start Claude Code and ask it to explore your Dremio catalog.&lt;/li&gt;
&lt;/ol&gt;
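&lt;p&gt;For step 3, the entry in &lt;code&gt;.mcp.json&lt;/code&gt; looks roughly like this. The URL is a placeholder, and the exact field names can vary by Claude Code version, so check the Claude Code MCP documentation if the server is not picked up:&lt;/p&gt;

```json
{
  "mcpServers": {
    "dremio": {
      "type": "http",
      "url": "https://YOUR_PROJECT_MCP_URL"
    }
  }
}
```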
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Claude Code what it needs to write accurate SQL: the semantic layer provides business context, query federation provides universal data access, and Reflections provide interactive speed. The MCP server is the bridge that connects them.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Use Dremio with Amazon Kiro: Connect, Query, and Build Data Apps</title><link>https://iceberglakehouse.com/posts/2026-03-aitool-amazon-kiro/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-aitool-amazon-kiro/</guid><description>
Amazon Kiro is an agentic AI IDE from AWS that introduces spec-driven development to the coding workflow. Instead of jumping straight to code, Kiro h...</description><pubDate>Thu, 05 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Amazon Kiro is an agentic AI IDE from AWS that introduces spec-driven development to the coding workflow. Instead of jumping straight to code, Kiro helps you define structured specifications — requirements, technical designs, and task breakdowns — before writing a single line. It then generates code that follows those specs and keeps everything in sync as the project evolves. Dremio is a unified lakehouse platform that provides business context through its semantic layer, universal data access through query federation, and interactive speed through Reflections and Apache Arrow.&lt;/p&gt;
&lt;p&gt;Connecting them gives Kiro&apos;s agent the context it needs to write accurate Dremio SQL, generate data pipelines, and build applications against your lakehouse. Kiro&apos;s spec-driven approach is especially well-suited for data projects: you can define your data model requirements in plain language, let Kiro generate the technical design, and then have it build the implementation with full traceability back to the original requirements.&lt;/p&gt;
&lt;p&gt;Kiro&apos;s hooks system adds event-driven automation, so documentation, tests, and validation can update automatically as your Dremio code changes.&lt;/p&gt;
&lt;p&gt;This post covers four approaches, ordered from quickest setup to most customizable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/01/kiro-dremio-architecture.png&quot; alt=&quot;Amazon Kiro AI IDE connecting to Dremio Agentic Lakehouse via MCP&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Setting Up Amazon Kiro&lt;/h2&gt;
&lt;p&gt;If you do not already have Kiro installed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Kiro&lt;/strong&gt; from &lt;a href=&quot;https://kiro.dev/&quot;&gt;kiro.dev&lt;/a&gt; (available for macOS, Linux, and Windows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign in&lt;/strong&gt; with your AWS account, Google account, or GitHub account.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open a project&lt;/strong&gt; by selecting File &amp;gt; Open Folder and pointing to your project directory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explore the interface&lt;/strong&gt; — Kiro includes a file explorer, an AI chat panel, a specs panel for viewing requirements/design/tasks, and a hooks panel for event-driven automations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Kiro is built on the VS Code platform, so existing VS Code extensions and themes are compatible. It is free to use during the preview period.&lt;/p&gt;
&lt;h2&gt;Approach 1: Connect the Dremio Cloud MCP Server&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud project ships with a built-in MCP server. Kiro supports MCP natively and integrates deeply with AWS MCP servers.&lt;/p&gt;
&lt;p&gt;For Claude-based tools, Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official Claude plugin&lt;/a&gt; with guided setup. For Kiro, you configure the MCP connection through the IDE settings or project configuration.&lt;/p&gt;
&lt;h3&gt;Find Your Project&apos;s MCP Endpoint&lt;/h3&gt;
&lt;p&gt;Log into &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio Cloud&lt;/a&gt; and navigate to &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;. Copy the MCP server URL.&lt;/p&gt;
&lt;h3&gt;Set Up OAuth in Dremio Cloud&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings &amp;gt; Organization Settings &amp;gt; OAuth Applications&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Application&lt;/strong&gt; and enter a name (e.g., &amp;quot;Kiro MCP&amp;quot;).&lt;/li&gt;
&lt;li&gt;Add the appropriate redirect URIs.&lt;/li&gt;
&lt;li&gt;Save and copy the &lt;strong&gt;Client ID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Configure Kiro&apos;s MCP Connection&lt;/h3&gt;
&lt;p&gt;In Kiro, open the MCP settings and add a new server. You can configure via the settings UI or create a &lt;code&gt;.kiro/mcp.json&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://YOUR_PROJECT_MCP_URL&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Kiro now has access to Dremio&apos;s MCP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GetUsefulSystemTableNames&lt;/strong&gt; returns available tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetSchemaOfTable&lt;/strong&gt; returns column definitions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetDescriptionOfTableOrSchema&lt;/strong&gt; pulls catalog descriptions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GetTableOrViewLineage&lt;/strong&gt; shows data lineage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RunSqlQuery&lt;/strong&gt; executes SQL and returns results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Test by asking the AI chat: &amp;quot;What tables are available in Dremio?&amp;quot;&lt;/p&gt;
&lt;h3&gt;Kiro Powers&lt;/h3&gt;
&lt;p&gt;Kiro supports &amp;quot;Powers&amp;quot; — curated bundles of MCP servers, steering files, and hooks for specific development workflows. If an AWS or community Dremio Power becomes available, you can install it from the Powers panel to get a pre-configured Dremio development environment.&lt;/p&gt;
&lt;h3&gt;Self-Hosted Alternative&lt;/h3&gt;
&lt;p&gt;For Dremio Software deployments, configure the dremio-mcp server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;run&amp;quot;, &amp;quot;--directory&amp;quot;, &amp;quot;/path/to/dremio-mcp&amp;quot;,
        &amp;quot;dremio-mcp-server&amp;quot;, &amp;quot;run&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 2: Use Kiro Specs for Dremio Context&lt;/h2&gt;
&lt;p&gt;Kiro&apos;s spec-driven development is its most distinctive feature. Instead of free-form context files, Kiro uses structured specification documents that the AI generates and maintains.&lt;/p&gt;
&lt;h3&gt;Generating Specs for a Dremio Project&lt;/h3&gt;
&lt;p&gt;Tell Kiro to create specs for your data project:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;I need a data analytics pipeline that reads from Dremio&apos;s lakehouse, transforms the data using a Medallion Architecture, and serves results through a REST API.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro generates three spec files in &lt;code&gt;.kiro/specs/&lt;/code&gt;:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;requirements.md&lt;/strong&gt; — User stories in structured format:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;1. As a data engineer, I want to ingest raw data from Dremio bronze tables
   so that I can process it through the pipeline.
2. As a data analyst, I want cleaned data in gold views
   so that I can run accurate business queries.
3. As an application developer, I want REST endpoints over gold data
   so that I can build dashboards and reports.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;design.md&lt;/strong&gt; — Technical design covering architecture, data flow, table schemas, and technology choices.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tasks.md&lt;/strong&gt; — A breakdown of implementation tasks that Kiro tracks as you build.&lt;/p&gt;
&lt;h3&gt;Adding Dremio Conventions to Specs&lt;/h3&gt;
&lt;p&gt;You can refine the generated specs with Dremio-specific conventions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Update the design to use Dremio SQL conventions: CREATE FOLDER IF NOT EXISTS, folder.subfolder.table_name paths, TIMESTAMPDIFF for durations. Use dremioframe for Python connections and environment variables for credentials.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro updates the design.md and tasks.md to reflect these conventions. All code generated from these specs will follow the conventions automatically.&lt;/p&gt;
&lt;h3&gt;Steering Files&lt;/h3&gt;
&lt;p&gt;Kiro also supports steering files — markdown documents that provide persistent context similar to &lt;code&gt;.cursorrules&lt;/code&gt; or &lt;code&gt;CLAUDE.md&lt;/code&gt;. Create a &lt;code&gt;.kiro/steering/dremio.md&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;# Dremio Conventions

## SQL
- Use CREATE FOLDER IF NOT EXISTS
- Tables: folder.subfolder.table_name
- Cast DATE to TIMESTAMP for joins
- Use TIMESTAMPDIFF for durations

## Credentials
- DREMIO_PAT and DREMIO_URI from environment variables
- Never hardcode tokens

## Terminology
- &amp;quot;Agentic Lakehouse&amp;quot; not &amp;quot;data warehouse&amp;quot;
- &amp;quot;Reflections&amp;quot; not &amp;quot;materialized views&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/aitoolblogs/01/four-integration-approaches.png&quot; alt=&quot;Four approaches to connecting AI coding tools to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Approach 3: Install Pre-Built Dremio Skills and Docs&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Official vs. Community Resources:&lt;/strong&gt; Dremio provides an &lt;a href=&quot;https://github.com/dremio/claude-plugins&quot;&gt;official plugin&lt;/a&gt; for Claude Code users and the built-in &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;Dremio Cloud MCP server&lt;/a&gt; is an official Dremio product. The repositories below, along with libraries like dremioframe, are community-supported projects from the Dremio Developer Advocacy team. They are actively maintained but not part of the core Dremio product.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;dremio-agent-skill (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-skill&quot;&gt;dremio-agent-skill&lt;/a&gt; repository provides knowledge files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-skill
cd dremio-agent-skill
./install.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Copy the knowledge directory into your project and reference it in Kiro&apos;s steering files.&lt;/p&gt;
&lt;h3&gt;dremio-agent-md (Community)&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-agent-md&quot;&gt;dremio-agent-md&lt;/a&gt; repository provides documentation sitemaps:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/developer-advocacy-dremio/dremio-agent-md
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Reference it in &lt;code&gt;.kiro/steering/dremio.md&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-markdown&quot;&gt;For SQL validation, read DREMIO_AGENT.md in ./dremio-agent-md/.
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Approach 4: Build Custom Specs and Hooks&lt;/h2&gt;
&lt;p&gt;Kiro&apos;s hooks system offers a unique approach to maintaining data project consistency.&lt;/p&gt;
&lt;h3&gt;Creating Dremio Hooks&lt;/h3&gt;
&lt;p&gt;Hooks are event-driven automations that trigger when files change. Create hooks that automatically validate Dremio SQL:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On SQL file save&lt;/strong&gt; — A hook that validates SQL syntax against Dremio conventions:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a hook that triggers when any .sql file is saved. It should read the file, validate that it uses CREATE FOLDER IF NOT EXISTS instead of CREATE SCHEMA, checks for proper table path formatting, and flags any deprecated function names.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;strong&gt;On pipeline code change&lt;/strong&gt; — A hook that updates tests:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Create a hook that triggers when any Python file in the pipelines/ directory changes. It should update the corresponding test file to match the new pipeline logic, using dremioframe mocking patterns.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Hooks keep your Dremio project self-maintaining. As code changes, documentation and tests update automatically.&lt;/p&gt;
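&lt;p&gt;A sketch of the check such a hook could run is below. The rules mirror the conventions in the steering file; the deprecated-function list is a hypothetical placeholder, and a real hook would receive the saved file&apos;s contents from Kiro and report findings back to the agent.&lt;/p&gt;

```python
# Hypothetical sketch of a save-time SQL convention check.
# The flagged-function list is a placeholder, not an official Dremio list.
import re

def check_dremio_sql(sql):
    findings = []
    if re.search(r"\bCREATE\s+SCHEMA\b", sql, re.IGNORECASE):
        findings.append("use CREATE FOLDER IF NOT EXISTS, not CREATE SCHEMA")
    for fn in ("DATEDIFF",):
        if re.search(rf"\b{fn}\s*\(", sql, re.IGNORECASE):
            findings.append(f"{fn} is flagged; prefer TIMESTAMPDIFF")
    return findings

print(check_dremio_sql("CREATE SCHEMA sales"))
```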
&lt;h3&gt;Custom Steering Files&lt;/h3&gt;
&lt;p&gt;Create comprehensive steering files in &lt;code&gt;.kiro/steering/&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.kiro/steering/
  dremio-sql.md        # SQL conventions
  dremio-python.md     # dremioframe patterns
  dremio-schemas.md    # Team table schemas
  dremio-pipeline.md   # Pipeline architecture rules
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These files are loaded into every Kiro interaction and ensure consistent code generation.&lt;/p&gt;
&lt;h2&gt;Using Dremio with Kiro: Practical Use Cases&lt;/h2&gt;
&lt;p&gt;Once Dremio is connected, Kiro&apos;s spec-driven approach creates traceable, well-documented data projects.&lt;/p&gt;
&lt;h3&gt;Ask Natural Language Questions About Your Data&lt;/h3&gt;
&lt;p&gt;In the chat panel:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;What were our top 10 products by revenue last quarter? Show growth rates and compare to the same period last year.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro uses MCP to discover tables, writes SQL, and returns results. Because it is spec-driven, Kiro can also generate a spec that documents the analysis methodology for reproducibility.&lt;/p&gt;
&lt;p&gt;Follow up:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;For the declining products, pull customer sentiment from support tickets. Is there a correlation between product issues and revenue decline?&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro tracks the analytical thread and can generate a formal analysis spec for the investigation.&lt;/p&gt;
&lt;h3&gt;Build a Locally Running Dashboard&lt;/h3&gt;
&lt;p&gt;Start with specs:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;I need a self-contained HTML dashboard showing Dremio gold-layer metrics: revenue trends, customer acquisition, and regional performance. Spec it out first, then build it.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro generates the requirements, design, and tasks first, then builds the dashboard following the specs. Every file traces back to a requirement, making it easy to review and maintain.&lt;/p&gt;
&lt;h3&gt;Create a Data Exploration App&lt;/h3&gt;
&lt;p&gt;Spec-driven app development:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Spec and build a Streamlit app connected to Dremio via dremioframe. Requirements: schema browsing, SQL query editor, data preview, CSV export. Generate all files.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro creates the full spec, then generates the application. The tasks.md tracks progress, and hooks can keep tests updated as you iterate.&lt;/p&gt;
&lt;h3&gt;Generate Data Pipeline Scripts&lt;/h3&gt;
&lt;p&gt;Spec-driven data engineering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Spec a Medallion Architecture pipeline for product_events. Requirements: bronze ingestion, silver cleaning, gold aggregation. Design should use dremioframe and follow our SQL conventions. Then implement it.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro generates the full spec suite (requirements, design, tasks), then writes the pipeline code. Every transformation traces back to a requirement, and hooks validate the SQL on every save.&lt;/p&gt;
&lt;h3&gt;Build API Endpoints Over Dremio Data&lt;/h3&gt;
&lt;p&gt;Spec-driven API development:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;quot;Spec and build a FastAPI service over Dremio gold-layer views. Requirements: customer analytics, revenue data, product metrics. Design should include Pydantic models and caching.&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kiro generates the complete API with full traceability to the requirements.&lt;/p&gt;
&lt;h2&gt;Which Approach Should You Use?&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Setup Time&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MCP Server&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Live queries, schema browsing, catalog exploration&lt;/td&gt;
&lt;td&gt;Data analysis, SQL generation, real-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kiro Specs&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;Structured requirements, design, traceable implementation&lt;/td&gt;
&lt;td&gt;Teams valuing documentation and traceability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Built Skills&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Comprehensive Dremio knowledge (CLI, SDK, SQL, API)&lt;/td&gt;
&lt;td&gt;Quick start with broad coverage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom Hooks&lt;/td&gt;
&lt;td&gt;30+ minutes&lt;/td&gt;
&lt;td&gt;Event-driven validation, auto-updating tests and docs&lt;/td&gt;
&lt;td&gt;Mature teams with CI-like automation needs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Start with the MCP server for live data access. Use Kiro&apos;s spec-driven flow for any project beyond a quick query. Add hooks for automated validation as your project matures.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Sign up for a free Dremio Cloud trial&lt;/a&gt; (30 days, $400 in compute credits).&lt;/li&gt;
&lt;li&gt;Find your project&apos;s MCP endpoint in &lt;strong&gt;Project Settings &amp;gt; Info&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Add it in Kiro&apos;s MCP settings or &lt;code&gt;.kiro/mcp.json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Tell Kiro to generate specs for your Dremio data project.&lt;/li&gt;
&lt;li&gt;Let Kiro build the implementation from the specs.&lt;/li&gt;
&lt;/ol&gt;
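&lt;p&gt;As a minimal sketch, the &lt;code&gt;.kiro/mcp.json&lt;/code&gt; entry might look like the following (the server name is illustrative and the exact key names depend on Kiro&apos;s MCP configuration schema; use the endpoint shown in your own project settings):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;dremio&amp;quot;: {
      &amp;quot;url&amp;quot;: &amp;quot;https://mcp.dremio.cloud/mcp/{project_id}&amp;quot;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;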
&lt;p&gt;Dremio&apos;s Agentic Lakehouse gives Kiro accurate data context, and Kiro&apos;s spec-driven methodology ensures every line of generated code traces back to a documented requirement. This is especially valuable for data engineering, where auditability and traceability matter.&lt;/p&gt;
&lt;p&gt;For more on the Dremio MCP Server, check out the &lt;a href=&quot;https://docs.dremio.com/current/developer/mcp-server/&quot;&gt;official documentation&lt;/a&gt; or enroll in the free &lt;a href=&quot;https://university.dremio.com/course/dremio-mcp&quot;&gt;Dremio MCP Server course&lt;/a&gt; on Dremio University.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Dremio Software to Dremio Cloud: Hybrid Federation Across Deployments</title><link>https://iceberglakehouse.com/posts/2026-03-connector-dremio-to-dremio/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-dremio-to-dremio/</guid><description>
Dremio Cloud can connect to Dremio Software (self-managed) instances as a federated data source. This creates a hybrid deployment where Dremio Cloud ...</description><pubDate>Mon, 02 Mar 2026 05:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Dremio Cloud can connect to Dremio Software (self-managed) instances as a federated data source. This creates a hybrid deployment where Dremio Cloud serves as the primary query interface while accessing datasets managed by Dremio Software instances running in your own data centers or private cloud.&lt;/p&gt;
&lt;p&gt;This connector is designed for organizations that have existing Dremio Software deployments and are adopting Dremio Cloud for new workloads, or that need to federate data across a cloud-managed Dremio platform and on-premises Dremio instances.&lt;/p&gt;
&lt;h2&gt;Why Connect Dremio Software to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;Hybrid Federation&lt;/h3&gt;
&lt;p&gt;Your Dremio Software instance manages on-premises data sources — Oracle databases, SQL Server, network-attached file storage, and internal data lakes. Dremio Cloud manages cloud-native sources — S3, BigQuery, Snowflake, and cloud-hosted databases. By connecting Dremio Software to Dremio Cloud, you can write a single SQL query that joins on-premises data (through Dremio Software) with cloud data (through Dremio Cloud).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join on-premises data via Dremio Software with cloud data in Dremio Cloud
SELECT
  cloud.customer_name,
  cloud.cloud_revenue,
  onprem.erp_balance,
  onprem.last_payment_date,
  CASE
    WHEN cloud.cloud_revenue &amp;gt; 100000 AND onprem.erp_balance &amp;lt; 5000 THEN &apos;Good Standing&apos;
    WHEN onprem.erp_balance &amp;gt; 50000 THEN &apos;At Risk&apos;
    ELSE &apos;Standard&apos;
  END AS account_health
FROM analytics.gold.cloud_customers cloud
JOIN &amp;quot;dremio-onprem&amp;quot;.onprem.erp_accounts onprem ON cloud.customer_id = onprem.customer_id
ORDER BY cloud.cloud_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Incremental Cloud Migration&lt;/h3&gt;
&lt;p&gt;Organizations don&apos;t shut down on-premises data centers overnight. Connecting Dremio Software to Dremio Cloud lets you:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Start using Dremio Cloud&lt;/strong&gt; for new cloud-native workloads&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continue using Dremio Software&lt;/strong&gt; for on-premises sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Federate across both&lt;/strong&gt; from a single Dremio Cloud interface&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gradually migrate&lt;/strong&gt; data sources from Software to Cloud as on-premises systems are decommissioned&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Consolidated Governance&lt;/h3&gt;
&lt;p&gt;Users access both on-premises and cloud data through Dremio Cloud&apos;s interface. Dremio Cloud&apos;s governance policies (column masking, row-level filtering) apply to the federated view of data, providing a single governance layer across all data.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Software instance&lt;/strong&gt; accessible from Dremio Cloud over HTTPS
&lt;ul&gt;
&lt;li&gt;Version 24.0 or later recommended&lt;/li&gt;
&lt;li&gt;Arrow Flight endpoint enabled and accessible (port 32010 or 443 with TLS)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; Username/password or Personal Access Token for the Dremio Software instance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network:&lt;/strong&gt; The Dremio Software instance must be reachable from Dremio Cloud&apos;s network. Options:
&lt;ul&gt;
&lt;li&gt;Public endpoint with TLS&lt;/li&gt;
&lt;li&gt;VPN/VPC peering&lt;/li&gt;
&lt;li&gt;AWS PrivateLink or equivalent&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-dremio-to-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Dremio Software to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio Cloud console and select &lt;strong&gt;Dremio&lt;/strong&gt; from the source types.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;dremio-onprem&lt;/code&gt; or &lt;code&gt;datacenter-west&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; The hostname or IP address of your Dremio Software coordinator node.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Arrow Flight port (typically &lt;code&gt;32010&lt;/code&gt;, or &lt;code&gt;443&lt;/code&gt; with TLS).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SSL/TLS:&lt;/strong&gt; Enable if the Software instance uses encrypted connections.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Provide credentials for a Dremio Software user account. Consider creating a dedicated service account with appropriate permissions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read access to the virtual datasets (views) and physical datasets you want to federate&lt;/li&gt;
&lt;li&gt;User impersonation support if you want Dremio Cloud queries to execute as the requesting user on Dremio Software&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. User Impersonation&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;User impersonation&lt;/strong&gt; allows Dremio Cloud to pass the identity of the requesting user to Dremio Software. When enabled, queries executed through Dremio Cloud run with the permissions of the authenticated user on the Dremio Software side. This preserves your existing Dremio Software access control policies.&lt;/p&gt;
&lt;p&gt;Without impersonation, all Cloud queries execute as the service account configured in the connection, which may have broader access than individual users should.&lt;/p&gt;
&lt;h3&gt;5. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set the Reflection refresh and metadata refresh intervals, along with any additional connection properties, then click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Querying Across Deployments&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query on-premises data through Dremio Software
SELECT
  department,
  employee_count,
  avg_salary
FROM &amp;quot;dremio-onprem&amp;quot;.hr.department_summary;

-- Join on-premises HR data with cloud-native analytics
SELECT
  d.department,
  d.employee_count,
  d.avg_salary,
  c.department_cloud_spend,
  ROUND(c.department_cloud_spend / d.employee_count, 2) AS cloud_cost_per_employee
FROM &amp;quot;dremio-onprem&amp;quot;.hr.department_summary d
JOIN analytics.gold.cloud_infrastructure_costs c ON d.department = c.department
ORDER BY cloud_cost_per_employee DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer Across Deployments&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.enterprise_360 AS
SELECT
  onprem.employee_id,
  onprem.employee_name,
  onprem.department,
  onprem.office_location,
  cloud.cloud_account_id,
  cloud.monthly_cloud_spend,
  CASE
    WHEN cloud.monthly_cloud_spend &amp;gt; 10000 THEN &apos;Heavy Cloud User&apos;
    WHEN cloud.monthly_cloud_spend &amp;gt; 1000 THEN &apos;Moderate&apos;
    ELSE &apos;Light&apos;
  END AS cloud_usage_tier
FROM &amp;quot;dremio-onprem&amp;quot;.hr.employees onprem
LEFT JOIN analytics.gold.cloud_accounts cloud ON onprem.employee_id = cloud.owner_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt; to create descriptions of the hybrid view. These descriptions power the AI features below.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics Across Deployments&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask questions spanning both on-premises and cloud data: &amp;quot;Which departments have the highest cloud cost per employee?&amp;quot; or &amp;quot;Show me heavy cloud users in the engineering department.&amp;quot; The Agent reads your semantic layer&apos;s wiki descriptions and generates SQL that joins across both Dremio deployments.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your federated data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A CTO asks Claude &amp;quot;Compare cloud infrastructure costs per department with on-premises headcount&amp;quot; and gets insights spanning both deployment models.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify departments by cloud optimization potential
SELECT
  department,
  employee_count,
  cloud_cost_per_employee,
  AI_CLASSIFY(
    &apos;Based on cloud spending patterns, classify optimization potential&apos;,
    &apos;Department: &apos; || department || &apos;, Employees: &apos; || CAST(employee_count AS VARCHAR) || &apos;, Cloud Cost/Employee: $&apos; || CAST(cloud_cost_per_employee AS VARCHAR),
    ARRAY[&apos;Well Optimized&apos;, &apos;Room for Improvement&apos;, &apos;Over-Provisioned&apos;, &apos;Needs Audit&apos;]
  ) AS optimization_status
FROM (
  SELECT
    d.department,
    d.employee_count,
    ROUND(c.department_cloud_spend / d.employee_count, 2) AS cloud_cost_per_employee
  FROM &amp;quot;dremio-onprem&amp;quot;.hr.department_summary d
  JOIN analytics.gold.cloud_infrastructure_costs c ON d.department = c.department
) AS dept_costs;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Important Considerations&lt;/h2&gt;
&lt;h3&gt;Network Latency&lt;/h3&gt;
&lt;p&gt;Cross-network queries between Dremio Cloud and on-premises Dremio Software add network latency. Optimize by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using Reflections to cache frequently accessed on-premises data in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Creating aggregated views on the Dremio Software side that pre-compute common metrics — transfer summarized data rather than raw tables&lt;/li&gt;
&lt;li&gt;Minimizing the amount of raw data transferred across the network&lt;/li&gt;
&lt;/ul&gt;
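&lt;p&gt;As a sketch, a pre-aggregated view defined on the Dremio Software side might look like this (it matches the &lt;code&gt;department_summary&lt;/code&gt; dataset queried earlier; the underlying table and column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Defined on the Dremio Software instance: only the summary rows
-- cross the network when Dremio Cloud queries this view
CREATE VIEW hr.department_summary AS
SELECT
  department,
  COUNT(*) AS employee_count,
  AVG(salary) AS avg_salary
FROM hr.employees
GROUP BY department;
&lt;/code&gt;&lt;/pre&gt;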
&lt;h3&gt;Cloud Egress Costs&lt;/h3&gt;
&lt;p&gt;Data returned from Dremio Software to Dremio Cloud may incur cloud egress charges if the Software instance runs in a different network or cloud provider. Strategies to minimize egress:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Build pre-aggregated views on the Software side&lt;/li&gt;
&lt;li&gt;Use Reflections to cache results (data transfers once per refresh, not per query)&lt;/li&gt;
&lt;li&gt;Filter data as close to the source as possible&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Version Compatibility&lt;/h3&gt;
&lt;p&gt;Keep Dremio Software at version 24.0 or later for best compatibility with Dremio Cloud. Older versions may have limited feature support through the federation connector.&lt;/p&gt;
&lt;h3&gt;Security&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Enable TLS for all connections between Dremio Cloud and Software&lt;/li&gt;
&lt;li&gt;Use a dedicated service account with minimal necessary permissions&lt;/li&gt;
&lt;li&gt;Enable user impersonation for proper access control propagation&lt;/li&gt;
&lt;li&gt;Consider network-level security (VPN, PrivateLink) for on-premises connections&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Monitoring and Troubleshooting&lt;/h3&gt;
&lt;p&gt;Monitor the health and performance of your hybrid deployment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query profiles:&lt;/strong&gt; Use Dremio Cloud&apos;s query profiler to identify slow cross-deployment queries. Look for high data transfer volumes that suggest missing Reflections.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata refresh timing:&lt;/strong&gt; If Dremio Cloud shows stale schema from Dremio Software, decrease the metadata refresh interval.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection pool management:&lt;/strong&gt; For high-concurrency workloads, monitor connection usage between Cloud and Software. Increase the maximum idle connections if you see connection timeout errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency benchmarks:&lt;/strong&gt; Establish baseline latency for cross-deployment queries. If latency degrades, check network connectivity and consider adding Reflections to cache frequently accessed data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Track these metrics to ensure your hybrid architecture delivers consistent performance as usage grows.&lt;/p&gt;
&lt;h2&gt;Governance Across Deployments&lt;/h2&gt;
&lt;p&gt;Dremio Cloud&apos;s Fine-Grained Access Control (FGAC) applies governance to the federated view of data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive on-premises fields (employee SSN, salary) from specific Cloud user roles&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional Cloud users see only their region&apos;s data from on-premises sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance rules apply whether data comes from the Software instance, Cloud sources, or external databases&lt;/li&gt;
&lt;/ul&gt;
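&lt;p&gt;A hedged sketch of how such policies are expressed in Dremio SQL (the function names, role, and columns are illustrative; consult the FGAC documentation for the exact syntax supported in your edition):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Masking UDF: only members of the HR role see raw salaries
CREATE FUNCTION protect_salary (salary DOUBLE)
RETURNS DOUBLE
RETURN SELECT CASE WHEN is_member(&apos;HR&apos;) THEN salary ELSE NULL END;

ALTER TABLE &amp;quot;dremio-onprem&amp;quot;.hr.employees
MODIFY COLUMN salary SET MASKING POLICY protect_salary (salary);

-- Row access policy: users see only rows for regions they belong to
CREATE FUNCTION region_filter (office_location VARCHAR)
RETURNS BOOLEAN
RETURN SELECT is_member(office_location);

ALTER TABLE &amp;quot;dremio-onprem&amp;quot;.hr.employees
ADD ROW ACCESS POLICY region_filter (office_location);
&lt;/code&gt;&lt;/pre&gt;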
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;BI tools connected to Dremio Cloud via Arrow Flight access both cloud and on-premises data through a single connection:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector — one connection serves data from both deployments&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; client for programmatic access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer — regardless of where the source data resides.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration enables developers to query federated data from their IDE. Ask Copilot &amp;quot;Compare cloud costs per department with on-premises headcount&amp;quot; and it generates SQL using your semantic layer that spans both deployments.&lt;/p&gt;
&lt;h2&gt;Reflections for Hybrid Optimization&lt;/h2&gt;
&lt;p&gt;Create Reflections on hybrid views to cache cross-deployment query results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build views that join Cloud and Software data&lt;/li&gt;
&lt;li&gt;Create Reflections on those views&lt;/li&gt;
&lt;li&gt;Set refresh intervals based on how frequently the underlying on-premises data changes&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After creation, dashboard queries that span both deployments are served from Dremio Cloud&apos;s Reflection cache — eliminating network latency for repeat queries.&lt;/p&gt;
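&lt;p&gt;The steps above can be sketched in SQL against the &lt;code&gt;enterprise_360&lt;/code&gt; view defined earlier (the reflection name is illustrative, and the exact DDL varies by Dremio version; reflections can also be created from the UI):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Cache the cross-deployment view in Dremio Cloud&apos;s Reflection store
ALTER DATASET analytics.gold.enterprise_360
CREATE RAW REFLECTION enterprise_360_raw
USING DISPLAY (employee_id, employee_name, department, monthly_cloud_spend, cloud_usage_tier);
&lt;/code&gt;&lt;/pre&gt;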
&lt;h2&gt;Migration Planning: Software to Cloud&lt;/h2&gt;
&lt;p&gt;Use the Dremio-to-Dremio connector as a migration bridge:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 — Federation:&lt;/strong&gt; Connect Dremio Software to Dremio Cloud. All existing Software views remain accessible from Cloud.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 — Parallel Development:&lt;/strong&gt; Build new views and Reflections in Dremio Cloud while continuing to maintain Software views.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 3 — Source Migration:&lt;/strong&gt; Gradually move individual data sources (PostgreSQL, Oracle, S3) from Software connections to Cloud connections. Update views to reference Cloud-native sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 4 — Decommission:&lt;/strong&gt; Once all sources are connected to Cloud, remove the Dremio Software connection.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;During the migration, users experience no disruption — they continue querying through Dremio Cloud while the underlying sources are being transitioned.&lt;/p&gt;
&lt;h2&gt;Common Deployment Architectures&lt;/h2&gt;
&lt;h3&gt;Hub-and-Spoke Model&lt;/h3&gt;
&lt;p&gt;Dremio Cloud serves as the central hub, with multiple Dremio Software instances as spokes. Each spoke manages a specific data center or business unit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Spoke A:&lt;/strong&gt; Finance data center (Oracle, SQL Server, DB2)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spoke B:&lt;/strong&gt; Manufacturing data center (SAP HANA, PostgreSQL)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spoke C:&lt;/strong&gt; Research data center (S3, MongoDB)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio Cloud federates across all spokes, providing a single analytics interface for the entire organization.&lt;/p&gt;
&lt;h3&gt;Staged Migration Model&lt;/h3&gt;
&lt;p&gt;For organizations migrating to the cloud in waves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Wave 1:&lt;/strong&gt; Non-sensitive workloads migrate to Cloud with direct source connections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wave 2:&lt;/strong&gt; Sensitive workloads use Software as a proxy (governance-compliant data access)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wave 3:&lt;/strong&gt; Remaining workloads migrate as regulatory and security requirements are met&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Disaster Recovery Model&lt;/h3&gt;
&lt;p&gt;Dremio Software serves as a fallback if Cloud connectivity is temporarily unavailable. On-premises critical workloads run against Software; Cloud handles all other analytics. This architecture provides business continuity for mission-critical dashboards and reports.&lt;/p&gt;
&lt;h2&gt;Performance Best Practices&lt;/h2&gt;
&lt;p&gt;Maximize hybrid performance with these strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pre-aggregate on Software side:&lt;/strong&gt; Build views in Dremio Software that SUM, COUNT, and AVG at the granularity Cloud queries need. Transfer megabytes, not gigabytes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Reflections aggressively:&lt;/strong&gt; Create Reflections on every cross-deployment view; once results are cached, repeat queries avoid cross-network latency entirely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schedule Reflection refreshes strategically:&lt;/strong&gt; Refresh during off-peak hours when network bandwidth is available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor data transfer volumes:&lt;/strong&gt; Use Dremio Cloud&apos;s query profiler to identify queries that transfer large volumes from Software. Convert these to Reflections first.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Organizations can seamlessly federate across Dremio deployments, enable AI analytics on combined on-premises and cloud data, and migrate incrementally to the cloud — all while maintaining unified governance. The Dremio-to-Dremio connector is the bridge that makes hybrid lakehouse analytics practical.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re running a single Dremio Software instance in one data center or managing multiple Software installations across global facilities, Dremio Cloud provides a unified analytical interface. Combine the raw data processing power of on-premises Dremio Software with the AI capabilities, Reflections, and managed infrastructure of Dremio Cloud. The result is a truly hybrid analytics platform that grows with your cloud migration at whatever pace your organization requires. No rip-and-replace, no big-bang migration — just a gradual, governed transition that protects your existing investments.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-dremio-to-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your existing Dremio Software instances.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Dremio&apos;s Built-in Open Catalog: Your Zero-Configuration Apache Iceberg Lakehouse</title><link>https://iceberglakehouse.com/posts/2026-03-connector-dremio-open-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-dremio-open-catalog/</guid><description>
Every Dremio Cloud account starts with a built-in Open Catalog — a fully managed Apache Iceberg catalog with integrated storage. When you create a Dr...</description><pubDate>Mon, 02 Mar 2026 04:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every Dremio Cloud account starts with a built-in Open Catalog — a fully managed Apache Iceberg catalog with integrated storage. When you create a Dremio Cloud project, you immediately have a catalog where you can create namespaces (folders), tables, and views without connecting any external sources, configuring storage, or setting up credentials.&lt;/p&gt;
&lt;p&gt;This isn&apos;t a bare-bones starting point. The built-in Open Catalog is a production-grade Iceberg catalog with automated performance management, Autonomous Reflections, time travel, branching, and full DML support. It&apos;s the fastest path from &amp;quot;sign up&amp;quot; to &amp;quot;running analytics.&amp;quot;&lt;/p&gt;
&lt;p&gt;Organizations typically spend days or weeks setting up external catalogs — provisioning S3 buckets, configuring IAM roles, debugging credential chains, and testing connectivity. With the built-in Open Catalog, you skip all of that. Your first &lt;code&gt;CREATE TABLE&lt;/code&gt; runs minutes after account creation.&lt;/p&gt;
&lt;p&gt;The Open Catalog is particularly powerful for teams adopting a lakehouse architecture for the first time. Instead of evaluating AWS Glue, Unity Catalog, and Snowflake Open Catalog (each with different setup complexity, vendor dependencies, and pricing models), start with the built-in catalog. You can always connect external catalogs later and federate across them.&lt;/p&gt;
&lt;h3&gt;Cross-Catalog Federation&lt;/h3&gt;
&lt;p&gt;The Open Catalog works alongside external catalogs. A common architecture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open Catalog:&lt;/strong&gt; Dremio-created analytical tables and views (gold layer, semantic layer)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Glue:&lt;/strong&gt; Existing Iceberg tables managed by Spark/EMR pipelines&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL:&lt;/strong&gt; Operational application data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio federates across all three in a single SQL query. Views in the Open Catalog can reference tables from any connected source, creating a unified analytical layer that spans your entire data estate.&lt;/p&gt;
&lt;h3&gt;Branching and Tagging&lt;/h3&gt;
&lt;p&gt;The Open Catalog supports Iceberg&apos;s branching and tagging capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Branches:&lt;/strong&gt; Create isolated copies of table metadata for development and testing. Changes on a branch don&apos;t affect the main table until merged.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tags:&lt;/strong&gt; Create named snapshots for milestone tracking (e.g., &lt;code&gt;quarterly-report-2024-Q2&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These features enable data engineering workflows where teams can test transformations on branches before promoting changes to production tables.&lt;/p&gt;
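&lt;p&gt;A minimal sketch of this workflow in Dremio SQL (the catalog name &lt;code&gt;my_catalog&lt;/code&gt; and branch name are illustrative; check the versioned-catalog SQL reference for your release):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create an isolated branch for testing
CREATE BRANCH etl_dev IN my_catalog;

-- Changes made AT BRANCH do not affect main until merged
UPDATE analytics.silver.customers AT BRANCH etl_dev
SET segment = &apos;Enterprise&apos;
WHERE total_spend &amp;gt; 100000;

-- Promote the validated changes to main
MERGE BRANCH etl_dev INTO main IN my_catalog;

-- Pin a named snapshot for milestone tracking
CREATE TAG &amp;quot;quarterly-report-2024-Q2&amp;quot; IN my_catalog;
&lt;/code&gt;&lt;/pre&gt;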
&lt;h2&gt;Why Start with the Built-in Open Catalog&lt;/h2&gt;
&lt;h3&gt;Zero Configuration&lt;/h3&gt;
&lt;p&gt;External catalogs (Glue, Unity, Snowflake Open Catalog) require AWS IAM roles, network configuration, credential management, and catalog-specific setup. The built-in Open Catalog requires nothing — it&apos;s already configured when your project is created. Create a folder, write SQL, and start working.&lt;/p&gt;
&lt;h3&gt;Automated Performance Management&lt;/h3&gt;
&lt;p&gt;Dremio automatically manages the performance of Iceberg tables in the built-in catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Auto-compaction:&lt;/strong&gt; Small files are automatically merged into optimally sized files (typically 256MB). This prevents the &amp;quot;small file problem&amp;quot; that degrades query performance over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest rewriting:&lt;/strong&gt; Table manifests are automatically optimized for faster metadata reads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data clustering:&lt;/strong&gt; Dremio sorts data based on query patterns to improve predicate pushdown efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vacuuming:&lt;/strong&gt; Expired snapshots and orphaned data files are automatically cleaned up.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Results caching:&lt;/strong&gt; Query results are cached and served for identical subsequent queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of these require manual &lt;code&gt;OPTIMIZE&lt;/code&gt; commands or scheduled maintenance jobs. Dremio handles it all in the background.&lt;/p&gt;
&lt;h3&gt;Autonomous Reflections&lt;/h3&gt;
&lt;p&gt;For tables in the built-in catalog, Dremio can automatically create and manage Reflections based on observed query patterns. If a specific view is queried frequently with certain filters and aggregations, Dremio creates a Reflection to accelerate those patterns without any manual configuration. This automated acceleration means your most common queries get faster over time.&lt;/p&gt;
&lt;h3&gt;Time Travel&lt;/h3&gt;
&lt;p&gt;Query any table as it existed at any point in the past:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query the table as of a specific timestamp
SELECT * FROM catalog_folder.my_table AT TIMESTAMP &apos;2024-06-01 00:00:00&apos;;

-- Query a specific snapshot
SELECT * FROM catalog_folder.my_table AT SNAPSHOT &apos;1234567890123456789&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Time travel is valuable for auditing (&amp;quot;What did customer balances look like at quarter end?&amp;quot;), debugging (&amp;quot;What changed in the last 24 hours?&amp;quot;), and compliance (&amp;quot;Show me the data as it was on the regulatory reporting date&amp;quot;).&lt;/p&gt;
&lt;h3&gt;Full DML Support&lt;/h3&gt;
&lt;p&gt;The built-in catalog supports all standard DML operations:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- INSERT
INSERT INTO analytics.bronze.events
SELECT event_type, user_id, event_timestamp
FROM &amp;quot;s3-datalake&amp;quot;.events.raw_events
WHERE event_date = CURRENT_DATE - INTERVAL &apos;1&apos; DAY;

-- UPDATE
UPDATE analytics.silver.customers
SET segment = &apos;Enterprise&apos;
WHERE total_spend &amp;gt; 100000;

-- DELETE
DELETE FROM analytics.bronze.events
WHERE event_timestamp &amp;lt; CURRENT_DATE - INTERVAL &apos;365&apos; DAY;

-- MERGE (upsert)
MERGE INTO analytics.silver.customers AS target
USING (
  SELECT customer_id, SUM(amount) AS total_spend
  FROM analytics.bronze.orders
  GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET total_spend = source.total_spend
WHEN NOT MATCHED THEN INSERT (customer_id, total_spend) VALUES (source.customer_id, source.total_spend);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Getting Started: Create Your First Tables&lt;/h2&gt;
&lt;p&gt;When you query items in the built-in catalog, you don&apos;t include a source name prefix — just the folder path and table/view name:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create namespace structure
CREATE FOLDER IF NOT EXISTS analytics;
CREATE FOLDER IF NOT EXISTS analytics.bronze;
CREATE FOLDER IF NOT EXISTS analytics.silver;
CREATE FOLDER IF NOT EXISTS analytics.gold;

-- Create a table from an external source
CREATE TABLE analytics.bronze.raw_orders AS
SELECT order_id, customer_id, product_id, quantity, price, order_date
FROM &amp;quot;postgres-orders&amp;quot;.public.orders
WHERE order_date &amp;gt;= &apos;2024-01-01&apos;;

-- Create a transformed table
CREATE TABLE analytics.silver.enriched_orders AS
SELECT
  o.order_id,
  o.customer_id,
  c.customer_name,
  c.region,
  o.product_id,
  p.product_name,
  p.category,
  o.quantity,
  o.price,
  o.quantity * o.price AS total_amount,
  o.order_date
FROM analytics.bronze.raw_orders o
JOIN &amp;quot;postgres-orders&amp;quot;.public.customers c ON o.customer_id = c.customer_id
JOIN &amp;quot;postgres-orders&amp;quot;.public.products p ON o.product_id = p.product_id;

-- Create an analytics view
CREATE VIEW analytics.gold.revenue_summary AS
SELECT
  region,
  category,
  DATE_TRUNC(&apos;month&apos;, order_date) AS month,
  SUM(total_amount) AS revenue,
  COUNT(*) AS orders,
  COUNT(DISTINCT customer_id) AS unique_customers,
  ROUND(SUM(total_amount) / COUNT(DISTINCT customer_id), 2) AS revenue_per_customer
FROM analytics.silver.enriched_orders
GROUP BY region, category, DATE_TRUNC(&apos;month&apos;, order_date);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.product_performance AS
SELECT
  category,
  product_name,
  SUM(total_amount) AS revenue,
  COUNT(*) AS orders,
  CASE
    WHEN SUM(total_amount) &amp;gt; 100000 THEN &apos;Top Performer&apos;
    WHEN SUM(total_amount) &amp;gt; 10000 THEN &apos;Solid&apos;
    ELSE &apos;Emerging&apos;
  END AS performance_tier
FROM analytics.silver.enriched_orders
GROUP BY category, product_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. These descriptions power AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent reads your semantic layer to answer questions in plain English: &amp;quot;What&apos;s our top performing product category this quarter?&amp;quot; or &amp;quot;Show me revenue per customer by region.&amp;quot; The wiki descriptions you create tell the Agent what &amp;quot;top performing&amp;quot; and &amp;quot;revenue per customer&amp;quot; mean, generating accurate SQL automatically.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects external AI tools to your catalog data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client (Claude, ChatGPT)&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A VP of Product asks Claude &amp;quot;Compare our product category performance and identify emerging categories with high growth potential&amp;quot; and gets governed, accurate analysis.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate product analysis with AI
SELECT
  product_name,
  performance_tier,
  revenue,
  AI_GENERATE(
    &apos;Write a one-sentence growth strategy for this product&apos;,
    &apos;Product: &apos; || product_name || &apos;, Category: &apos; || category || &apos;, Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Tier: &apos; || performance_tier
  ) AS growth_strategy
FROM analytics.gold.product_performance;

-- Classify products for portfolio management
SELECT
  product_name,
  AI_CLASSIFY(
    &apos;Based on revenue and order volume, classify investment priority&apos;,
    &apos;Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Orders: &apos; || CAST(orders AS VARCHAR),
    ARRAY[&apos;Strategic Investment&apos;, &apos;Maintain&apos;, &apos;Optimize&apos;, &apos;Sunset&apos;]
  ) AS investment_priority
FROM analytics.gold.product_performance;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Built-in vs. External Catalogs&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Built-in Open Catalog&lt;/th&gt;
&lt;th&gt;External Catalogs (Glue, Unity, etc.)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;Zero configuration&lt;/td&gt;
&lt;td&gt;Requires IAM, networking, credentials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-compaction&lt;/td&gt;
&lt;td&gt;✅ Automatic&lt;/td&gt;
&lt;td&gt;✅ For Iceberg tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autonomous Reflections&lt;/td&gt;
&lt;td&gt;✅ Automatic&lt;/td&gt;
&lt;td&gt;Manual Reflections only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time travel&lt;/td&gt;
&lt;td&gt;✅ Full support&lt;/td&gt;
&lt;td&gt;✅ For Iceberg tables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write support&lt;/td&gt;
&lt;td&gt;✅ Full DML&lt;/td&gt;
&lt;td&gt;Varies by catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credential management&lt;/td&gt;
&lt;td&gt;None needed&lt;/td&gt;
&lt;td&gt;IAM roles or keys required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage costs&lt;/td&gt;
&lt;td&gt;Included in Dremio&lt;/td&gt;
&lt;td&gt;Separate cloud storage costs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The built-in catalog is ideal for getting started, prototyping, and production workloads. External catalogs are valuable when your organization already manages data in Glue, Unity, or Snowflake Open Catalog and wants to query that data through Dremio.&lt;/p&gt;
&lt;h2&gt;Governance in the Open Catalog&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) provides enterprise-grade governance on all Open Catalog data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive columns (customer PII, financial details) from specific roles. A data analyst sees &lt;code&gt;customer_name&lt;/code&gt; but not &lt;code&gt;social_security_number&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Automatically restrict data visibility based on user roles. A regional manager querying &lt;code&gt;revenue_summary&lt;/code&gt; sees only their region.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; The same governance policies apply across Open Catalog tables, external catalogs, and database sources — one set of rules for all data.&lt;/li&gt;
&lt;/ul&gt;
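&lt;p&gt;As a hedged sketch of what these policies look like in SQL (the function, mapping table, and role names below are hypothetical — consult Dremio&apos;s FGAC documentation for the exact DDL), a row-level filter is a boolean UDF attached to a table as a row access policy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hypothetical policy: admins see every row, others only their own region
CREATE FUNCTION region_policy (region VARCHAR)
RETURNS BOOLEAN
RETURN SELECT CASE
  WHEN is_member(&apos;ADMIN&apos;) THEN TRUE
  ELSE region IN (SELECT region FROM security.user_regions WHERE username = query_user())
END;

-- Attach the policy; it is evaluated on every query against the table
ALTER TABLE analytics.gold.revenue_summary ADD ROW ACCESS POLICY region_policy (region);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Column masking follows the same pattern: a UDF that returns either the original or a masked value, attached with &lt;code&gt;ALTER TABLE ... MODIFY COLUMN ... SET MASKING POLICY&lt;/code&gt;.&lt;/p&gt;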
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent queries, and MCP Server interactions.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Arrow Flight connector provides 10-100x faster data transfer than JDBC/ODBC. After building views in the Open Catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use Dremio&apos;s ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; for high-speed programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Autonomous Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Open Catalog data from their IDE. Ask Copilot &amp;quot;Show me this week&apos;s revenue by product category&amp;quot; and it generates SQL using your semantic layer — without switching to the Dremio console.&lt;/p&gt;
&lt;h2&gt;Data Lifecycle Management&lt;/h2&gt;
&lt;p&gt;The Open Catalog supports a complete data lifecycle:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Bronze layer:&lt;/strong&gt; Ingest raw data from external sources using &lt;code&gt;CREATE TABLE ... AS SELECT&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silver layer:&lt;/strong&gt; Apply transformations, deduplication, and type casting with &lt;code&gt;CREATE TABLE ... AS SELECT&lt;/code&gt; or &lt;code&gt;MERGE&lt;/code&gt; for incremental updates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gold layer:&lt;/strong&gt; Create analytical views with business logic for the semantic layer&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Archival:&lt;/strong&gt; Use &lt;code&gt;DELETE&lt;/code&gt; with time-based conditions to remove old data; use time travel to access historical snapshots before deletion&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This medallion architecture runs entirely within Dremio — no external ETL tools, Spark clusters, or scheduled scripts needed.&lt;/p&gt;
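&lt;p&gt;The archival step in particular reduces to a single statement. As a hedged sketch against the &lt;code&gt;enriched_orders&lt;/code&gt; table used earlier (the two-year retention window is an assumption, not a Dremio default):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Preview how many rows the retention policy would remove
SELECT COUNT(*) AS rows_to_archive
FROM analytics.silver.enriched_orders
WHERE order_date &amp;lt; CURRENT_DATE - INTERVAL &apos;2&apos; YEAR;

-- Remove them; prior snapshots stay queryable via time travel until vacuumed
DELETE FROM analytics.silver.enriched_orders
WHERE order_date &amp;lt; CURRENT_DATE - INTERVAL &apos;2&apos; YEAR;
&lt;/code&gt;&lt;/pre&gt;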
&lt;h3&gt;Incremental Loading Patterns&lt;/h3&gt;
&lt;p&gt;For ongoing data ingestion, use &lt;code&gt;MERGE&lt;/code&gt; to incrementally update tables without full reloads:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Incremental merge: only update changed records
MERGE INTO analytics.silver.customers AS target
USING (
  SELECT customer_id, customer_name, email, segment, updated_at
  FROM &amp;quot;postgres-crm&amp;quot;.public.customers
  WHERE updated_at &amp;gt; (SELECT MAX(updated_at) FROM analytics.silver.customers)
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
  customer_name = source.customer_name,
  email = source.email,
  segment = source.segment,
  updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, email, segment, updated_at)
  VALUES (source.customer_id, source.customer_name, source.email, source.segment, source.updated_at);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern transfers only changed records, minimizing network traffic and compute costs.&lt;/p&gt;
&lt;h3&gt;Time Travel Best Practices&lt;/h3&gt;
&lt;p&gt;Time travel is particularly valuable in the Open Catalog for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;End-of-quarter reporting:&lt;/strong&gt; Query tables at exact quarter-end timestamps for regulatory submissions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Debugging data issues:&lt;/strong&gt; Compare current data with a previous snapshot to identify when and what changed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Audit trails:&lt;/strong&gt; Demonstrate data state at any point in time for compliance requirements&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recovery:&lt;/strong&gt; If a bad &lt;code&gt;UPDATE&lt;/code&gt; or &lt;code&gt;DELETE&lt;/code&gt; corrupts data, query the pre-change snapshot and restore&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Compare the current table state with an earlier snapshot to find changed records
SELECT current_data.customer_id, current_data.segment AS new_segment, old_data.segment AS old_segment
FROM analytics.silver.customers current_data
JOIN analytics.silver.customers AT TIMESTAMP &apos;2024-06-14 00:00:00&apos; old_data
  ON current_data.customer_id = old_data.customer_id
WHERE current_data.segment &amp;lt;&amp;gt; old_data.segment;
&lt;/code&gt;&lt;/pre&gt;
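&lt;p&gt;For the recovery case, one hedged approach is an anti-join: re-insert rows that exist in the pre-change snapshot but are missing from the current table (the timestamp reuses the illustrative value from the query above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Restore rows lost to a bad DELETE from the pre-change snapshot
INSERT INTO analytics.silver.customers
SELECT old_data.*
FROM analytics.silver.customers AT TIMESTAMP &apos;2024-06-14 00:00:00&apos; old_data
LEFT JOIN analytics.silver.customers current_data
  ON old_data.customer_id = current_data.customer_id
WHERE current_data.customer_id IS NULL;
&lt;/code&gt;&lt;/pre&gt;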
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Every Dremio Cloud account includes the Open Catalog, ready to go. No setup, no configuration, no external dependencies. Create your first table in under a minute.&lt;/p&gt;
&lt;p&gt;The Open Catalog isn&apos;t just for prototyping — it&apos;s production-grade from day one. Organizations run terabyte-scale analytical workloads on the built-in catalog with automated performance management handling compaction, vacuuming, and Reflection optimization in the background. Start small with a few tables, then scale to hundreds of tables and dozens of users as your lakehouse grows. The same zero-configuration promise holds at scale.&lt;/p&gt;
&lt;p&gt;For teams new to the lakehouse concept, the Open Catalog is the lowest-friction entry point available. Data engineers familiar with SQL can build a complete medallion architecture (bronze → silver → gold) in a single afternoon, with AI capabilities and governance ready to activate immediately.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-dremio-open-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Sign up for Dremio Cloud free for 30 days&lt;/a&gt; and start building your lakehouse immediately.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Any Iceberg REST Catalog to Dremio Cloud: Universal Lakehouse Access</title><link>https://iceberglakehouse.com/posts/2026-03-connector-iceberg-rest-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-iceberg-rest-catalog/</guid><description>
The Apache Iceberg REST Catalog specification defines a standard HTTP API for managing Iceberg table metadata. Any catalog implementation that confor...</description><pubDate>Mon, 02 Mar 2026 03:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Apache Iceberg REST Catalog specification defines a standard HTTP API for managing Iceberg table metadata. Any catalog implementation that conforms to this specification — Apache Polaris, Amazon S3 Tables, Confluent Tableflow, Tabular, Apache Gravitino, and custom-built services — can connect to Dremio Cloud through a single connector type.&lt;/p&gt;
&lt;p&gt;This is the most flexible catalog connector Dremio offers. Instead of needing a purpose-built connector for every catalog vendor, the Iceberg REST Catalog connector works with any compliant implementation. As new catalogs emerge — and they&apos;re emerging rapidly in the open lakehouse ecosystem — this connector ensures Dremio supports them from day one.&lt;/p&gt;
&lt;p&gt;The Iceberg REST specification is becoming the universal standard for lakehouse catalog interoperability. AWS launched Amazon S3 Tables (a fully managed Iceberg catalog with REST API) in late 2024, Confluent released Tableflow for streaming-to-Iceberg ingestion, and Apache Gravitino provides multi-catalog governance. All of these work with Dremio&apos;s REST Catalog connector without any Dremio-side code changes.&lt;/p&gt;
&lt;h3&gt;Credential Vending Advantage&lt;/h3&gt;
&lt;p&gt;Many REST catalogs support credential vending — the ability to issue temporary, scoped storage credentials to clients. When configured, Dremio receives short-lived tokens that grant access only to the specific data files needed for a query. This eliminates the need to store long-lived S3 access keys or Azure storage keys in Dremio&apos;s connection configuration, significantly reducing the security surface area. One REST catalog connection replaces what would otherwise require separate storage credentials for every S3 bucket, Azure container, or GCS bucket containing your Iceberg tables.&lt;/p&gt;
&lt;h2&gt;Why Iceberg REST Catalog Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Universal Compatibility&lt;/h3&gt;
&lt;p&gt;The Iceberg REST Catalog connector works with any catalog implementation that conforms to the Iceberg REST API spec. This includes:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Catalog&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Credential Vending&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Apache Polaris&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon S3 Tables&lt;/td&gt;
&lt;td&gt;AWS managed&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confluent Tableflow&lt;/td&gt;
&lt;td&gt;Confluent managed&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tabular&lt;/td&gt;
&lt;td&gt;SaaS&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apache Gravitino&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom REST implementations&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You&apos;re not locked into specific catalog vendors. Deploy Apache Polaris today, consider S3 Tables tomorrow — the same Dremio connector works for both.&lt;/p&gt;
&lt;h3&gt;Read and Write Support&lt;/h3&gt;
&lt;p&gt;Dremio supports full DML (INSERT, UPDATE, DELETE, MERGE) on Iceberg tables managed by REST catalogs. You can create tables, run transformations, build data pipelines, and maintain your lakehouse entirely through Dremio&apos;s SQL engine. No need for separate Spark clusters or ETL jobs for routine operations.&lt;/p&gt;
&lt;h3&gt;Multi-Catalog Federation&lt;/h3&gt;
&lt;p&gt;Connect multiple REST catalogs alongside databases (PostgreSQL, MySQL, Oracle), object storage (S3, Azure), cloud warehouses (Snowflake, BigQuery), and other catalogs (Glue, Unity) — then query across all of them in a single SQL statement.&lt;/p&gt;
&lt;h3&gt;Automated Iceberg Maintenance&lt;/h3&gt;
&lt;p&gt;Dremio automatically compacts small files, rewrites manifests for faster metadata reads, and clusters data based on query patterns — even for tables managed by external REST catalogs.&lt;/p&gt;
&lt;h3&gt;Multiple Authentication Methods&lt;/h3&gt;
&lt;p&gt;The connector supports Bearer Token, OAuth 2.0 (client credentials flow), and custom authentication headers, accommodating the security requirements of different catalog implementations.&lt;/p&gt;
&lt;h3&gt;Flexible Storage Credential Management&lt;/h3&gt;
&lt;p&gt;Some REST catalogs vend temporary storage credentials (short-lived S3/Azure/GCS tokens) for reading and writing data files. Dremio supports credential vending where available. When a catalog doesn&apos;t vend credentials, you can configure storage access directly in Dremio.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;REST Catalog endpoint URL&lt;/strong&gt; — the base URL of the catalog API (e.g., &lt;code&gt;https://my-polaris.example.com/api/catalog&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication credentials&lt;/strong&gt; — Bearer token, OAuth client ID/secret, or custom headers depending on the catalog&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage access&lt;/strong&gt; — either through credential vending (catalog provides temporary tokens) or direct storage credentials (S3, Azure, GCS)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-iceberg-rest-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;Iceberg REST Catalog&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;polaris-catalog&lt;/code&gt; or &lt;code&gt;s3-tables&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog Endpoint URL:&lt;/strong&gt; The base URL for the REST API.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Choose from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bearer Token:&lt;/strong&gt; For token-based authentication (e.g., personal access tokens).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OAuth 2.0:&lt;/strong&gt; Client ID and client secret for OAuth client credentials flow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;None:&lt;/strong&gt; For catalogs that use other authentication methods (configured via custom headers).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Storage&lt;/h3&gt;
&lt;p&gt;If credential vending is supported, Dremio receives temporary credentials automatically. Otherwise, configure S3 (access key/secret or IAM role), Azure (shared key or service principal), or GCS (service account key) credentials.&lt;/p&gt;
&lt;h3&gt;5. Advanced Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Custom Headers:&lt;/strong&gt; Additional HTTP headers required by the catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query Parameters:&lt;/strong&gt; URL parameters appended to catalog API requests.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog-specific properties:&lt;/strong&gt; Key-value pairs for vendor-specific configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query and Write to REST Catalog Tables&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query Iceberg tables from a REST catalog
SELECT order_id, customer_id, order_total, order_date
FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders
WHERE order_date &amp;gt;= &apos;2024-01-01&apos; AND order_total &amp;gt; 100
ORDER BY order_total DESC;

-- Write to the catalog
INSERT INTO &amp;quot;rest-catalog&amp;quot;.analytics.daily_summary
SELECT
  DATE_TRUNC(&apos;day&apos;, order_date) AS day,
  COUNT(*) AS order_count,
  SUM(order_total) AS revenue,
  AVG(order_total) AS avg_order_value
FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders
WHERE order_date = CURRENT_DATE - INTERVAL &apos;1&apos; DAY
GROUP BY 1;

-- MERGE for upserts
MERGE INTO &amp;quot;rest-catalog&amp;quot;.analytics.customer_metrics AS target
USING (
  SELECT customer_id, COUNT(*) AS orders, SUM(order_total) AS total_spent
  FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders
  WHERE order_date &amp;gt;= CURRENT_DATE - INTERVAL &apos;30&apos; DAY
  GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET orders = source.orders, total_spent = source.total_spent
WHEN NOT MATCHED THEN INSERT (customer_id, orders, total_spent) VALUES (source.customer_id, source.orders, source.total_spent);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join REST catalog data with PostgreSQL and S3
SELECT
  rc.order_id,
  rc.order_total,
  pg.customer_name,
  pg.region,
  s3.support_tickets
FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders rc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON rc.customer_id = pg.customer_id
LEFT JOIN &amp;quot;s3-support&amp;quot;.tickets.customer_counts s3 ON rc.customer_id = s3.customer_id
WHERE rc.order_total &amp;gt; 500
ORDER BY rc.order_total DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_value AS
SELECT
  rc.customer_id,
  pg.customer_name,
  pg.region,
  SUM(rc.order_total) AS lifetime_value,
  COUNT(*) AS total_orders,
  ROUND(AVG(rc.order_total), 2) AS avg_order_value,
  CASE
    WHEN SUM(rc.order_total) &amp;gt; 50000 THEN &apos;Platinum&apos;
    WHEN SUM(rc.order_total) &amp;gt; 10000 THEN &apos;Gold&apos;
    WHEN SUM(rc.order_total) &amp;gt; 1000 THEN &apos;Silver&apos;
    ELSE &apos;Bronze&apos;
  END AS value_tier
FROM &amp;quot;rest-catalog&amp;quot;.ecommerce.orders rc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON rc.customer_id = pg.customer_id
GROUP BY rc.customer_id, pg.customer_name, pg.region;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Ask &amp;quot;Who are our Platinum customers?&amp;quot; and the AI Agent generates SQL from your semantic layer. The wiki descriptions you attached explain what &amp;quot;Platinum&amp;quot; means (lifetime value &amp;gt; $50,000), so the Agent produces accurate results.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your catalog data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A sales director asks Claude &amp;quot;Show me our top 20 Gold and Platinum customers by lifetime value&amp;quot; and gets governed results from your Iceberg catalog.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate personalized engagement plans
SELECT
  customer_name,
  value_tier,
  lifetime_value,
  AI_GENERATE(
    &apos;Write a one-sentence personalized engagement recommendation&apos;,
    &apos;Customer: &apos; || customer_name || &apos;, Tier: &apos; || value_tier || &apos;, LTV: $&apos; || CAST(lifetime_value AS VARCHAR) || &apos;, Orders: &apos; || CAST(total_orders AS VARCHAR) || &apos;, Region: &apos; || region
  ) AS engagement_plan
FROM analytics.gold.customer_value
WHERE value_tier IN (&apos;Platinum&apos;, &apos;Gold&apos;);

-- Classify churn risk
SELECT
  customer_name,
  AI_CLASSIFY(
    &apos;Based on order patterns, classify churn risk&apos;,
    &apos;Orders: &apos; || CAST(total_orders AS VARCHAR) || &apos;, Avg Order: $&apos; || CAST(avg_order_value AS VARCHAR) || &apos;, LTV: $&apos; || CAST(lifetime_value AS VARCHAR),
    ARRAY[&apos;Low Risk&apos;, &apos;Moderate Risk&apos;, &apos;High Risk&apos;]
  ) AS churn_risk
FROM analytics.gold.customer_value;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Create Reflections on views to cache results and serve BI dashboards with sub-second response times.&lt;/p&gt;
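&lt;p&gt;As a hedged sketch, an aggregate Reflection on the &lt;code&gt;customer_value&lt;/code&gt; view defined above might look like the following — Reflection DDL details vary by Dremio version, and the Reflections UI is the simplest path; the dimension and measure choices here are assumptions based on typical dashboard grouping:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Pre-aggregate by the dimensions dashboards group on
ALTER VIEW analytics.gold.customer_value
CREATE AGGREGATE REFLECTION customer_value_agg
USING
  DIMENSIONS (region, value_tier)
  MEASURES (lifetime_value (SUM), total_orders (SUM));
&lt;/code&gt;&lt;/pre&gt;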
&lt;h2&gt;When to Use REST Catalog vs. Other Iceberg Catalogs&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use REST Catalog when:&lt;/strong&gt; Your organization uses Tabular, Apache Polaris, Gravitino, or another REST-compliant catalog server; you need a vendor-neutral catalog interface; you want portability across different compute engines (Dremio, Spark, Trino, Flink).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use AWS Glue when:&lt;/strong&gt; You&apos;re primarily in the AWS ecosystem and want tight integration with EMR, Athena, and AWS-native tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Dremio&apos;s Open Catalog when:&lt;/strong&gt; You want zero-configuration automatic table maintenance, Autonomous Reflections, and no external catalog setup.&lt;/p&gt;
&lt;p&gt;You can use multiple catalogs simultaneously — for example, REST Catalog for cross-engine shared tables and Dremio&apos;s Open Catalog for Dremio-specific analytical workloads.&lt;/p&gt;
&lt;h2&gt;Governance on REST Catalog Data&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance beyond what most REST catalogs provide natively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive columns from specific roles. A business analyst sees aggregated metrics but not individual customer data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by the querying user&apos;s role. Regional users see only their data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across REST Catalog, database sources, and other catalogs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC for BI tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query REST Catalog data from their IDE. Ask Copilot &amp;quot;Show me transaction trends from the Iceberg catalog&amp;quot; and get SQL generated using your semantic layer.&lt;/p&gt;
&lt;h2&gt;REST Catalog Protocol Details&lt;/h2&gt;
&lt;p&gt;The Iceberg REST Catalog protocol is an HTTP-based interface defined by the Apache Iceberg project. Any catalog that implements this protocol works with Dremio&apos;s connector. This includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Polaris (Incubating):&lt;/strong&gt; Open-source REST catalog originally developed by Snowflake&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tabular:&lt;/strong&gt; Managed Iceberg catalog service (now part of Databricks)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gravitino:&lt;/strong&gt; Apache-incubating multi-catalog governance platform&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon S3 Tables:&lt;/strong&gt; AWS-managed Iceberg tables with REST API access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom implementations:&lt;/strong&gt; Any service implementing the Iceberg REST spec&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio handles authentication through bearer tokens, OAuth 2.0, or custom headers, making it compatible with most enterprise authentication systems.&lt;/p&gt;
&lt;h3&gt;REST Catalog Endpoints&lt;/h3&gt;
&lt;p&gt;The Iceberg REST specification defines standard endpoints for catalog operations:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Dremio Support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;List namespaces&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/namespaces&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;List tables&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/namespaces/{ns}/tables&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/namespaces/{ns}/tables/{table}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Create table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/namespaces/{ns}/tables&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/namespaces/{ns}/tables/{table}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drop table&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DELETE /v1/namespaces/{ns}/tables/{table}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get config&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/config&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dremio uses these endpoints to discover tables, read metadata, perform DML operations, and manage table lifecycle — all through standard HTTP.&lt;/p&gt;
&lt;h3&gt;Multi-Catalog Architecture&lt;/h3&gt;
&lt;p&gt;Many organizations run multiple Iceberg catalogs for different purposes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;REST Catalog A (Polaris):&lt;/strong&gt; Shared enterprise data, governed access for all teams&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;REST Catalog B (S3 Tables):&lt;/strong&gt; AWS-native data, auto-managed by AWS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Open Catalog:&lt;/strong&gt; Dremio-specific analytical workloads with Autonomous Reflections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Glue:&lt;/strong&gt; Legacy Iceberg tables managed by existing EMR pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio connects to all of them simultaneously and federates across them. Views in the semantic layer can join tables from different catalogs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.unified_orders AS
SELECT o.order_id, o.order_total, c.customer_name, i.inventory_status
FROM &amp;quot;polaris-catalog&amp;quot;.ecommerce.orders o
JOIN &amp;quot;s3-tables&amp;quot;.customers.profiles c ON o.customer_id = c.customer_id
JOIN &amp;quot;glue-catalog&amp;quot;.warehouse.inventory i ON o.product_id = i.product_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Iceberg REST Catalog users can query, write, federate, and AI-enrich their Iceberg tables through Dremio Cloud — with governance, Reflections, and AI capabilities that no compute engine provides natively. The REST Catalog connector is the most future-proof choice for organizations adopting Iceberg: as new catalog implementations emerge (and the Iceberg ecosystem is expanding rapidly), this single connector supports them all.&lt;/p&gt;
&lt;p&gt;Start by connecting your REST catalog to Dremio Cloud, building a semantic layer over your most important tables, and enabling the AI Agent for natural language querying. The same views and Reflections work regardless of which REST catalog implementation you use — Apache Polaris today, S3 Tables tomorrow, or a custom catalog next year.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-iceberg-rest-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your REST catalog.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Databricks Unity Catalog to Dremio Cloud: Query Delta Lake Tables with Federation and AI</title><link>https://iceberglakehouse.com/posts/2026-03-connector-unity-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-unity-catalog/</guid><description>
Databricks Unity Catalog is Databricks&apos; governance layer for data and AI assets. It manages Delta Lake tables, machine learning models, feature store...</description><pubDate>Mon, 02 Mar 2026 02:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Databricks Unity Catalog is Databricks&apos; governance layer for data and AI assets. It manages Delta Lake tables, machine learning models, feature stores, and other data objects across Databricks workspaces. If your data engineering team uses Databricks for ETL and ML, your curated analytical datasets likely live in Unity Catalog as Delta Lake tables.&lt;/p&gt;
&lt;p&gt;With UniForm, Databricks generates Iceberg-compatible metadata for Delta Lake tables, making them readable by non-Databricks engines without data conversion. This is where Dremio Cloud enters the picture: connect to Unity Catalog through the UniForm Iceberg compatibility layer and query your Delta Lake tables alongside every other data source in your organization — with federation, governance, AI analytics, and performance acceleration that Databricks alone doesn&apos;t provide.&lt;/p&gt;
&lt;h2&gt;Why Unity Catalog Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Multi-Engine Analytics Beyond Databricks&lt;/h3&gt;
&lt;p&gt;Unity Catalog centralizes governance for Databricks. But your data consumers use tools beyond Databricks notebooks — Tableau, Power BI, custom Python applications, and business analysts who work in SQL. Dremio provides a high-performance SQL layer that serves all these tools via Arrow Flight (10-100x faster than JDBC/ODBC) or standard ODBC connections.&lt;/p&gt;
&lt;p&gt;Instead of provisioning Databricks SQL warehouses for BI workloads (which consume Databricks Units), route those queries through Dremio where Reflections cache results and Autonomous Reflections automatically optimize query performance.&lt;/p&gt;
&lt;h3&gt;Federate with Non-Databricks Sources&lt;/h3&gt;
&lt;p&gt;Your Delta Lake tables in Unity Catalog contain curated, processed analytics data. But your operational databases (PostgreSQL, SQL Server, Oracle) live outside Databricks. Your cloud warehouses (Snowflake, Redshift) hold other analytical datasets. Your raw files (S3, Azure Storage) contain event logs and unstructured data. Without a federation layer, combining these with Delta Lake data requires Databricks ingestion pipelines for each source.&lt;/p&gt;
&lt;p&gt;Dremio queries each source in place and joins them in a single SQL statement — no ingestion required.&lt;/p&gt;
&lt;h3&gt;Unified Governance Beyond Databricks&lt;/h3&gt;
&lt;p&gt;Unity Catalog governs data within Databricks. Dremio&apos;s Fine-Grained Access Control (FGAC) governs data across Unity Catalog, PostgreSQL, S3, BigQuery, and every other connected source. One set of column masking and row-level filtering policies, applied consistently everywhere.&lt;/p&gt;
&lt;h3&gt;AI Analytics on Delta Lake Data&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s semantic layer, AI Agent, MCP Server, and AI SQL Functions add capabilities that Databricks&apos; Genie doesn&apos;t replicate — particularly for cross-source analytics and integration with external AI clients like Claude and ChatGPT.&lt;/p&gt;
&lt;h3&gt;Credential Vending&lt;/h3&gt;
&lt;p&gt;Unity Catalog supports credential vending across AWS, Azure, and GCS. This means Dremio doesn&apos;t need separate S3 or Azure Storage credentials to access the underlying data files — the catalog provides temporary, scoped credentials automatically.&lt;/p&gt;
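&lt;p&gt;As a sketch of what vending looks like on the wire: in the Iceberg REST catalog specification, a loadTable response can carry temporary storage credentials in its &lt;code&gt;config&lt;/code&gt; map. The Python below shows how a client might pull those out. The payload and the S3 key names follow the Iceberg REST convention and are illustrative, not an actual Unity Catalog response.&lt;/p&gt;

```python
# Sketch: extracting vended storage credentials from an Iceberg REST
# catalog loadTable response. The sample payload is illustrative; the
# S3 key names follow the Iceberg REST catalog convention and may
# differ for your catalog.

def extract_vended_s3_credentials(load_table_result: dict) -> dict:
    """Pull temporary S3 credentials out of a LoadTableResult config map."""
    config = load_table_result.get("config", {})
    return {
        "access_key_id": config.get("s3.access-key-id"),
        "secret_access_key": config.get("s3.secret-access-key"),
        "session_token": config.get("s3.session-token"),
    }

# Illustrative response shape for a table whose catalog vends credentials
sample_response = {
    "metadata-location": "s3://bucket/warehouse/my_table/metadata/v42.metadata.json",
    "config": {
        "s3.access-key-id": "ASIA-EXAMPLE",
        "s3.secret-access-key": "secret-example",
        "s3.session-token": "token-example",
    },
}

creds = extract_vended_s3_credentials(sample_response)
print(creds["access_key_id"])  # ASIA-EXAMPLE
```

Because the credentials are scoped and short-lived, the engine re-requests them per table rather than storing long-lived storage keys.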
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Databricks workspace URL&lt;/strong&gt; — your Databricks deployment URL (e.g., &lt;code&gt;https://mycompany.cloud.databricks.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Personal Access Token (PAT)&lt;/strong&gt; or OAuth credentials for Databricks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UniForm enabled&lt;/strong&gt; on the Delta Lake tables you want to query (this generates Iceberg-compatible metadata)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage configuration&lt;/strong&gt; — AWS, Azure, or GCS (credential vending handles this if configured)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-unity-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Enabling UniForm on Delta Lake Tables&lt;/h3&gt;
&lt;p&gt;To make Delta Lake tables readable from Dremio, enable UniForm in Databricks:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- In Databricks, enable UniForm when creating a table
-- (Iceberg compatibility requires both table properties)
CREATE TABLE my_catalog.my_schema.my_table (
  id BIGINT,
  name STRING,
  value DOUBLE
) TBLPROPERTIES (
  &apos;delta.enableIcebergCompatV2&apos; = &apos;true&apos;,
  &apos;delta.universalFormat.enabledFormats&apos; = &apos;iceberg&apos;
);

-- Or alter an existing table
ALTER TABLE my_catalog.my_schema.my_table SET TBLPROPERTIES (
  &apos;delta.enableIcebergCompatV2&apos; = &apos;true&apos;,
  &apos;delta.universalFormat.enabledFormats&apos; = &apos;iceberg&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step-by-Step: Connect Unity Catalog to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;Unity Catalog&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;unity-catalog&lt;/code&gt; or &lt;code&gt;databricks-lakehouse&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workspace URL:&lt;/strong&gt; Your Databricks workspace URL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials:&lt;/strong&gt; Personal Access Token or OAuth credentials.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Select Catalogs and Schemas&lt;/h3&gt;
&lt;p&gt;Choose which Unity Catalog catalogs and schemas to expose in Dremio. Only tables with UniForm enabled will be readable.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set Reflection Refresh and Metadata schedules. More frequent metadata refreshes help Dremio discover new tables and schema changes faster.&lt;/p&gt;
&lt;h3&gt;5. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict access, then click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Delta Lake Tables via UniForm&lt;/h2&gt;
&lt;p&gt;From Dremio&apos;s perspective, UniForm tables appear as standard Iceberg tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query ML model predictions
SELECT
  customer_id,
  churn_probability,
  predicted_ltv,
  prediction_date
FROM &amp;quot;unity-catalog&amp;quot;.ml_models.customer_predictions
WHERE churn_probability &amp;gt; 0.7 AND prediction_date &amp;gt;= &apos;2024-06-01&apos;
ORDER BY churn_probability DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Non-Databricks Sources&lt;/h2&gt;
&lt;p&gt;Join Delta Lake model outputs with operational data from other systems:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Combine ML predictions with CRM data and support logs
SELECT
  uc.customer_id,
  uc.churn_probability,
  uc.predicted_ltv,
  pg.customer_name,
  pg.contract_end_date,
  pg.account_manager,
  s3.last_login_date,
  s3.support_tickets_30d,
  CASE
    WHEN uc.churn_probability &amp;gt; 0.8 AND pg.contract_end_date &amp;lt; CURRENT_DATE + INTERVAL &apos;90&apos; DAY THEN &apos;Critical - Immediate Action&apos;
    WHEN uc.churn_probability &amp;gt; 0.7 THEN &apos;High Risk - Outreach Needed&apos;
    WHEN uc.churn_probability &amp;gt; 0.5 THEN &apos;Watch List&apos;
    ELSE &apos;Healthy&apos;
  END AS action_required
FROM &amp;quot;unity-catalog&amp;quot;.ml_models.customer_predictions uc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON uc.customer_id = pg.customer_id
LEFT JOIN &amp;quot;s3-logs&amp;quot;.activity.user_activity s3 ON uc.customer_id = s3.user_id
WHERE uc.churn_probability &amp;gt; 0.5
ORDER BY uc.churn_probability DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Three data systems (Databricks, PostgreSQL, S3), one query, and actionable churn intervention recommendations.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_risk_dashboard AS
SELECT
  uc.customer_id,
  pg.customer_name,
  pg.region,
  uc.churn_probability,
  uc.predicted_ltv,
  CASE
    WHEN uc.predicted_ltv &amp;gt; 100000 THEN &apos;Enterprise&apos;
    WHEN uc.predicted_ltv &amp;gt; 25000 THEN &apos;Mid-Market&apos;
    ELSE &apos;SMB&apos;
  END AS value_segment,
  CASE
    WHEN uc.churn_probability &amp;gt; 0.7 THEN &apos;High Risk&apos;
    WHEN uc.churn_probability &amp;gt; 0.4 THEN &apos;Moderate Risk&apos;
    ELSE &apos;Low Risk&apos;
  END AS risk_tier
FROM &amp;quot;unity-catalog&amp;quot;.ml_models.customer_predictions uc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON uc.customer_id = pg.customer_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon), then click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. This creates business descriptions like &amp;quot;customer_risk_dashboard: Contains one row per customer combining ML churn predictions from Databricks with CRM account details&amp;quot; — context that powers AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Delta Lake Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets business users ask questions in plain English: &amp;quot;Which enterprise customers are at high risk of churning?&amp;quot; or &amp;quot;Show me our top 10 customers by predicted lifetime value.&amp;quot; The Agent reads your wiki descriptions to understand what &amp;quot;enterprise,&amp;quot; &amp;quot;high risk,&amp;quot; and &amp;quot;lifetime value&amp;quot; mean in your data context, then generates accurate SQL.&lt;/p&gt;
&lt;p&gt;This is particularly powerful for Delta Lake data because model outputs (churn scores, predictions) often need business interpretation. The semantic layer bridges the gap between ML model outputs and business-friendly analytics.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude, ChatGPT, and other AI clients to your Dremio data. Setup:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt; for Claude, &lt;code&gt;https://chatgpt.com/connector_platform_oauth_redirect&lt;/code&gt; for ChatGPT)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt; (US) or &lt;code&gt;mcp.eu.dremio.cloud/mcp/{project_id}&lt;/code&gt; (EU)&lt;/li&gt;
&lt;/ol&gt;
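&lt;p&gt;For illustration, the regional endpoints above can be assembled with a small helper. The https scheme and the function name are assumptions for this sketch:&lt;/p&gt;

```python
def mcp_endpoint(project_id: str, region: str = "us") -> str:
    """Build the hosted MCP Server URL for a Dremio Cloud project.

    Assumes an https scheme; region is "us" or "eu", matching the
    two hosts listed above.
    """
    host = "mcp.dremio.cloud" if region == "us" else "mcp.eu.dremio.cloud"
    return f"https://{host}/mcp/{project_id}"

print(mcp_endpoint("my-project-id"))               # https://mcp.dremio.cloud/mcp/my-project-id
print(mcp_endpoint("my-project-id", region="eu"))  # https://mcp.eu.dremio.cloud/mcp/my-project-id
```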
&lt;p&gt;A customer success manager can ask Claude &amp;quot;Show me all high-risk enterprise customers with contracts ending in the next 90 days&amp;quot; and get accurate, governed results from your Unity Catalog ML predictions — without knowing SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against Unity Catalog data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate personalized retention messages for at-risk customers
SELECT
  customer_name,
  churn_probability,
  predicted_ltv,
  AI_GENERATE(
    &apos;Write a one-sentence personalized retention offer for this at-risk customer&apos;,
    &apos;Customer: &apos; || customer_name || &apos;, Segment: &apos; || value_segment || &apos;, Risk: &apos; || risk_tier || &apos;, LTV: $&apos; || CAST(predicted_ltv AS VARCHAR)
  ) AS retention_message
FROM analytics.gold.customer_risk_dashboard
WHERE risk_tier = &apos;High Risk&apos; AND value_segment = &apos;Enterprise&apos;;

-- Classify intervention urgency
SELECT
  customer_name,
  AI_CLASSIFY(
    &apos;Based on these risk factors, classify the intervention urgency&apos;,
    &apos;Churn probability: &apos; || CAST(churn_probability AS VARCHAR) || &apos;, LTV: $&apos; || CAST(predicted_ltv AS VARCHAR) || &apos;, Segment: &apos; || value_segment,
    ARRAY[&apos;Immediate&apos;, &apos;This Week&apos;, &apos;This Month&apos;, &apos;Monitor&apos;]
  ) AS urgency
FROM analytics.gold.customer_risk_dashboard
WHERE risk_tier IN (&apos;High Risk&apos;, &apos;Moderate Risk&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Important Notes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read-only access.&lt;/strong&gt; Dremio connects to Unity Catalog tables in read-only mode. Write operations continue through Databricks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UniForm required.&lt;/strong&gt; Only Delta Lake tables with UniForm enabled appear as queryable Iceberg tables in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table format transparency.&lt;/strong&gt; From Dremio&apos;s perspective, UniForm tables look and behave like standard Iceberg tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credential vending.&lt;/strong&gt; When configured, Dremio receives temporary credentials from Unity Catalog, simplifying storage access.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Create Reflections on Unity Catalog views to cache results and serve dashboard queries without re-reading Delta Lake files:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and aggregations&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — balance between data freshness and compute cost&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected to Dremio get sub-second response times from Reflections. This can eliminate the need for dedicated Databricks SQL warehouses for read-heavy BI workloads.&lt;/p&gt;
&lt;h2&gt;Governance Across Unity Catalog and Other Sources&lt;/h2&gt;
&lt;p&gt;Unity Catalog governs data within Databricks. Dremio&apos;s Fine-Grained Access Control (FGAC) extends governance across Unity Catalog and every other connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask churn probability or predicted LTV from specific roles. A sales rep sees risk tier but not raw probability scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional account managers see only customers in their territory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same rules apply across Unity Catalog, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC. For Delta Lake data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector — avoids Databricks SQL warehouse costs for BI&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access to ML model outputs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
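&lt;p&gt;A minimal sketch of the &lt;code&gt;pyarrow.flight&lt;/code&gt; path, assuming the Dremio Cloud Flight endpoint and a personal access token for bearer auth; verify the endpoint and header format for your region before relying on this:&lt;/p&gt;

```python
# Sketch: querying Dremio over Arrow Flight from Python.
# The endpoint URL and token placeholder are assumptions to adapt
# to your own Dremio Cloud account.

def auth_headers(token: str) -> list:
    """gRPC metadata carrying a Dremio personal access token."""
    return [(b"authorization", b"bearer " + token.encode())]

def dremio_flight_query(endpoint: str, token: str, sql: str):
    """Run a SQL query against Dremio via Arrow Flight, return a DataFrame."""
    import pyarrow.flight as flight  # deferred so the sketch imports without pyarrow

    client = flight.FlightClient(endpoint)
    options = flight.FlightCallOptions(headers=auth_headers(token))
    # Ask the server how to fetch results for this SQL command...
    info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
    # ...then stream the Arrow record batches back.
    reader = client.do_get(info.endpoints[0].ticket, options)
    return reader.read_all().to_pandas()

# Example (requires network access and a valid token):
# df = dremio_flight_query(
#     "grpc+tls://data.dremio.cloud:443",
#     "<personal-access-token>",
#     "SELECT * FROM analytics.gold.customer_risk_dashboard LIMIT 100",
# )
```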
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot lets developers query Unity Catalog data from their IDE. Ask Copilot &amp;quot;Show me high-risk enterprise customers from the churn model&amp;quot; and get SQL from your semantic layer — without switching to Databricks notebooks or SQL warehouses.&lt;/p&gt;
&lt;h2&gt;When to Use Dremio vs. Databricks SQL&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use Dremio when:&lt;/strong&gt; You need cross-source federation (joining Delta Lake with PostgreSQL, S3, Snowflake), you want AI analytics on federated data, you need governance across multiple data sources, you want Reflection-based caching for BI tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Databricks SQL when:&lt;/strong&gt; You need write-heavy workloads on Delta Lake, you&apos;re running Databricks-native jobs (streaming, ML training), your queries use Databricks-specific SQL extensions.&lt;/p&gt;
&lt;p&gt;Both can coexist — Databricks for data engineering and ML, Dremio for federated analytics, AI, and BI serving.&lt;/p&gt;
&lt;h2&gt;Delta Lake Tables in Dremio&lt;/h2&gt;
&lt;p&gt;Dremio reads UniForm-enabled Delta Lake tables from Unity Catalog through their Iceberg metadata, preserving the table capabilities analysts rely on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time travel:&lt;/strong&gt; Query tables at specific versions using the snapshot history UniForm exposes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema evolution:&lt;/strong&gt; Dremio automatically detects schema changes made by Databricks jobs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition pruning:&lt;/strong&gt; Partition statistics let Dremio skip irrelevant data files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column statistics:&lt;/strong&gt; Min/max statistics enable efficient predicate pushdown&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For write operations, use Dremio&apos;s Open Catalog with Iceberg tables for new analytical workloads. Unity Catalog remains the source of truth for Databricks-managed Delta Lake tables.&lt;/p&gt;
&lt;h2&gt;Databricks Cost Optimization&lt;/h2&gt;
&lt;p&gt;Databricks pricing is based on Databricks Units (DBUs) consumed by SQL warehouses, clusters, and jobs. Dremio helps optimize costs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BI serving:&lt;/strong&gt; Instead of running a Databricks SQL warehouse 24/7 for dashboards, create Reflections in Dremio. Dashboard queries hit Dremio, and the SQL warehouse can auto-stop.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ad-hoc exploration:&lt;/strong&gt; Analysts query Dremio&apos;s cached Reflections instead of waking Databricks clusters. Less start/stop overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-source queries:&lt;/strong&gt; Joining Delta Lake with PostgreSQL or S3 doesn&apos;t require moving all data into Databricks — Dremio federates in place.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For organizations spending $50K+/month on Databricks, routing read-heavy analytical workloads through Dremio can reduce DBU consumption by 30-50% on those workloads.&lt;/p&gt;
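&lt;p&gt;A back-of-envelope version of that estimate, with every input an illustrative assumption rather than actual Databricks pricing:&lt;/p&gt;

```python
# Back-of-envelope estimate of the DBU savings described above.
# All numbers are illustrative assumptions, not Databricks pricing.

monthly_spend = 50_000.0   # total Databricks spend ($/month)
read_heavy_share = 0.40    # fraction of spend driven by BI and ad-hoc reads
offload_efficiency = 0.75  # fraction of that read spend Dremio absorbs

savings = monthly_spend * read_heavy_share * offload_efficiency
reduction_on_reads = offload_efficiency * 100

print(f"Estimated monthly savings: ${savings:,.0f}")          # $15,000
print(f"Reduction on read-heavy workloads: {reduction_on_reads:.0f}%")  # 75%
```

With these inputs the reduction on the read-heavy slice lands above the 30-50% range quoted; tune the two fractions to your own workload mix.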
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Unity Catalog users can extend their Databricks investment with Dremio&apos;s federation, AI analytics, and performance acceleration — without moving data out of Delta Lake. Dremio and Databricks are complementary: Databricks handles data engineering, ML training, and streaming workloads on Delta Lake tables, while Dremio serves analytical queries, BI dashboards, and AI-powered natural language access across your entire data estate.&lt;/p&gt;
&lt;p&gt;Connect your Unity Catalog to Dremio Cloud, build Reflections on frequently queried tables, and enable the AI Agent for business users who need answers without writing SQL.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-unity-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Unity Catalog.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Snowflake Open Catalog to Dremio Cloud: Multi-Engine Iceberg Analytics</title><link>https://iceberglakehouse.com/posts/2026-03-connector-snowflake-open-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-snowflake-open-catalog/</guid><description>
Snowflake Open Catalog is Snowflake&apos;s managed implementation of the Apache Iceberg REST catalog specification, based on the open-source Apache Polari...</description><pubDate>Mon, 02 Mar 2026 01:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Snowflake Open Catalog is Snowflake&apos;s managed implementation of the Apache Iceberg REST catalog specification, based on the open-source Apache Polaris project. It serves as a centralized metadata catalog for Apache Iceberg tables, enabling multiple compute engines — including Dremio, Spark, Trino, and Flink — to read from and write to the same Iceberg tables without metadata conflicts.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Snowflake Open Catalog as a first-class Iceberg data source. You get full read and write access to Iceberg tables, automatic table maintenance (compaction, manifest optimization, vacuuming), and the ability to federate catalog data with databases, object storage, cloud warehouses, and other catalogs — all through standard SQL.&lt;/p&gt;
&lt;p&gt;For organizations already invested in Snowflake, the Open Catalog is a strategic choice for multi-engine interoperability. Unlike Snowflake&apos;s proprietary internal catalog (which is only accessible through Snowflake compute), the Open Catalog exposes Iceberg metadata via a standard REST API. This means you&apos;re not locked into Snowflake compute for every analytical query — Dremio can read the same tables at a fraction of the credit cost for repetitive workloads. Dremio also provides its federated engine, Reflections, governance, and AI capabilities — all without duplicating data or metadata.&lt;/p&gt;
&lt;h2&gt;Why Snowflake Open Catalog Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Multi-Engine Strategy Without Vendor Lock-In&lt;/h3&gt;
&lt;p&gt;Snowflake Open Catalog is designed for multi-engine compatibility, which makes it an ideal complement to Dremio. By connecting Dremio to your Snowflake Open Catalog, you add a query engine that specializes in areas Snowflake doesn&apos;t:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Federation:&lt;/strong&gt; Join catalog tables with PostgreSQL, MongoDB, S3, BigQuery, and any other Dremio-connected source in a single SQL query — something Snowflake can&apos;t do natively with non-Snowflake sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Autonomous performance management:&lt;/strong&gt; Dremio automatically compacts files, rewrites manifests, and builds Reflections based on query patterns for external catalog tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI-powered querying:&lt;/strong&gt; Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions bring LLM capabilities to your catalog data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Cost Optimization&lt;/h3&gt;
&lt;p&gt;Instead of running all workloads through Snowflake credits, offload analytical queries to Dremio. Dremio&apos;s Reflections cache results so repeated queries don&apos;t consume Snowflake credits. For organizations spending significant amounts on Snowflake compute, routing read-heavy analytical workloads through Dremio can reduce overall costs.&lt;/p&gt;
&lt;h3&gt;Federate with Non-Snowflake Sources&lt;/h3&gt;
&lt;p&gt;Snowflake&apos;s data sharing works within Snowflake. But what if you need to join your Snowflake Open Catalog data with PostgreSQL application data, MongoDB user profiles, or S3 raw event logs? Dremio&apos;s federation engine does exactly that — no ETL pipelines, no data duplication.&lt;/p&gt;
&lt;h3&gt;Credential Vending&lt;/h3&gt;
&lt;p&gt;Snowflake Open Catalog supports credential vending, meaning Dremio doesn&apos;t need separate storage credentials to access the underlying S3, Azure, or GCS data. The catalog provides temporary, scoped credentials for accessing data files. This simplifies security configuration and reduces the number of credentials you need to manage.&lt;/p&gt;
&lt;h3&gt;Write Support for External Catalogs&lt;/h3&gt;
&lt;p&gt;Dremio can write to external Snowflake Open Catalogs, enabling you to create tables, run transformations, and build data pipelines using Dremio&apos;s SQL engine while keeping metadata managed in Snowflake&apos;s catalog.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before connecting to Snowflake Open Catalog, confirm you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snowflake Open Catalog account URL&lt;/strong&gt; — the endpoint for your catalog instance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OAuth or Personal Access Token (PAT) credentials&lt;/strong&gt; — for authenticating to the catalog&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog names&lt;/strong&gt; — the specific catalogs you want to access (internal read-only and/or external read-write)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage configuration&lt;/strong&gt; — if credential vending isn&apos;t available for your setup, you&apos;ll need S3, Azure, or GCS credentials for the underlying data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-snowflake-open-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Snowflake Open Catalog to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar and select &lt;strong&gt;Snowflake Open Catalog&lt;/strong&gt; from the catalog source types.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;snowflake-open-catalog&lt;/code&gt; or &lt;code&gt;lakehouse-catalog&lt;/code&gt;). This appears in SQL queries as the source prefix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog URL:&lt;/strong&gt; The Snowflake Open Catalog endpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials:&lt;/strong&gt; OAuth client ID/secret or a Personal Access Token.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Select Catalogs&lt;/h3&gt;
&lt;p&gt;Choose which catalogs to enable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Internal catalogs&lt;/strong&gt; are read-only from Dremio&apos;s perspective — you can query but not write.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;External catalogs&lt;/strong&gt; support full read and write operations (INSERT, UPDATE, DELETE, MERGE).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set Reflection Refresh and Metadata schedules. For catalogs with frequently changing tables, more frequent metadata refreshes ensure Dremio sees new tables and schema changes quickly.&lt;/p&gt;
&lt;h3&gt;5. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict which Dremio users can access this catalog. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Snowflake Open Catalog Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query an Iceberg table managed by Snowflake Open Catalog
SELECT customer_id, customer_name, total_spend, signup_date
FROM &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary
WHERE total_spend &amp;gt; 10000 AND signup_date &amp;gt;= &apos;2024-01-01&apos;
ORDER BY total_spend DESC;

-- Write to an external catalog
INSERT INTO &amp;quot;sf-open-catalog&amp;quot;.analytics.monthly_metrics
SELECT
  DATE_TRUNC(&apos;month&apos;, order_date) AS month,
  COUNT(*) AS order_count,
  SUM(total_amount) AS revenue
FROM &amp;quot;sf-open-catalog&amp;quot;.ecommerce.orders
GROUP BY 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Other Sources&lt;/h2&gt;
&lt;p&gt;Join catalog data with non-Snowflake sources in a single query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  soc.customer_name,
  soc.total_spend AS catalog_spend,
  pg.region,
  pg.account_manager,
  s3.support_ticket_count,
  CASE
    WHEN soc.total_spend &amp;gt; 100000 AND s3.support_ticket_count &amp;lt; 3 THEN &apos;Platinum&apos;
    WHEN soc.total_spend &amp;gt; 50000 THEN &apos;Gold&apos;
    WHEN soc.total_spend &amp;gt; 10000 THEN &apos;Silver&apos;
    ELSE &apos;Standard&apos;
  END AS customer_tier
FROM &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary soc
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers pg ON soc.customer_id = pg.customer_id
LEFT JOIN &amp;quot;s3-support&amp;quot;.tickets.customer_tickets s3 ON soc.customer_id = s3.customer_id
ORDER BY catalog_spend DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;p&gt;Create views that combine catalog data with business logic:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_health AS
SELECT
  soc.customer_id,
  soc.customer_name,
  soc.total_spend,
  soc.signup_date,
  CASE
    WHEN soc.total_spend &amp;gt; 100000 THEN &apos;Enterprise&apos;
    WHEN soc.total_spend &amp;gt; 25000 THEN &apos;Mid-Market&apos;
    ELSE &apos;SMB&apos;
  END AS customer_segment,
  ROUND(soc.total_spend / GREATEST(DATEDIFF(&apos;MONTH&apos;, soc.signup_date, CURRENT_DATE), 1), 2) AS monthly_spend_rate
FROM &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary soc;
&lt;/code&gt;&lt;/pre&gt;
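&lt;p&gt;The segmentation and rate logic in this view is easy to sanity-check outside SQL. This Python mirror uses the same thresholds; note how the GREATEST guard in the view (here a &lt;code&gt;max(..., 1)&lt;/code&gt;) prevents division by zero for customers who signed up in the current month. Function names are hypothetical, for illustration only.&lt;/p&gt;

```python
# Python mirror of the view's CASE expression and monthly_spend_rate
# calculation, using the same thresholds as the SQL above.
from datetime import date

def months_between(start: date, end: date) -> int:
    """Whole-month difference, mirroring DATEDIFF('MONTH', start, end)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def customer_segment(total_spend: float) -> str:
    # Same thresholds as the view's CASE expression
    if total_spend > 100_000:
        return "Enterprise"
    if total_spend > 25_000:
        return "Mid-Market"
    return "SMB"

def monthly_spend_rate(total_spend: float, signup: date, today: date) -> float:
    # max(..., 1) mirrors GREATEST(..., 1): no divide-by-zero for new signups
    months = max(months_between(signup, today), 1)
    return round(total_spend / months, 2)

print(customer_segment(120_000))                                       # Enterprise
print(monthly_spend_rate(30_000, date(2024, 1, 15), date(2024, 7, 1)))  # 5000.0
```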
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) on this view, then click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. This creates the business context that powers Dremio&apos;s AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask questions in plain English. For example: &amp;quot;Who are our highest-spending enterprise customers?&amp;quot; The Agent reads your wiki descriptions and view definitions to generate the correct SQL. Better wikis produce better results — describe what &amp;quot;enterprise customer&amp;quot; and &amp;quot;monthly spend rate&amp;quot; mean in business terms.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; extends AI capabilities to Claude, ChatGPT, and other AI chat clients. Connect through the hosted MCP Server:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Your team can then ask Claude &amp;quot;Show me customer health trends from our Snowflake catalog data&amp;quot; and get governed, accurate results without writing SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Enrich catalog data with AI inline in your queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  customer_name,
  total_spend,
  AI_CLASSIFY(
    &apos;Based on spending patterns, classify customer risk of churn&apos;,
    &apos;Customer: &apos; || customer_name || &apos;, Total Spend: $&apos; || CAST(total_spend AS VARCHAR) || &apos;, Months Active: &apos; || CAST(months_active AS VARCHAR),
    ARRAY[&apos;Low Risk&apos;, &apos;Moderate Risk&apos;, &apos;High Risk&apos;, &apos;Critical&apos;]
  ) AS churn_risk
FROM &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary
WHERE total_spend &amp;gt; 5000;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; runs LLM inference in your SQL query. &lt;code&gt;AI_GENERATE&lt;/code&gt; produces narrative summaries, and &lt;code&gt;AI_SIMILARITY&lt;/code&gt; finds semantic matches between text fields.&lt;/p&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Create Reflections on frequently queried views to cache results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, select the view and click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed metrics)&lt;/li&gt;
&lt;li&gt;Select columns and aggregations to include&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — balance freshness against compute cost&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected via Arrow Flight or ODBC get sub-second responses from Reflections instead of re-reading Iceberg files from storage. This reduces Snowflake credit consumption for workloads routed through Dremio.&lt;/p&gt;
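&lt;p&gt;Reflections can also be created with SQL instead of the UI; a minimal sketch for a raw Reflection (column choices are illustrative, and DDL syntax may vary by Dremio version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Cache the view&apos;s rows so BI queries stop re-reading Iceberg files
ALTER TABLE &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary
CREATE RAW REFLECTION customer_summary_raw
USING DISPLAY (customer_name, total_spend, months_active);
&lt;/code&gt;&lt;/pre&gt;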
&lt;h2&gt;Governance Across Snowflake Open Catalog and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance that spans Snowflake Open Catalog and all other sources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive customer data from specific roles. A marketing analyst sees spending behavior but not PII.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional users see only their region&apos;s data automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; One set of governance rules applies across Snowflake Open Catalog, database connectors, and other external catalogs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
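&lt;p&gt;As an illustration, column masking in Dremio is defined as a UDF attached as a policy; the table, column, and role names below are hypothetical, and syntax may vary by version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Mask email addresses for everyone outside the support role
CREATE FUNCTION mask_email (email VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE WHEN is_member(&apos;support&apos;) THEN email ELSE &apos;REDACTED&apos; END;

ALTER TABLE &amp;quot;sf-open-catalog&amp;quot;.analytics.customer_summary
MODIFY COLUMN customer_email SET MASKING POLICY mask_email (customer_email);
&lt;/code&gt;&lt;/pre&gt;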
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for high-speed programmatic access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Snowflake Open Catalog data from their IDE. Ask Copilot &amp;quot;Show me customer churn risk from the catalog&amp;quot; and get SQL generated using your semantic layer — without switching tools.&lt;/p&gt;
&lt;h2&gt;When to Use Snowflake Open Catalog vs. Other Catalogs&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use Snowflake Open Catalog when:&lt;/strong&gt; You&apos;re already in the Snowflake ecosystem and want multi-engine Iceberg access, or your team uses Snowflake for data management but needs Dremio for federation and AI.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use AWS Glue when:&lt;/strong&gt; You&apos;re AWS-native and want tight integration with EMR, Athena, and S3.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Dremio&apos;s Open Catalog when:&lt;/strong&gt; You want zero-configuration automatic maintenance, Autonomous Reflections, and no external catalog dependencies.&lt;/p&gt;
&lt;p&gt;You can connect multiple catalogs simultaneously. Many organizations use Snowflake Open Catalog for shared enterprise data and Dremio&apos;s Open Catalog for Dremio-specific analytical workloads.&lt;/p&gt;
&lt;h2&gt;Credential Vending in Detail&lt;/h2&gt;
&lt;p&gt;Credential vending is a key feature of Snowflake Open Catalog that simplifies Dremio&apos;s access to underlying storage. Here&apos;s how it works:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;You configure storage in Snowflake Open Catalog&lt;/strong&gt; — specify the S3, Azure, or GCS bucket where Iceberg data files live.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When Dremio queries a table&lt;/strong&gt;, it requests access from the catalog API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowflake Open Catalog returns temporary, scoped credentials&lt;/strong&gt; — short-lived tokens with permissions limited to the specific data files needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio uses these credentials&lt;/strong&gt; to read (or write, for external catalogs) directly from storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials expire automatically&lt;/strong&gt; — no long-lived keys to rotate or manage.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This means your Dremio Cloud connection needs only the catalog API credentials (OAuth), not separate storage credentials for every S3 bucket or Azure container. One connection, automatic credential management, reduced security surface area.&lt;/p&gt;
&lt;h2&gt;Multi-Engine Architecture with Snowflake Open Catalog&lt;/h2&gt;
&lt;p&gt;Snowflake Open Catalog enables a powerful multi-engine architecture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snowflake:&lt;/strong&gt; Data engineering, SQL analytics, and catalog management&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio:&lt;/strong&gt; Federation, AI analytics, and Reflection-based BI serving&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Spark:&lt;/strong&gt; Large-scale data processing and ML model training&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trino/Presto:&lt;/strong&gt; Ad-hoc query engine for open-source workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All engines read from the same Iceberg tables managed by Snowflake Open Catalog — no data duplication, no metadata sync issues, no format conversion. Each engine reads the latest table metadata from the catalog and accesses data files via credential vending.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s unique contribution to this architecture is federation (joining catalog tables with non-Iceberg sources), AI capabilities (Agent, MCP, SQL Functions), and Reflections (sub-second BI serving without re-reading storage).&lt;/p&gt;
&lt;h2&gt;Snowflake Open Catalog vs. Apache Polaris&lt;/h2&gt;
&lt;p&gt;Snowflake Open Catalog is based on the open-source Apache Polaris (incubating) project. Key differences:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Snowflake Open Catalog&lt;/th&gt;
&lt;th&gt;Apache Polaris (self-managed)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed by Snowflake&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Credential Vending&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Requires configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snowflake OAuth&lt;/td&gt;
&lt;td&gt;Custom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snowflake support&lt;/td&gt;
&lt;td&gt;Community&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snowflake pricing&lt;/td&gt;
&lt;td&gt;Infrastructure costs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you use Snowflake&apos;s managed offering, you get turnkey catalog management. If you prefer self-managed, Apache Polaris works with Dremio&apos;s Iceberg REST Catalog connector.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Snowflake Open Catalog users can build a truly multi-engine lakehouse — manage Iceberg metadata in Snowflake&apos;s infrastructure while querying with Dremio&apos;s federated engine, AI capabilities, and Reflection-based acceleration.&lt;/p&gt;
&lt;p&gt;Connect your Snowflake Open Catalog to Dremio Cloud, build views over your Iceberg tables, and start leveraging AI Agent, MCP Server, and Reflections for cost-optimized analytical serving. The setup takes minutes and works immediately.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-snowflake-open-catalog-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Snowflake Open Catalog.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect AWS Glue Data Catalog to Dremio Cloud: Query and Manage Your AWS Iceberg Tables</title><link>https://iceberglakehouse.com/posts/2026-03-connector-aws-glue/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-aws-glue/</guid><description>
AWS Glue Data Catalog is AWS&apos;s managed metadata service for data lakes. It stores table definitions, schemas, partition information, and statistics f...</description><pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;AWS Glue Data Catalog is AWS&apos;s managed metadata service for data lakes. It stores table definitions, schemas, partition information, and statistics for data stored in Amazon S3. If you&apos;ve built your data lake on AWS using Apache Spark (on EMR), AWS Glue ETL jobs, or Amazon Athena, your table metadata lives in Glue. But Glue is just a catalog — a registry of what&apos;s where. To actually query the data, you need Athena (per-TB pricing), EMR clusters (infrastructure management), or Redshift Spectrum (additional cost).&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to your Glue Data Catalog and queries the underlying Iceberg tables with full read and write support. You get enterprise-grade SQL, Reflections for query acceleration, governance, and AI analytics — all on top of your existing Glue-managed lakehouse.&lt;/p&gt;
&lt;h2&gt;Why Glue Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Query Without Athena&apos;s Per-TB Pricing&lt;/h3&gt;
&lt;p&gt;Athena charges per terabyte of data scanned, regardless of whether the query is the same one you ran 5 minutes ago. For teams running dashboard queries, scheduled reports, and ad-hoc exploration, this pricing model creates unpredictable costs. Dremio&apos;s Reflections cache results so repeated queries don&apos;t re-scan S3. C3 (Columnar Cloud Cache) caches file data on local NVMe for frequently accessed datasets. You pay for Dremio compute time, not per-TB scanned.&lt;/p&gt;
&lt;h3&gt;Full Read and Write on Iceberg Tables&lt;/h3&gt;
&lt;p&gt;Dremio supports full DML (INSERT, UPDATE, DELETE, MERGE) on Glue-cataloged Iceberg tables. Create tables, run transformations, build data pipelines, and maintain your lakehouse entirely through Dremio&apos;s SQL engine — no need to spin up EMR clusters or Glue ETL jobs for simple transformations.&lt;/p&gt;
&lt;h3&gt;Federate Glue with Non-AWS Sources&lt;/h3&gt;
&lt;p&gt;Your Glue-managed data lake covers AWS data, but your application database is on Azure (Azure SQL), your analytics warehouse is Snowflake, and your marketing data is in Google BigQuery. Dremio federates across all of them in a single SQL query.&lt;/p&gt;
&lt;h3&gt;Automated Iceberg Maintenance&lt;/h3&gt;
&lt;p&gt;Dremio automatically compacts small files into optimally sized ones, rewrites manifests for faster metadata reads, and clusters data based on query patterns — all on Glue-cataloged Iceberg tables. This eliminates the need for manual &lt;code&gt;OPTIMIZE&lt;/code&gt; jobs or scheduled Glue ETL maintenance tasks.&lt;/p&gt;
&lt;h3&gt;Credential Vending&lt;/h3&gt;
&lt;p&gt;Dremio uses Glue&apos;s credential vending to securely access the underlying S3 data without separate S3 credentials. The catalog provides temporary, scoped credentials for each data request.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWS Account&lt;/strong&gt; with Glue Data Catalog configured&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IAM Role&lt;/strong&gt; with permissions: &lt;code&gt;glue:GetDatabase&lt;/code&gt;, &lt;code&gt;glue:GetTable&lt;/code&gt;, &lt;code&gt;glue:GetTables&lt;/code&gt;, &lt;code&gt;glue:GetPartitions&lt;/code&gt;, and S3 read/write permissions for underlying data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Region&lt;/strong&gt; where your Glue catalog is deployed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-aws-glue-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Glue to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;glue-catalog&lt;/code&gt; or &lt;code&gt;aws-lakehouse&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Region:&lt;/strong&gt; The region where your Glue catalog is deployed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Provide IAM Role ARN (recommended for Dremio Cloud) or AWS Access Key/Secret Key.&lt;/p&gt;
&lt;h3&gt;4. Select Databases&lt;/h3&gt;
&lt;p&gt;Choose which Glue databases to expose. You can enable specific databases or allow access to all.&lt;/p&gt;
&lt;h3&gt;5. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set Reflection Refresh, Metadata refresh intervals, and click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
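&lt;p&gt;Once saved, you can sanity-check that the source&apos;s databases and tables are visible to Dremio; for example (source name illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- List tables Dremio now sees under the new Glue source
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.&amp;quot;TABLES&amp;quot;
WHERE TABLE_SCHEMA LIKE &apos;glue-catalog%&apos;;
&lt;/code&gt;&lt;/pre&gt;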
&lt;h2&gt;Query and Write to Glue Iceberg Tables&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query a Glue-cataloged Iceberg table
SELECT product_id, product_name, category, price, inventory_count
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products
WHERE category = &apos;Electronics&apos; AND price &amp;gt; 50 AND inventory_count &amp;gt; 0
ORDER BY price ASC;

-- Write to Glue Iceberg tables
INSERT INTO &amp;quot;glue-catalog&amp;quot;.analytics.daily_summary
SELECT
  DATE_TRUNC(&apos;day&apos;, order_date) AS day,
  COUNT(*) AS order_count,
  SUM(total) AS revenue,
  AVG(total) AS avg_order_value
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.orders
WHERE order_date = CURRENT_DATE - INTERVAL &apos;1&apos; DAY
GROUP BY 1;

-- MERGE for upserts
MERGE INTO &amp;quot;glue-catalog&amp;quot;.analytics.product_metrics AS target
USING (
  SELECT product_id, COUNT(*) AS orders, SUM(quantity) AS units_sold
  FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.order_items
  WHERE order_date &amp;gt;= CURRENT_DATE - INTERVAL &apos;7&apos; DAY
  GROUP BY product_id
) AS source
ON target.product_id = source.product_id
WHEN MATCHED THEN UPDATE SET orders = source.orders, units_sold = source.units_sold
WHEN NOT MATCHED THEN INSERT (product_id, orders, units_sold) VALUES (source.product_id, source.orders, source.units_sold);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Non-AWS Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Glue products with external review and supplier data
SELECT
  g.product_name,
  g.price,
  g.category,
  pg.avg_rating,
  pg.review_count,
  sf.supplier_name,
  sf.lead_time_days
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products g
LEFT JOIN &amp;quot;postgres-reviews&amp;quot;.public.product_reviews pg ON g.product_id = pg.product_id
LEFT JOIN &amp;quot;snowflake-supply&amp;quot;.PUBLIC.SUPPLIERS sf ON g.supplier_id = sf.supplier_id
WHERE g.category = &apos;Electronics&apos;
ORDER BY pg.avg_rating DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.product_performance AS
SELECT
  g.product_id,
  g.product_name,
  g.category,
  g.price,
  SUM(oi.quantity) AS units_sold,
  SUM(oi.quantity * g.price) AS revenue,
  CASE
    WHEN SUM(oi.quantity) &amp;gt; 1000 THEN &apos;Best Seller&apos;
    WHEN SUM(oi.quantity) &amp;gt; 100 THEN &apos;Popular&apos;
    ELSE &apos;Niche&apos;
  END AS popularity_tier
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products g
LEFT JOIN &amp;quot;glue-catalog&amp;quot;.ecommerce.order_items oi ON g.product_id = oi.product_id
GROUP BY g.product_id, g.product_name, g.category, g.price;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To document the view for both human users and the AI Agent, open it in the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt;, then use &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Ask &amp;quot;Which electronics products are best sellers?&amp;quot; and the AI Agent generates SQL from your semantic layer. The wiki descriptions you&apos;ve attached to views guide the Agent&apos;s understanding of terms like &amp;quot;best seller&amp;quot; and &amp;quot;popularity tier.&amp;quot;&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude and ChatGPT to your Glue-cataloged data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A product manager asks ChatGPT &amp;quot;Show me niche electronics products with high ratings that might be under-marketed&amp;quot; and gets governed results from your Glue lakehouse.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate product descriptions from catalog data
SELECT
  product_name,
  category,
  price,
  AI_GENERATE(
    &apos;Write a one-sentence marketing description for this product&apos;,
    &apos;Product: &apos; || product_name || &apos;, Category: &apos; || category || &apos;, Price: $&apos; || CAST(price AS VARCHAR) || &apos;, Popularity: &apos; || popularity_tier
  ) AS marketing_description
FROM analytics.gold.product_performance
WHERE popularity_tier = &apos;Best Seller&apos;;

-- Classify inventory risk
SELECT
  g.product_name,
  g.inventory_count,
  AI_CLASSIFY(
    &apos;Based on inventory levels and sales velocity, classify the reorder urgency&apos;,
    &apos;Product: &apos; || g.product_name || &apos;, Stock: &apos; || CAST(g.inventory_count AS VARCHAR) || &apos;, Units Sold (7d): &apos; || CAST(pp.units_sold AS VARCHAR),
    ARRAY[&apos;Order Now&apos;, &apos;Order Soon&apos;, &apos;Adequate Stock&apos;, &apos;Overstocked&apos;]
  ) AS reorder_urgency
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products g
JOIN analytics.gold.product_performance pp ON g.product_id = pp.product_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Create Reflections on product performance and daily summary views to cache results and serve BI tools with sub-second response times:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, navigate to the view you want to accelerate&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; to cache the full view or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; to pre-compute specific SUM/COUNT/AVG aggregations&lt;/li&gt;
&lt;li&gt;Select columns to include in the Reflection&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — how often Dremio re-queries the underlying Iceberg tables to update the cache&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dashboard queries from Tableau, Power BI, or Looker connected via Arrow Flight hit the Reflection instead of re-reading S3 Iceberg files, providing sub-second response times even for complex aggregations.&lt;/p&gt;
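&lt;p&gt;The same steps can be scripted in SQL; a sketch of an aggregation Reflection on the gold view (dimension and measure choices are illustrative, and DDL syntax may vary by Dremio version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Pre-compute the aggregations dashboards actually ask for
ALTER TABLE analytics.gold.product_performance
CREATE AGGREGATE REFLECTION product_perf_agg
USING DIMENSIONS (category, popularity_tier)
MEASURES (revenue (SUM), units_sold (SUM, COUNT));
&lt;/code&gt;&lt;/pre&gt;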
&lt;h2&gt;Time Travel on Glue Iceberg Tables&lt;/h2&gt;
&lt;p&gt;Iceberg tables cataloged in Glue support time travel through Dremio:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query the table as of a specific point in time
SELECT product_id, price, inventory_count
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products
AT TIMESTAMP &apos;2024-06-01 00:00:00&apos;;

-- Compare current state to a historical snapshot
SELECT
  curr.product_name,
  curr.price AS current_price,
  hist.price AS previous_price,
  ROUND((curr.price - hist.price) / hist.price * 100, 2) AS price_change_pct
FROM &amp;quot;glue-catalog&amp;quot;.ecommerce.products curr
JOIN &amp;quot;glue-catalog&amp;quot;.ecommerce.products AT TIMESTAMP &apos;2024-01-01 00:00:00&apos; hist
  ON curr.product_id = hist.product_id
WHERE curr.price != hist.price
ORDER BY ABS((curr.price - hist.price) / hist.price) DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Time travel is valuable for auditing (&amp;quot;What were inventory levels at quarter end?&amp;quot;), debugging (&amp;quot;What changed in the last 24 hours?&amp;quot;), and compliance (&amp;quot;Show data as it was on the regulatory reporting date&amp;quot;).&lt;/p&gt;
&lt;h2&gt;Governance on Glue Data&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance capabilities that Glue and Athena don&apos;t provide natively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Hide sensitive fields (customer PII, pricing details) from specific roles while allowing full access for authorized users. For example, mask &lt;code&gt;customer_email&lt;/code&gt; for marketing analysts but show it for customer support teams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Automatically filter data based on the querying user&apos;s role. A regional manager sees only their region&apos;s data. A global admin sees everything.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; The same governance policies apply whether data comes from Glue, PostgreSQL, Snowflake, or any other connected source.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across all access methods — SQL Runner, BI tools via Arrow Flight/ODBC, AI Agent queries, and MCP Server interactions.&lt;/p&gt;
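&lt;p&gt;For instance, row-level filtering is expressed as a boolean UDF attached as a policy; the table, column, and role names below are hypothetical, and syntax may vary by version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Regional managers see only rows for regions whose role they hold
CREATE FUNCTION restrict_region (region VARCHAR)
RETURNS BOOLEAN
RETURN SELECT is_member(&apos;global-admin&apos;) OR is_member(region);

ALTER TABLE &amp;quot;glue-catalog&amp;quot;.ecommerce.orders
ADD ROW ACCESS POLICY restrict_region (region);
&lt;/code&gt;&lt;/pre&gt;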
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Arrow Flight connector provides 10-100x faster data transfer compared to JDBC/ODBC for BI tools. After creating views over your Glue data, connect:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector, enter your Dremio Cloud endpoint&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use the Dremio ODBC driver or Arrow Flight connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; client for high-speed data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use the &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries from these tools benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Glue-cataloged data directly from their IDE. Ask Copilot &amp;quot;Show me product inventory trends from the Glue catalog&amp;quot; and it generates SQL using Dremio&apos;s semantic layer — all without leaving your development environment.&lt;/p&gt;
&lt;h2&gt;Glue vs. Athena vs. Dremio: When to Use Each&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;AWS Glue&lt;/th&gt;
&lt;th&gt;Amazon Athena&lt;/th&gt;
&lt;th&gt;Dremio Cloud&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metadata catalog&lt;/td&gt;
&lt;td&gt;Serverless SQL&lt;/td&gt;
&lt;td&gt;Federated analytics + catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free (metadata)&lt;/td&gt;
&lt;td&gt;Per TB scanned&lt;/td&gt;
&lt;td&gt;Compute-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Write support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Via ETL jobs&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Full DML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Federation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Federated queries (limited)&lt;/td&gt;
&lt;td&gt;Full cross-source federation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;AI Agent, MCP, SQL Functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reflections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (automatic caching)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IAM only&lt;/td&gt;
&lt;td&gt;IAM + Lake Formation&lt;/td&gt;
&lt;td&gt;FGAC + semantic layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Glue is the metadata catalog. Athena is a query engine with per-TB pricing. Dremio is a federated platform that uses Glue as one of many catalogs and adds AI, governance, and performance acceleration.&lt;/p&gt;
&lt;h2&gt;When to Keep Tables in Glue vs. Use Dremio&apos;s Open Catalog&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Glue:&lt;/strong&gt; Tables managed by existing AWS-native pipelines (EMR, Glue ETL), tables shared across multiple AWS services, data consumed by Athena or Redshift Spectrum alongside Dremio.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use Dremio&apos;s Open Catalog:&lt;/strong&gt; New analytical tables, data created through Dremio transformations, datasets where you want zero-configuration automatic maintenance (compaction, vacuuming, Autonomous Reflections).&lt;/p&gt;
&lt;p&gt;You can use both simultaneously — Glue for your existing AWS lakehouse, Dremio&apos;s Open Catalog for new analytical workloads.&lt;/p&gt;
&lt;h2&gt;Dremio vs. Athena for Querying Glue-Managed Tables&lt;/h2&gt;
&lt;p&gt;Both Dremio and Athena can query tables registered in the Glue Data Catalog. Key differences:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Dremio Cloud&lt;/th&gt;
&lt;th&gt;Amazon Athena&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pricing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compute-based&lt;/td&gt;
&lt;td&gt;$5/TB scanned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reflections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Cache results&lt;/td&gt;
&lt;td&gt;❌ Scans every time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Federation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL, MongoDB, BigQuery, etc.&lt;/td&gt;
&lt;td&gt;S3 + federated queries (limited)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Natural language queries&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP Server&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Claude/ChatGPT integration&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BI Tool Connectivity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Arrow Flight (10-100x faster)&lt;/td&gt;
&lt;td&gt;ODBC/JDBC only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column masking + row filtering&lt;/td&gt;
&lt;td&gt;Lake Formation policies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Iceberg Write Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full DML&lt;/td&gt;
&lt;td&gt;Full DML&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For organizations already using Athena, Dremio adds federation, AI analytics, and cost savings through Reflections. Many teams run both: Athena for quick ad-hoc S3 queries, Dremio for cross-source analytics and BI tool serving.&lt;/p&gt;
&lt;h2&gt;AWS Lake Formation Integration&lt;/h2&gt;
&lt;p&gt;AWS Lake Formation provides fine-grained access control for Glue-managed tables. When connecting to Glue through Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Lake Formation permissions&lt;/strong&gt; govern which tables and columns the Dremio IAM role can access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio FGAC&lt;/strong&gt; adds additional governance layers (column masking, row-level filtering) on top of Lake Formation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Both layers work together:&lt;/strong&gt; Lake Formation controls what Dremio can see; Dremio FGAC controls what individual users see&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This dual-layer governance model gives you AWS-native access control at the storage level and Dremio-managed access control at the query level — comprehensive governance without compromising on either side.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;AWS Glue Data Catalog users can query, write, optimize, and AI-enrich their Iceberg tables through Dremio Cloud — with federation, governance, and performance acceleration that Athena and EMR don&apos;t provide.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-aws-glue-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Glue catalog.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Apache Druid to Dremio Cloud: Add SQL Joins, AI, and Governance to Your Real-Time Analytics</title><link>https://iceberglakehouse.com/posts/2026-03-connector-apache-druid/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-apache-druid/</guid><description>
Apache Druid is a real-time analytics database designed for sub-second queries on high-ingestion-rate event data. Clickstream analytics, application ...</description><pubDate>Sun, 01 Mar 2026 23:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Apache Druid is a real-time analytics database designed for sub-second queries on high-ingestion-rate event data. Clickstream analytics, application monitoring, IoT telemetry, and ad-tech workloads rely on Druid&apos;s columnar storage and inverted indexes for instantaneous queries.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Druid as a federated data source, giving you the ability to join Druid event data with relational databases, data lakes, and cloud warehouses. Dremio adds governance (column masking, row-level filtering), Reflection-based acceleration, and AI capabilities (AI Agent, MCP Server, AI SQL Functions) that Druid doesn&apos;t provide natively.&lt;/p&gt;
&lt;p&gt;Druid excels at one thing: fast aggregation queries on time-series event data. But production analytics rarely involve just one data source. When a product manager asks &amp;quot;Show me user engagement metrics correlated with support ticket volume and revenue impact,&amp;quot; that query requires joining Druid&apos;s event data with a CRM database and a financial system. Druid can&apos;t do this natively: it offers only limited join support between datasources, doesn&apos;t connect to external databases, and its query model is optimized for aggregations on its own ingested segments rather than cross-source, enriched analytics.&lt;/p&gt;
&lt;p&gt;Dremio bridges this gap by reading Druid data and joining it with any other connected source in a single SQL query. You keep Druid&apos;s speed for real-time aggregations and gain Dremio&apos;s federation, Reflection-based acceleration, governance, and AI analytics.&lt;/p&gt;
&lt;h2&gt;Why Druid Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;SQL Joins with Real-Time Data&lt;/h3&gt;
&lt;p&gt;Druid&apos;s native join support is limited (broadcast and lookup joins between its own datasources), and it can&apos;t join to external systems at all. If you want to answer &amp;quot;What is the conversion rate by customer segment in the last hour?&amp;quot; you need the real-time event data from Druid and the customer segment data from your CRM database. Without Dremio, you&apos;d have to either pre-join the data before ingesting into Druid (losing flexibility) or build application code that queries both systems and merges results in memory.&lt;/p&gt;
&lt;p&gt;Dremio queries Druid for its real-time aggregations and joins the results with PostgreSQL customer data, S3 behavior logs, Snowflake revenue data, or any other connected source — all in a single SQL query.&lt;/p&gt;
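&lt;p&gt;As a sketch of that conversion-rate question (source and column names follow the examples later in this post; the boolean &lt;code&gt;converted&lt;/code&gt; event flag is an assumption):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Conversion rate by customer segment over the last hour
SELECT
  c.user_segment,
  COUNT(DISTINCT d.user_id) AS visitors,
  COUNT(DISTINCT CASE WHEN d.converted THEN d.user_id END) AS conversions,
  ROUND(COUNT(DISTINCT CASE WHEN d.converted THEN d.user_id END) * 100.0
    / COUNT(DISTINCT d.user_id), 2) AS conversion_rate_pct
FROM &amp;quot;druid-realtime&amp;quot;.druid.pageviews d
LEFT JOIN &amp;quot;postgres-crm&amp;quot;.public.users c ON d.user_id = c.user_id
WHERE d.__time &amp;gt; CURRENT_TIMESTAMP - INTERVAL &apos;1&apos; HOUR
GROUP BY c.user_segment;
&lt;/code&gt;&lt;/pre&gt;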
&lt;h3&gt;Enrich Real-Time Metrics with Business Context&lt;/h3&gt;
&lt;p&gt;Druid provides fast counts, averages, percentiles, and approximate distinct counts on event data. But enriching those metrics with customer names, product descriptions, geographic hierarchies, or organizational data requires joining with dimensional data that lives in other systems.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s federation provides that enrichment without duplicating dimensional data into Druid. Your Druid segments stay lean (just events), and Dremio handles the enrichment at query time.&lt;/p&gt;
&lt;h3&gt;Historical Analysis Across Time Ranges&lt;/h3&gt;
&lt;p&gt;Druid is optimized for recent data (hot segments). Historical analysis across months or years — trend analysis, year-over-year comparisons — often hits cold segments that are slower to query. Dremio&apos;s Reflections cache aggregated historical results, providing fast access to time-series trends without depending on Druid&apos;s tiered storage.&lt;/p&gt;
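&lt;p&gt;A daily rollup view, for example, gives Dremio a natural target for an Aggregation Reflection, so trend dashboards stop scanning cold segments (table names follow the examples in this post):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Daily trend view; back it with an Aggregation Reflection
CREATE VIEW analytics.gold.daily_pageviews AS
SELECT
  DATE_TRUNC(&apos;day&apos;, __time) AS event_day,
  page,
  COUNT(*) AS page_views,
  COUNT(DISTINCT user_id) AS unique_visitors
FROM &amp;quot;druid-realtime&amp;quot;.druid.pageviews
GROUP BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;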
&lt;h3&gt;Unified Governance&lt;/h3&gt;
&lt;p&gt;Druid has basic authentication but limited access control. There&apos;s no column masking, no row-level filtering, no consistent policy framework. Dremio&apos;s Fine-Grained Access Control adds these capabilities, ensuring that sensitive event data (user IDs, IP addresses, location data) is properly governed across Druid and every other connected source.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Druid Broker hostname or IP address&lt;/strong&gt; — the Broker node handles query routing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; — typically &lt;code&gt;8082&lt;/code&gt; for the Broker HTTP API&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; from Dremio Cloud to the Druid Broker&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-apache-druid-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Druid to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;Apache Druid&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;druid-realtime&lt;/code&gt; or &lt;code&gt;event-analytics&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Druid Broker hostname or IP.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;8082&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Configure credentials if your Druid deployment requires authentication.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Settings&lt;/h3&gt;
&lt;p&gt;Set Reflection Refresh, Metadata refresh intervals, and any connection properties.&lt;/p&gt;
&lt;h3&gt;5. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Grant access to the users or roles who should see the source, then click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Real-Time Druid Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Real-time page view metrics
SELECT
  DATE_TRUNC(&apos;hour&apos;, __time) AS event_hour,
  page,
  COUNT(*) AS page_views,
  COUNT(DISTINCT user_id) AS unique_visitors,
  ROUND(COUNT(*) * 1.0 / COUNT(DISTINCT user_id), 2) AS views_per_visitor
FROM &amp;quot;druid-realtime&amp;quot;.druid.pageviews
WHERE __time &amp;gt; CURRENT_TIMESTAMP - INTERVAL &apos;24&apos; HOUR
GROUP BY 1, 2
ORDER BY page_views DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate: Enrich Real-Time Data with Business Context&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Druid real-time events with PostgreSQL user segments and S3 product data
SELECT
  d.event_hour,
  c.user_segment,
  p.product_category,
  SUM(d.page_views) AS total_views,
  COUNT(DISTINCT d.user_id) AS unique_users,
  CASE
    WHEN c.user_segment = &apos;Enterprise&apos; THEN ROUND(SUM(d.page_views) * 2.5, 2)
    WHEN c.user_segment = &apos;Pro&apos; THEN ROUND(SUM(d.page_views) * 1.5, 2)
    ELSE ROUND(SUM(d.page_views) * 0.5, 2)
  END AS estimated_value
FROM (
  SELECT
    DATE_TRUNC(&apos;hour&apos;, __time) AS event_hour,
    user_id,
    page,
    COUNT(*) AS page_views
  FROM &amp;quot;druid-realtime&amp;quot;.druid.pageviews
  WHERE __time &amp;gt; CURRENT_TIMESTAMP - INTERVAL &apos;24&apos; HOUR
  GROUP BY 1, 2, 3
) d
LEFT JOIN &amp;quot;postgres-crm&amp;quot;.public.users c ON d.user_id = c.user_id
LEFT JOIN &amp;quot;s3-catalog&amp;quot;.products.page_mappings p ON d.page = p.page_url
GROUP BY d.event_hour, c.user_segment, p.product_category
ORDER BY estimated_value DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Druid handles the real-time event aggregation, PostgreSQL provides user context, S3 maps pages to products, and Dremio joins everything.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer Over Real-Time Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.realtime_engagement AS
SELECT
  DATE_TRUNC(&apos;hour&apos;, __time) AS event_hour,
  page,
  COUNT(*) AS page_views,
  COUNT(DISTINCT user_id) AS unique_visitors,
  ROUND(COUNT(*) * 1.0 / COUNT(DISTINCT user_id), 2) AS views_per_visitor,
  CASE
    WHEN COUNT(*) &amp;gt; 10000 THEN &apos;Trending&apos;
    WHEN COUNT(*) &amp;gt; 1000 THEN &apos;Active&apos;
    WHEN COUNT(*) &amp;gt; 100 THEN &apos;Normal&apos;
    ELSE &apos;Low Traffic&apos;
  END AS traffic_tier
FROM &amp;quot;druid-realtime&amp;quot;.druid.pageviews
WHERE __time &amp;gt; CURRENT_TIMESTAMP - INTERVAL &apos;7&apos; DAY
GROUP BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon), and click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Create descriptions like &amp;quot;realtime_engagement: Hourly page view metrics from the real-time clickstream, classified by traffic tier.&amp;quot;&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Real-Time Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask questions about real-time event data in plain English. Instead of writing complex time-window SQL, a product manager asks &amp;quot;Which pages are trending in the last 6 hours?&amp;quot; or &amp;quot;What&apos;s the average engagement per visitor for enterprise users today?&amp;quot; The Agent reads your wiki descriptions and generates accurate SQL.&lt;/p&gt;
&lt;p&gt;This is particularly valuable for Druid data because time-series queries can be complex — date truncation, windowing, and aggregation syntax varies. The AI Agent handles this complexity automatically.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude, ChatGPT, and other AI clients to your Dremio data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A marketing team lead can ask Claude &amp;quot;Show me our highest-traffic pages from Druid data in the last 24 hours, broken down by user segment&amp;quot; and get real-time insights without writing SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI to classify and analyze real-time event patterns:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify page traffic patterns with AI
SELECT
  page,
  page_views,
  unique_visitors,
  AI_CLASSIFY(
    &apos;Based on this web traffic pattern, classify the likely content type&apos;,
    &apos;Page: &apos; || page || &apos;, Views: &apos; || CAST(page_views AS VARCHAR) || &apos;, Unique visitors: &apos; || CAST(unique_visitors AS VARCHAR) || &apos;, Views per visitor: &apos; || CAST(views_per_visitor AS VARCHAR),
    ARRAY[&apos;Product Page&apos;, &apos;Blog Content&apos;, &apos;Landing Page&apos;, &apos;Documentation&apos;, &apos;Support&apos;]
  ) AS inferred_content_type
FROM analytics.gold.realtime_engagement
WHERE traffic_tier = &apos;Trending&apos;;

-- Generate real-time traffic summaries
SELECT
  event_hour,
  AI_GENERATE(
    &apos;Write a brief traffic summary for this hour&apos;,
    &apos;Hour: &apos; || CAST(event_hour AS VARCHAR) || &apos;, Total Views: &apos; || CAST(SUM(page_views) AS VARCHAR) || &apos;, Unique Visitors: &apos; || CAST(SUM(unique_visitors) AS VARCHAR) || &apos;, Trending Pages: &apos; || CAST(COUNT(CASE WHEN traffic_tier = &apos;Trending&apos; THEN 1 END) AS VARCHAR)
  ) AS hourly_summary
FROM analytics.gold.realtime_engagement
GROUP BY event_hour
ORDER BY event_hour DESC
LIMIT 24;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Accelerate with Reflections&lt;/h2&gt;
&lt;p&gt;For historical aggregations over Druid data, create Reflections:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build a view that aggregates Druid data by day/hour/week&lt;/li&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt; and click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — for real-time data, hourly; for historical trends, daily&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
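&lt;p&gt;If you prefer SQL over the UI, a reflection can also be defined with &lt;code&gt;ALTER DATASET&lt;/code&gt; (a sketch against the &lt;code&gt;realtime_engagement&lt;/code&gt; view created earlier; check the reflection DDL syntax for your Dremio version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER DATASET analytics.gold.realtime_engagement
CREATE AGGREGATE REFLECTION engagement_by_hour
USING
  DIMENSIONS (event_hour, page)
  MEASURES (page_views (SUM), unique_visitors (SUM));
&lt;/code&gt;&lt;/pre&gt;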
&lt;p&gt;Dashboard queries for &amp;quot;last 30 days&amp;quot; or &amp;quot;year-over-year&amp;quot; hit the Reflection instead of scanning Druid&apos;s cold segments. Real-time queries for &amp;quot;last hour&amp;quot; still go directly to Druid for sub-second latency.&lt;/p&gt;
&lt;h2&gt;Governance on Real-Time Data&lt;/h2&gt;
&lt;p&gt;Druid has basic authentication but no column masking or row-level filtering. Dremio&apos;s Fine-Grained Access Control (FGAC) adds these capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask user IDs, IP addresses, and location data from specific roles. A product manager sees engagement metrics but not individual user data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict real-time data access by team or region. A regional marketing team sees only their region&apos;s clickstream.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across Druid, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
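&lt;p&gt;Column masking in Dremio is a UDF attached as a masking policy. A minimal sketch (the role name, redaction logic, and &lt;code&gt;raw_pageviews&lt;/code&gt; view are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Only the analytics_admin role sees raw user IDs
CREATE FUNCTION protect_user_id (user_id VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE
  WHEN is_member(&apos;analytics_admin&apos;) THEN user_id
  ELSE &apos;REDACTED&apos;
END;

ALTER VIEW analytics.silver.raw_pageviews
  MODIFY COLUMN user_id SET MASKING POLICY protect_user_id (user_id);
&lt;/code&gt;&lt;/pre&gt;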
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight can transfer data 10-100x faster than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access to real-time dashboards&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access to event data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations on event data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Druid data from their IDE. Ask Copilot &amp;quot;Show me trending pages from Druid in the last 6 hours&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Druid vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Druid:&lt;/strong&gt; Real-time event streams that need sub-second query latency, high-ingestion-rate data (thousands of events per second), data that powers real-time operational dashboards with sub-second SLAs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical event archives older than 30-90 days, data that needs rich SQL joins (beyond Druid&apos;s limited native support), analytics that combine events with dimensional data, data consumed by BI tools that expect standard SQL, archival data for compliance and auditing.&lt;/p&gt;
&lt;p&gt;For active Druid data, create manual Reflections with refresh schedules that balance freshness and performance. For migrated Iceberg data in Dremio&apos;s Open Catalog, you get automated compaction, Autonomous Reflections, and significantly lower storage costs.&lt;/p&gt;
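&lt;p&gt;The migration itself can be a CTAS from the Druid source into an Iceberg table (a sketch; the source path and 30-day cutoff are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- One-time backfill: copy events older than 30 days into an Iceberg archive
CREATE TABLE analytics.silver.events_archive AS
SELECT event_type, user_id, event_timestamp
FROM &amp;quot;druid-cluster&amp;quot;.clickstream.events
WHERE event_timestamp &amp;lt; CURRENT_TIMESTAMP - INTERVAL &apos;30&apos; DAY;
&lt;/code&gt;&lt;/pre&gt;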
&lt;h2&gt;Real-Time Tiering Strategy&lt;/h2&gt;
&lt;p&gt;Combine Druid&apos;s real-time capabilities with Dremio&apos;s historical analysis:&lt;/p&gt;
&lt;h3&gt;Tier 1: Real-Time (Druid — 0 to 24 hours)&lt;/h3&gt;
&lt;p&gt;Druid ingests and serves sub-second queries on live event data. Dremio queries Druid directly for &amp;quot;last hour&amp;quot; or &amp;quot;last 6 hours&amp;quot; dashboards.&lt;/p&gt;
&lt;h3&gt;Tier 2: Recent Historical (Iceberg — 1 to 90 days)&lt;/h3&gt;
&lt;p&gt;Daily batch jobs move yesterday&apos;s data from Druid into Iceberg tables in Dremio&apos;s Open Catalog. Analytical queries for &amp;quot;last 30 days&amp;quot; hit Iceberg tables with Autonomous Reflections.&lt;/p&gt;
&lt;h3&gt;Tier 3: Long-Term Archive (Iceberg — 90+ days)&lt;/h3&gt;
&lt;p&gt;Older data stays in Iceberg cold storage (S3 Infrequent Access). Compliance and audit queries use time travel against archived snapshots.&lt;/p&gt;
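&lt;p&gt;With the archive in Iceberg, querying a historical state is a one-liner using Dremio&apos;s time travel syntax (the timestamp is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query the archive exactly as it existed at a point in time
SELECT COUNT(*) AS archived_events
FROM analytics.silver.events_archive
AT TIMESTAMP &apos;2026-01-01 00:00:00.000&apos;;
&lt;/code&gt;&lt;/pre&gt;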
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Dremio view that combines real-time and historical data
CREATE VIEW analytics.gold.unified_events AS
SELECT event_type, user_id, event_timestamp, &apos;real-time&apos; AS data_tier
FROM &amp;quot;druid-cluster&amp;quot;.clickstream.events
WHERE event_timestamp &amp;gt;= CURRENT_TIMESTAMP - INTERVAL &apos;24&apos; HOUR
UNION ALL
SELECT event_type, user_id, event_timestamp, &apos;historical&apos; AS data_tier
FROM analytics.silver.events_archive
WHERE event_timestamp &amp;lt; CURRENT_TIMESTAMP - INTERVAL &apos;24&apos; HOUR
  AND event_timestamp &amp;gt;= CURRENT_TIMESTAMP - INTERVAL &apos;90&apos; DAY;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Event Pipeline Integration&lt;/h2&gt;
&lt;p&gt;Common Druid deployment patterns that work with Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Kafka → Druid → Dremio:&lt;/strong&gt; Real-time events flow through Kafka into Druid. Dremio queries Druid for analytics and joins with slow-changing dimensional data from PostgreSQL or S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kafka → Druid + S3:&lt;/strong&gt; Events land in both Druid (real-time) and S3 (archive). Dremio queries both seamlessly through federation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kinesis → Druid → Dremio:&lt;/strong&gt; AWS-native pattern where Kinesis streams feed Druid, and Dremio provides multi-source analytics over streamed data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Apache Druid users can extend their real-time analytics with cross-source joins, AI-powered insights, enterprise governance, and Reflection-based acceleration — all through Dremio Cloud.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-apache-druid-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Druid cluster.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect MongoDB to Dremio Cloud: SQL Analytics on Document Data</title><link>https://iceberglakehouse.com/posts/2026-03-connector-mongodb/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-mongodb/</guid><description>
MongoDB is the most popular NoSQL document database. It stores data in flexible JSON-like documents, making it ideal for applications with evolving s...</description><pubDate>Sun, 01 Mar 2026 22:00:00 GMT</pubDate><content:encoded>&lt;p&gt;MongoDB is the most popular NoSQL document database. It stores data in flexible JSON-like documents, making it ideal for applications with evolving schemas — user profiles, product catalogs, IoT sensor data, and content management systems. But MongoDB&apos;s document model creates analytics challenges: you can&apos;t run SQL joins natively, aggregation pipelines are complex, and connecting MongoDB data to relational sources requires custom application code or ETL.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to MongoDB and exposes its collections as SQL-queryable tables. Nested documents appear as structured columns, and you can join MongoDB data with relational databases, data lakes, and cloud warehouses using standard SQL.&lt;/p&gt;
&lt;h2&gt;Why MongoDB Users Need Dremio&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;SQL on documents.&lt;/strong&gt; MongoDB&apos;s query language (MQL) is powerful but different from SQL. Your analysts know SQL. Dremio transforms MongoDB collections into SQL-queryable tables, so analysts don&apos;t need to learn MQL or write aggregation pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Join documents with relational data.&lt;/strong&gt; Your user profiles are in MongoDB, your order data is in PostgreSQL, and your marketing data is in S3. Without Dremio, combining these requires application code that queries each system separately and merges results in memory. Dremio federates all three in a single SQL query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Flatten nested structures.&lt;/strong&gt; MongoDB documents often contain nested objects and arrays. Dremio&apos;s &lt;code&gt;FLATTEN&lt;/code&gt; function expands arrays into rows, and nested objects become addressable columns (e.g., &lt;code&gt;address.city&lt;/code&gt;, &lt;code&gt;preferences.theme&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistent governance.&lt;/strong&gt; MongoDB has authentication and roles, but they don&apos;t extend to other data sources. Dremio&apos;s FGAC applies consistent column masking and row filtering across MongoDB and all other connected sources.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI analytics.&lt;/strong&gt; MongoDB&apos;s unstructured nature makes it difficult for AI tools to query directly. Dremio&apos;s semantic layer creates structured views with business context, enabling the AI Agent to answer questions about MongoDB data.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MongoDB hostname or IP address&lt;/strong&gt; (or MongoDB Atlas connection string)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; — default &lt;code&gt;27017&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name(s)&lt;/strong&gt; — MongoDB databases you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; (with &lt;code&gt;read&lt;/code&gt; role on target databases)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — port 27017 open to Dremio Cloud. For MongoDB Atlas, add Dremio&apos;s IP range to the Atlas IP Access List&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-mongodb-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Connect MongoDB to Dremio Cloud&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;MongoDB&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enter &lt;strong&gt;Name&lt;/strong&gt;, &lt;strong&gt;Host&lt;/strong&gt;, &lt;strong&gt;Port&lt;/strong&gt; (27017).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication Type:&lt;/strong&gt; Choose Standard (username/password) or No Authentication.&lt;/li&gt;
&lt;li&gt;Configure &lt;strong&gt;Advanced Options&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use SSL:&lt;/strong&gt; Enable for MongoDB Atlas or SSL-configured instances.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auth Database:&lt;/strong&gt; The database used for authentication (default: &lt;code&gt;admin&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Read preference:&lt;/strong&gt; Control whether queries hit primary or secondary replicas (&lt;code&gt;primary&lt;/code&gt;, &lt;code&gt;primaryPreferred&lt;/code&gt;, &lt;code&gt;secondary&lt;/code&gt;, &lt;code&gt;secondaryPreferred&lt;/code&gt;, &lt;code&gt;nearest&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Subpartition size:&lt;/strong&gt; Controls how Dremio partitions large collections for parallel reads.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Configure &lt;strong&gt;Reflection Refresh&lt;/strong&gt; and &lt;strong&gt;Metadata&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;Privileges&lt;/strong&gt; and &lt;strong&gt;Save&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Query MongoDB Data with SQL&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query a MongoDB collection as a SQL table
SELECT user_id, name, email, signup_date
FROM &amp;quot;mongo-users&amp;quot;.app.users
WHERE signup_date &amp;gt; &apos;2024-01-01&apos;
ORDER BY signup_date DESC;

-- Access nested fields
SELECT
  user_id,
  name,
  address.city AS city,
  address.state AS state,
  preferences.theme AS ui_theme
FROM &amp;quot;mongo-users&amp;quot;.app.users
WHERE address.state = &apos;CA&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Flatten Nested Arrays&lt;/h2&gt;
&lt;p&gt;MongoDB documents frequently contain arrays. Use &lt;code&gt;FLATTEN&lt;/code&gt; to expand them into rows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- If each user document has an orders array
SELECT
  user_id,
  name,
  o.order_id,
  o.total_amount,
  o.order_date
FROM (
  -- FLATTEN runs in the SELECT list, producing one row per array element
  SELECT user_id, name, FLATTEN(orders) AS o
  FROM &amp;quot;mongo-users&amp;quot;.app.users
) flattened
WHERE o.total_amount &amp;gt; 100
ORDER BY o.order_date DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate MongoDB with Relational Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join MongoDB user profiles with PostgreSQL orders and S3 analytics
SELECT
  m.name AS customer_name,
  m.address.city AS city,
  COUNT(pg.order_id) AS total_orders,
  SUM(pg.amount) AS total_spent,
  COUNT(s3.event_id) AS engagement_events
FROM &amp;quot;mongo-users&amp;quot;.app.users m
LEFT JOIN &amp;quot;postgres-orders&amp;quot;.public.orders pg ON m.user_id = pg.customer_id
LEFT JOIN &amp;quot;s3-events&amp;quot;.clickstream.events s3 ON m.user_id = s3.user_id
GROUP BY m.name, m.address.city
ORDER BY total_spent DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_profile AS
SELECT
  m.user_id,
  m.name,
  m.email,
  m.address.city AS city,
  m.address.state AS state,
  m.signup_date,
  CASE
    WHEN m.subscription.tier = &apos;premium&apos; THEN &apos;Premium&apos;
    WHEN m.subscription.tier = &apos;pro&apos; THEN &apos;Pro&apos;
    ELSE &apos;Free&apos;
  END AS subscription_tier
FROM &amp;quot;mongo-users&amp;quot;.app.users m;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon), go to the &lt;strong&gt;Details&lt;/strong&gt; tab, and click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Dremio&apos;s generative AI samples the view schema and data to produce descriptions like: &amp;quot;customer_profile: Contains one row per user combining profile data from MongoDB with subscription tier classification.&amp;quot; Review and refine these descriptions — add business context like &amp;quot;Premium subscribers qualify for the dedicated support tier and priority feature access.&amp;quot;&lt;/p&gt;
&lt;p&gt;These wikis and labels are the context that powers Dremio&apos;s AI capabilities.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on MongoDB Data&lt;/h2&gt;
&lt;p&gt;MongoDB&apos;s flexible document model makes it notoriously difficult for AI tools to query directly — nested objects, variable schemas, and BSON types create barriers. Dremio&apos;s semantic layer solves this by creating structured, well-documented views over MongoDB data that AI tools can understand and query accurately.&lt;/p&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets business users ask questions about MongoDB data in plain English. Instead of learning MongoDB&apos;s aggregation framework or SQL with nested field syntax, a product manager asks &amp;quot;How many Premium subscribers are in California?&amp;quot; and the Agent generates the correct SQL using your semantic layer.&lt;/p&gt;
&lt;p&gt;The Agent reads the wiki descriptions you attached to views to understand what &amp;quot;Premium&amp;quot; means in your data (subscription.tier = &apos;premium&apos;), what &amp;quot;California&amp;quot; maps to (address.state = &apos;CA&apos;), and which view to query. Better wikis produce more accurate AI responses.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; extends AI capabilities to external chat clients. Connect Claude or ChatGPT to your MongoDB data through the hosted MCP Server with OAuth authentication:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt; for Claude, &lt;code&gt;https://chatgpt.com/connector_platform_oauth_redirect&lt;/code&gt; for ChatGPT)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt; (US) or &lt;code&gt;mcp.eu.dremio.cloud/mcp/{project_id}&lt;/code&gt; (EU)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now your team can ask Claude &amp;quot;Show me user growth trends by subscription tier from MongoDB data&amp;quot; and get governed, accurate results — without knowing MongoDB query syntax or SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use Dremio&apos;s built-in AI SQL functions to enrich MongoDB data directly in queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify users based on their MongoDB profile data
SELECT
  name,
  subscription_tier,
  city,
  state,
  AI_CLASSIFY(
    &apos;Based on this user profile, classify their likely engagement level&apos;,
    &apos;Name: &apos; || name || &apos;, Subscription: &apos; || subscription_tier || &apos;, City: &apos; || city || &apos;, State: &apos; || state,
    ARRAY[&apos;Highly Engaged&apos;, &apos;Active&apos;, &apos;At Risk&apos;, &apos;Churned&apos;]
  ) AS engagement_prediction
FROM analytics.gold.customer_profile
WHERE subscription_tier IN (&apos;Premium&apos;, &apos;Pro&apos;);

-- Generate personalized outreach messages
SELECT
  name,
  subscription_tier,
  AI_GENERATE(
    &apos;Write a one-sentence personalized upgrade message for this user&apos;,
    &apos;User: &apos; || name || &apos;, Current Tier: &apos; || subscription_tier || &apos;, Location: &apos; || city || &apos;, &apos; || state
  ) AS upgrade_message
FROM analytics.gold.customer_profile
WHERE subscription_tier = &apos;Free&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; categorizes users based on profile attributes. &lt;code&gt;AI_GENERATE&lt;/code&gt; creates personalized text. Both run inline in your SQL queries, enriching MongoDB data with AI in real time.&lt;/p&gt;
&lt;h2&gt;Accelerate MongoDB Analytics with Reflections&lt;/h2&gt;
&lt;p&gt;MongoDB isn&apos;t designed for heavy analytical workloads. Running 50 dashboard queries per hour against MongoDB competes with your application&apos;s read/write operations. Create Reflections on your MongoDB views to cache results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the Catalog&lt;/li&gt;
&lt;li&gt;Create a Reflection with the columns and aggregations used most&lt;/li&gt;
&lt;li&gt;Set the refresh interval (e.g., every 30 minutes for near-real-time, hourly for daily reporting)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected to Dremio via Arrow Flight or ODBC get sub-second response times from Reflections — MongoDB handles zero analytical load.&lt;/p&gt;
&lt;h2&gt;MongoDB-Specific Considerations&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Schema sampling.&lt;/strong&gt; MongoDB is schema-less — each document can have different fields. Dremio samples documents to infer the schema. If your documents have highly variable schemas, some fields might not appear until more documents are sampled. You can increase the sample size in the source configuration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Read preference.&lt;/strong&gt; For MongoDB replica sets, use &lt;code&gt;secondaryPreferred&lt;/code&gt; to route analytical queries to secondary replicas, avoiding impact on your primary node&apos;s CRUD operations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data types.&lt;/strong&gt; MongoDB&apos;s BSON types map to Dremio types: &lt;code&gt;ObjectID&lt;/code&gt; → &lt;code&gt;VARCHAR&lt;/code&gt;, &lt;code&gt;NumberLong&lt;/code&gt; → &lt;code&gt;BIGINT&lt;/code&gt;, &lt;code&gt;NumberInt&lt;/code&gt; → &lt;code&gt;INT&lt;/code&gt;, &lt;code&gt;Date&lt;/code&gt; → &lt;code&gt;TIMESTAMP&lt;/code&gt;. Nested objects become structured columns addressable with dot notation. Arrays can be flattened with &lt;code&gt;FLATTEN&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MongoDB Atlas.&lt;/strong&gt; Add Dremio Cloud&apos;s IP range to your Atlas IP Access List. Enable SSL in the Dremio connection settings. Use the standard connection string hostname (not the SRV hostname).&lt;/p&gt;
&lt;h2&gt;When to Keep Data in MongoDB vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in MongoDB:&lt;/strong&gt; Data your application actively reads and writes, documents with evolving schemas that benefit from MongoDB&apos;s flexibility, operational data where real-time updates matter, data where document-level transactions are important.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical user data, analytics-heavy aggregations, datasets that need SQL joins with relational sources, time-series data you query in aggregate, data consumed primarily by BI tools or AI agents.&lt;/p&gt;
&lt;p&gt;For data that stays in MongoDB, create manual Reflections with refresh schedules matching your data freshness needs. This offloads analytical load from MongoDB while keeping data current. For migrated Iceberg data, Dremio provides automated compaction, time travel, results caching, and Autonomous Reflections.&lt;/p&gt;
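&lt;p&gt;A migration sketch, flattening nested MongoDB fields into flat Iceberg columns on the way out (the target path and cutoff date are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Snapshot historical user profiles from MongoDB into an Iceberg table
CREATE TABLE analytics.silver.users_archive AS
SELECT
  user_id,
  name,
  email,
  address.city AS city,
  address.state AS state,
  signup_date
FROM &amp;quot;mongo-users&amp;quot;.app.users
WHERE signup_date &amp;lt; &apos;2024-01-01&apos;;
&lt;/code&gt;&lt;/pre&gt;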
&lt;h2&gt;Governance on MongoDB Data&lt;/h2&gt;
&lt;p&gt;MongoDB has database-level and collection-level access control, but no column masking or row-level filtering. Dremio&apos;s Fine-Grained Access Control (FGAC) adds these capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask user emails, phone numbers, or payment details from specific roles. A product analyst sees user behavior patterns but not PII.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by user role. A regional team sees only their region&apos;s user data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across MongoDB, PostgreSQL, S3, Snowflake, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
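&lt;p&gt;As an illustration, Dremio expresses masking policies as SQL UDFs attached to columns; the function, role, and view names below are hypothetical, so treat this as a sketch rather than a drop-in policy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Mask user emails for everyone outside the analyst-admin role
CREATE FUNCTION mask_email (email VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE
  WHEN is_member(&apos;analyst-admin&apos;) THEN email
  ELSE &apos;REDACTED&apos;
END;

-- Attach the policy to a view over the MongoDB source
ALTER VIEW analytics.silver.users
  MODIFY COLUMN email SET MASKING POLICY mask_email (email);
&lt;/code&gt;&lt;/pre&gt;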
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector — turns MongoDB documents into tabular data for Tableau&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access to flattened MongoDB data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations on MongoDB data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query MongoDB data from their IDE. Ask Copilot &amp;quot;Show me user signup trends from MongoDB&amp;quot; and get SQL generated using your semantic layer — no aggregation pipeline knowledge needed.&lt;/p&gt;
&lt;h2&gt;Schema Flattening and Nested Documents&lt;/h2&gt;
&lt;p&gt;MongoDB stores data as nested JSON documents. Dremio automatically converts nested structures into queryable columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Top-level fields&lt;/strong&gt; map directly to columns (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nested objects&lt;/strong&gt; use dot notation (&lt;code&gt;address.city&lt;/code&gt;, &lt;code&gt;address.state&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Arrays&lt;/strong&gt; can be flattened using &lt;code&gt;FLATTEN()&lt;/code&gt; to create one row per array element&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Flatten nested order items from MongoDB documents
-- FLATTEN produces one row per array element; struct fields
-- are then addressed with dot notation
SELECT
  o.customer_id,
  o.order_date,
  o.item.item_name,
  o.item.quantity,
  o.item.unit_price
FROM (
  SELECT customer_id, order_date, FLATTEN(items) AS item
  FROM &amp;quot;mongodb-app&amp;quot;.ecommerce.orders
) o;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This SQL approach is simpler than MongoDB&apos;s aggregation pipeline (&lt;code&gt;$unwind&lt;/code&gt;, &lt;code&gt;$lookup&lt;/code&gt;, &lt;code&gt;$group&lt;/code&gt;) for most analytical queries.&lt;/p&gt;
&lt;h2&gt;Dremio vs. MongoDB Atlas Data Federation&lt;/h2&gt;
&lt;p&gt;MongoDB Atlas Data Federation provides SQL-like access to MongoDB data. Key differences:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Dremio Cloud&lt;/th&gt;
&lt;th&gt;Atlas Data Federation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-source joins&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL, S3, Snowflake, etc.&lt;/td&gt;
&lt;td&gt;MongoDB + S3 only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reflections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Cache results&lt;/td&gt;
&lt;td&gt;❌ Query every time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Natural language queries&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Column masking + row filtering&lt;/td&gt;
&lt;td&gt;MongoDB role-based access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BI connectivity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Arrow Flight (10-100x faster)&lt;/td&gt;
&lt;td&gt;ODBC/JDBC only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Views with wiki + tags&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dremio provides a broader analytical platform, while Atlas Data Federation is specific to the MongoDB ecosystem.&lt;/p&gt;
&lt;h2&gt;Document-to-Analytics Pipeline&lt;/h2&gt;
&lt;p&gt;Optimize how MongoDB data flows into analytics:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Source layer:&lt;/strong&gt; Dremio reads MongoDB collections directly — no ETL&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flattened views:&lt;/strong&gt; Create SQL views that flatten nested documents into tabular format&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enrichment:&lt;/strong&gt; Join flattened MongoDB data with relational and data lake sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic layer:&lt;/strong&gt; Create business-ready views with wiki descriptions for AI&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This pipeline runs entirely in SQL, eliminating the need for custom Python/Node.js ETL scripts to extract and transform MongoDB data.&lt;/p&gt;
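&lt;p&gt;The layered views can be sketched as follows; the source names and columns are illustrative, not a fixed convention:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Silver: flatten MongoDB order documents into rows
CREATE VIEW analytics.silver.order_items AS
SELECT customer_id, order_date, FLATTEN(items) AS item
FROM &amp;quot;mongodb-app&amp;quot;.ecommerce.orders;

-- Gold: enrich with a relational CRM source, business-ready columns
CREATE VIEW analytics.gold.order_items_enriched AS
SELECT
  o.customer_id,
  c.customer_segment,
  o.order_date,
  o.item.item_name,
  o.item.quantity
FROM analytics.silver.order_items o
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers c
  ON o.customer_id = c.customer_id;
&lt;/code&gt;&lt;/pre&gt;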
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;MongoDB users can query their document data with SQL, flatten nested structures, join MongoDB with relational databases and data lakes, and enable AI analytics — all without ETL pipelines or learning MongoDB&apos;s aggregation framework.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-mongodb-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your MongoDB instances.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Vertica to Dremio Cloud: Federation for Analytics-Optimized Data</title><link>https://iceberglakehouse.com/posts/2026-03-connector-vertica/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-vertica/</guid><description>
Vertica is a columnar analytics database engineered for fast aggregate queries on large datasets. It was built from the ground up for analytical work...</description><pubDate>Sun, 01 Mar 2026 21:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Vertica is a columnar analytics database engineered for fast aggregate queries on large datasets. It was built from the ground up for analytical workloads — column-oriented storage, massively parallel processing, and automatic database design optimization. Organizations running Vertica typically have years of investment in analytics infrastructure: curated schemas, optimized projections, and sophisticated workloads that depend on Vertica&apos;s high-performance query engine.&lt;/p&gt;
&lt;p&gt;But Vertica has limitations that become more painful as data ecosystems grow. Licensing costs scale with data volume. Federation with non-Vertica sources requires complex ETL. And connecting Vertica data to modern cloud tools, AI platforms, and cross-cloud architectures requires exporting data or building custom connectors.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Vertica and queries it alongside your other data sources. Dremio&apos;s predicate pushdowns leverage Vertica&apos;s columnar engine for filtering and aggregation, while Reflections cache results to reduce ongoing Vertica compute load. You keep Vertica for what it does well and extend its reach to every other system in your organization.&lt;/p&gt;
&lt;h2&gt;Why Vertica Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Reduce Vertica License Costs&lt;/h3&gt;
&lt;p&gt;Vertica&apos;s licensing model ties cost to data volume and node count. Every analytical query consumes cluster resources. As your data grows and more teams want access, the cost of scaling Vertica becomes significant. Dremio&apos;s Reflections provide an alternative: pre-compute the results of your most common queries and serve them from Dremio&apos;s cache instead of hitting Vertica on every request. Dashboard queries, scheduled reports, and ad-hoc exploration can all be served from Reflections, reducing the compute pressure on your Vertica cluster.&lt;/p&gt;
&lt;h3&gt;Federate with Cloud Sources&lt;/h3&gt;
&lt;p&gt;Vertica excels at analytical queries on its own data, but your organization&apos;s data lives in many places: S3 data lakes, PostgreSQL application databases, Snowflake cloud warehouses, MongoDB document stores. Without a federation layer, combining these with Vertica data requires ETL pipelines that extract from each source, transform, and load into Vertica. Dremio queries each source in place and joins the results — no data movement needed.&lt;/p&gt;
&lt;h3&gt;Modernize Without a Big-Bang Migration&lt;/h3&gt;
&lt;p&gt;Migrating away from Vertica is a large, risky project. Dremio lets you gradually shift analytical workloads. Start by querying Vertica through Dremio alongside new cloud-native sources (Apache Iceberg tables, S3 data lakes). As confidence grows, migrate specific datasets from Vertica to Iceberg tables in Dremio&apos;s Open Catalog, where they benefit from automated maintenance and lower storage costs. The migration happens incrementally, and Vertica continues serving critical workloads throughout.&lt;/p&gt;
&lt;h3&gt;Unified Governance&lt;/h3&gt;
&lt;p&gt;Vertica has its own access control, but it doesn&apos;t extend to your other data sources. Dremio&apos;s Fine-Grained Access Control applies consistent column masking and row-level filtering across Vertica, PostgreSQL, S3, and every other connected source from a single governance layer.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Vertica hostname or IP address&lt;/strong&gt; — the coordinator node of your Vertica cluster&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; — Vertica defaults to &lt;code&gt;5433&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; — a Vertica user with &lt;code&gt;SELECT&lt;/code&gt; privileges on the tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — port 5433 must be reachable from Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-vertica-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Vertica to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Vertica Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar and select &lt;strong&gt;Vertica&lt;/strong&gt; from the database source types.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;analytics-vertica&lt;/code&gt; or &lt;code&gt;web-analytics&lt;/code&gt;). This name appears in SQL queries as the source prefix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Your Vertica coordinator host.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;5433&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The Vertica database name.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Provide the username and password for a Vertica user with read access. You can also use a Secret Resource URL to manage the password through AWS Secrets Manager.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Vertica&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Idle connection pool size&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh&lt;/h3&gt;
&lt;p&gt;Configure how often Reflections refresh (re-query Vertica) and how often Dremio checks for new tables or schema changes. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Vertica Data from Dremio&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  device_type,
  COUNT(*) AS sessions,
  AVG(session_duration_seconds) AS avg_duration,
  SUM(page_views) AS total_page_views
FROM &amp;quot;analytics-vertica&amp;quot;.web.sessions
WHERE session_date &amp;gt;= &apos;2024-01-01&apos; AND session_date &amp;lt; &apos;2024-07-01&apos;
GROUP BY device_type
ORDER BY sessions DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the date filter and aggregation to Vertica&apos;s columnar engine, which processes them efficiently against its compressed, column-oriented storage.&lt;/p&gt;
&lt;h2&gt;Federate Vertica with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Vertica web analytics with PostgreSQL CRM and S3 marketing data
SELECT
  c.customer_segment,
  COUNT(v.session_id) AS total_sessions,
  AVG(v.session_duration_seconds) AS avg_session_duration,
  COUNT(DISTINCT v.user_id) AS unique_visitors,
  SUM(s3.ad_spend) AS marketing_spend,
  ROUND(COUNT(v.session_id) / NULLIF(SUM(s3.ad_spend), 0) * 1000, 2) AS sessions_per_thousand_dollars
FROM &amp;quot;analytics-vertica&amp;quot;.web.sessions v
JOIN &amp;quot;postgres-crm&amp;quot;.public.customers c ON v.user_id = c.customer_id
LEFT JOIN &amp;quot;s3-marketing&amp;quot;.campaigns.spend_by_segment s3 ON c.customer_segment = s3.segment
WHERE v.session_date &amp;gt;= &apos;2024-01-01&apos;
GROUP BY c.customer_segment
ORDER BY total_sessions DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Vertica handles the session aggregation, PostgreSQL handles the customer lookup, and Dremio handles the cross-source join.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.web_performance AS
SELECT
  v.device_type,
  v.session_date,
  COUNT(*) AS sessions,
  AVG(v.session_duration_seconds) AS avg_duration_seconds,
  SUM(v.page_views) AS total_page_views,
  SUM(CASE WHEN v.converted = true THEN 1 ELSE 0 END) AS conversions,
  ROUND(SUM(CASE WHEN v.converted = true THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS conversion_rate_pct,
  CASE
    WHEN AVG(v.session_duration_seconds) &amp;gt; 300 THEN &apos;High Engagement&apos;
    WHEN AVG(v.session_duration_seconds) &amp;gt; 120 THEN &apos;Moderate Engagement&apos;
    ELSE &apos;Low Engagement&apos;
  END AS engagement_tier
FROM &amp;quot;analytics-vertica&amp;quot;.web.sessions v
GROUP BY v.device_type, v.session_date;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) on the view, then click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. This creates the business context that powers AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Vertica Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The built-in AI Agent lets users ask questions about your Vertica data in plain English. Instead of writing complex analytical SQL, a marketing manager can ask &amp;quot;What&apos;s our conversion rate on mobile this quarter?&amp;quot; The Agent reads the wiki descriptions attached to your views, understands what &amp;quot;conversion rate&amp;quot; and &amp;quot;mobile&amp;quot; mean in your data, and generates the correct SQL.&lt;/p&gt;
&lt;p&gt;The quality of the AI Agent&apos;s responses depends directly on the quality of your semantic layer. Wikis that explain &amp;quot;conversion_rate_pct is the percentage of web sessions that resulted in a purchase&amp;quot; produce better results than technical column names alone.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; extends AI capabilities to external chat clients. Connect Claude or ChatGPT to your Dremio data through the hosted MCP Server with OAuth authentication:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Now your team can ask Claude &amp;quot;Analyze our web engagement trends from Vertica data this quarter&amp;quot; and get accurate, governed results — without writing SQL or accessing Vertica directly.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI SQL functions directly in queries to enrich Vertica data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify web sessions by potential value
SELECT
  session_id,
  device_type,
  page_views,
  session_duration_seconds,
  AI_CLASSIFY(
    &apos;Based on this browsing behavior, classify the user intent&apos;,
    &apos;Device: &apos; || device_type || &apos;, Pages: &apos; || CAST(page_views AS VARCHAR) || &apos;, Duration: &apos; || CAST(session_duration_seconds AS VARCHAR) || &apos;s&apos;,
    ARRAY[&apos;Purchase Intent&apos;, &apos;Research&apos;, &apos;Browsing&apos;, &apos;Bounced&apos;]
  ) AS predicted_intent
FROM &amp;quot;analytics-vertica&amp;quot;.web.sessions
WHERE session_date = CURRENT_DATE;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; runs LLM inference inside your SQL query, classifying each web session from Vertica data into intent categories. &lt;code&gt;AI_GENERATE&lt;/code&gt; can produce narrative summaries, and &lt;code&gt;AI_SIMILARITY&lt;/code&gt; can find semantic matches between text fields.&lt;/p&gt;
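&lt;p&gt;A hedged sketch of &lt;code&gt;AI_GENERATE&lt;/code&gt; in the same spirit — the exact argument shape may differ by Dremio release, and the prompt wording here is only an example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Produce a one-sentence narrative per day of session activity
SELECT
  session_date,
  AI_GENERATE(
    &apos;Summarize in one sentence: &apos; ||
    CAST(COUNT(*) AS VARCHAR) || &apos; sessions with &apos; ||
    CAST(SUM(page_views) AS VARCHAR) || &apos; total page views.&apos;
  ) AS daily_summary
FROM &amp;quot;analytics-vertica&amp;quot;.web.sessions
GROUP BY session_date;
&lt;/code&gt;&lt;/pre&gt;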
&lt;h2&gt;Accelerate Vertica Queries with Reflections&lt;/h2&gt;
&lt;p&gt;Create Reflections on your most frequently queried views:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed metrics)&lt;/li&gt;
&lt;li&gt;Select columns and aggregations&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — for Vertica data that updates daily, daily refresh works; for real-time dashboards, match the refresh to your SLA&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected via Arrow Flight or ODBC get sub-second response times from Reflections, even though the underlying data lives in Vertica. A conversion analytics dashboard that queries Vertica 96 times per day with a daily Reflection refresh consumes Vertica resources only once — a 99% reduction in cluster load.&lt;/p&gt;
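&lt;p&gt;An Aggregation Reflection can also be defined in SQL rather than the UI; a sketch against the &lt;code&gt;web_performance&lt;/code&gt; view from earlier (the reflection name is arbitrary, and the DDL shape may vary by Dremio version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Pre-compute sums grouped by device and date
ALTER DATASET analytics.gold.web_performance
CREATE AGGREGATE REFLECTION web_perf_agg
USING
  DIMENSIONS (device_type, session_date)
  MEASURES (sessions (SUM), total_page_views (SUM));
&lt;/code&gt;&lt;/pre&gt;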
&lt;h2&gt;Governance Across Vertica and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) provides governance that extends beyond Vertica to every connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask conversion rates, revenue data, or user identifiers from specific roles. A product manager sees engagement metrics but not raw revenue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data visibility based on user roles. Regional teams see only their region&apos;s data automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across Vertica, PostgreSQL, S3, BigQuery, and all other sources — no per-database policy management.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access to Vertica analytics&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Vertica data from their IDE. Ask Copilot &amp;quot;Show me conversion rates by device type from web analytics&amp;quot; and get SQL generated from your semantic layer — without switching to the Dremio console.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Vertica vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Vertica:&lt;/strong&gt; Active analytical workloads optimized with Vertica projections, data with complex Vertica-specific features (database designer optimizations, flex tables), workloads that depend on Vertica&apos;s sub-second response times for real-time dashboards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical data and archives, datasets consumed by non-Vertica tools, data where Vertica licensing cost per TB exceeds the analytical value, datasets that benefit from time travel and automated compaction.&lt;/p&gt;
&lt;p&gt;For data that stays in Vertica, create manual Reflections to reduce query load. For migrated data, Dremio&apos;s Open Catalog provides automated compaction, time travel, and Autonomous Reflections at a fraction of the per-TB cost.&lt;/p&gt;
&lt;h2&gt;Vertica Deployment Modes and Dremio&lt;/h2&gt;
&lt;p&gt;Vertica has two deployment modes, both compatible with Dremio:&lt;/p&gt;
&lt;h3&gt;Enterprise Mode (On-Premises)&lt;/h3&gt;
&lt;p&gt;Traditional deployment with local storage. Dremio connects via JDBC and pushes SQL operations to Vertica&apos;s engine when possible. Reflections are particularly valuable here — they offload analytical queries and reduce the on-premises compute needed.&lt;/p&gt;
&lt;h3&gt;EON Mode (Cloud-Optimized)&lt;/h3&gt;
&lt;p&gt;Vertica&apos;s compute-storage separation architecture on AWS, Azure, or GCP. Dremio connects the same way, but EON mode&apos;s elastic compute makes Reflections&apos; cost-saving impact even more significant — when Dremio serves cached results, EON subclusters can scale down.&lt;/p&gt;
&lt;h2&gt;Vertica-Specific SQL Considerations&lt;/h2&gt;
&lt;p&gt;Dremio handles most Vertica SQL natively. For Vertica-specific syntax:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Projections:&lt;/strong&gt; Vertica projections are transparent to Dremio — Vertica automatically uses optimal projections for queries pushed down&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flex tables:&lt;/strong&gt; Dremio reads flex table columns as VARCHAR — cast to appropriate types in your Dremio views&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;COPY LOCAL:&lt;/strong&gt; Not available through Dremio — use Dremio&apos;s own CREATE TABLE AS SELECT for data loading&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vertica ML functions:&lt;/strong&gt; Use external queries for Vertica-specific ML functions: &lt;code&gt;SELECT * FROM TABLE(&amp;quot;vertica-analytics&amp;quot;.EXTERNAL_QUERY(&apos;SELECT PREDICT_LINEAR...&apos;))&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
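&lt;p&gt;For flex tables, a thin typed view keeps the casting in one place; the column names below are hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Flex-table fields arrive as VARCHAR; cast them once in a Dremio view
CREATE VIEW analytics.silver.events_typed AS
SELECT
  CAST(event_id AS BIGINT) AS event_id,
  CAST(event_ts AS TIMESTAMP) AS event_ts,
  CAST(revenue AS DECIMAL(12, 2)) AS revenue,
  event_name
FROM &amp;quot;analytics-vertica&amp;quot;.web.raw_events;
&lt;/code&gt;&lt;/pre&gt;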
&lt;h2&gt;Migration ROI Example&lt;/h2&gt;
&lt;p&gt;A mid-sized organization with 50TB in Vertica Enterprise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Current cost:&lt;/strong&gt; ~$500K/year in Vertica licensing (per-TB pricing)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migrate 30TB of historical data to Iceberg:&lt;/strong&gt; Eliminates 60% of licensed data volume&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Remaining 20TB in Vertica:&lt;/strong&gt; Active analytical workloads, protected by Reflections&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Net result:&lt;/strong&gt; Potential 40-60% reduction in Vertica licensing costs, with improved analytics capabilities (AI, federation, governance) on all data&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Vertica users can reduce licensing pressure, federate with cloud sources, modernize incrementally, and add AI analytics — all through Dremio Cloud. Connect your Vertica cluster to Dremio, create Reflections on your most-queried tables, and start tracking the reduction in Vertica query load as Dremio serves cached results.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-vertica-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Vertica cluster alongside your other data sources.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Azure Synapse Analytics to Dremio Cloud: Multi-Cloud Data Warehouse Federation</title><link>https://iceberglakehouse.com/posts/2026-03-connector-azure-synapse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-azure-synapse/</guid><description>
Microsoft Azure Synapse Analytics combines big data analytics and enterprise data warehousing into a single Azure-integrated platform. If your organi...</description><pubDate>Sun, 01 Mar 2026 20:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Microsoft Azure Synapse Analytics combines big data analytics and enterprise data warehousing into a single Azure-integrated platform. If your organization has chosen the Microsoft cloud ecosystem, your cleaned and modeled analytical data likely lives in Synapse dedicated SQL pools or serverless SQL pools. Synapse works well within Azure, but it creates challenges when you need to connect that data with AWS, Google Cloud, or on-premises databases. Azure Data Factory pipelines handle some of this, but they add cost, latency, and engineering complexity.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Azure Synapse and federates it with every other data source in your organization. Synapse queries push down to Synapse&apos;s engine for processing, and Dremio handles cross-source joins, query acceleration with Reflections, unified governance, and AI-powered analytics. You keep your investment in Synapse while extending its reach beyond the Azure ecosystem.&lt;/p&gt;
&lt;h2&gt;Why Azure Synapse Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Multi-Cloud Analytics Without Data Movement&lt;/h3&gt;
&lt;p&gt;Your Azure Synapse workspace holds curated sales and finance data, but your application database runs on Amazon RDS (PostgreSQL), your marketing attribution data is in Google BigQuery, and your raw event logs sit in Amazon S3. Without a federation layer, joining these datasets requires Azure Data Factory to extract data from non-Azure sources, transform it, and load it into Synapse — a process that can take hours and costs real money in compute and data egress.&lt;/p&gt;
&lt;p&gt;Dremio eliminates this entirely. Connect Synapse, PostgreSQL, BigQuery, and S3 as separate sources in Dremio, and write a single SQL query that joins across all four. Dremio&apos;s query optimizer pushes filtering and aggregation to each source (predicate pushdown), transfers only the results, and handles the cross-source join in its own engine. No pipelines. No data movement.&lt;/p&gt;
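&lt;p&gt;A sketch of such a cross-source query — the source and table names are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Synapse sales with RDS PostgreSQL customers and BigQuery attribution
SELECT
  s.fiscal_quarter,
  c.customer_segment,
  SUM(s.revenue) AS revenue,
  SUM(b.attributed_conversions) AS conversions
FROM &amp;quot;synapse-analytics&amp;quot;.finance.quarterly_sales s
JOIN &amp;quot;postgres-app&amp;quot;.public.customers c
  ON s.customer_id = c.customer_id
LEFT JOIN &amp;quot;bigquery-marketing&amp;quot;.attribution.by_segment b
  ON c.customer_segment = b.segment
GROUP BY s.fiscal_quarter, c.customer_segment;
&lt;/code&gt;&lt;/pre&gt;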
&lt;h3&gt;Cost Optimization Through Reflections&lt;/h3&gt;
&lt;p&gt;Synapse dedicated SQL pools charge based on the Data Warehouse Units (DWUs) provisioned, and serverless pools charge per TB of data processed. Dashboard queries that run every 15 minutes, ad-hoc exploration by analysts, and scheduled reports all consume Synapse compute resources.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s Reflections create pre-computed materializations of your most frequently run queries. After the initial execution, subsequent queries that match the Reflection pattern are served from Dremio&apos;s cache — not from Synapse. This can reduce Synapse compute consumption by 50-80% for dashboard and reporting workloads, directly lowering your Azure bill.&lt;/p&gt;
&lt;h3&gt;Unified Governance Across Clouds&lt;/h3&gt;
&lt;p&gt;Azure Synapse has role-based access control and Azure Active Directory integration within the Azure ecosystem. But those policies don&apos;t extend to your AWS databases or Google Cloud data. Dremio&apos;s Fine-Grained Access Control (FGAC) applies consistent column masking (hiding Social Security numbers, email addresses) and row-level filtering (restricting data by region or department) across Synapse and every other connected source. One governance policy, applied everywhere.&lt;/p&gt;
&lt;h3&gt;The Semantic Layer for Business Context&lt;/h3&gt;
&lt;p&gt;Raw Synapse tables have technical column names and no business context. Dremio lets you create views that encapsulate business logic (what &amp;quot;active customer&amp;quot; or &amp;quot;quarterly revenue&amp;quot; means), then attach wiki descriptions and labels to those views. This semantic layer makes your data self-documenting and powers Dremio&apos;s AI capabilities.&lt;/p&gt;
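&lt;p&gt;For example, a view that pins down what &amp;quot;active customer&amp;quot; means — the names and the 90-day threshold are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- An active customer: at least one order in the last 90 days
CREATE VIEW sales.gold.active_customers AS
SELECT customer_id, last_order_date, lifetime_revenue
FROM &amp;quot;synapse-analytics&amp;quot;.sales.customers
WHERE last_order_date &amp;gt;= CURRENT_DATE - INTERVAL &apos;90&apos; DAY;
&lt;/code&gt;&lt;/pre&gt;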
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before connecting Azure Synapse to Dremio Cloud, confirm you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Synapse SQL endpoint&lt;/strong&gt; — the fully qualified server name from your Synapse workspace (e.g., &lt;code&gt;myworkspace.sql.azuresynapse.net&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; — default &lt;code&gt;1433&lt;/code&gt; (Synapse uses the same port as SQL Server)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; — the specific SQL pool (dedicated or serverless) you want to connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; — SQL authentication credentials with read access to the tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — Synapse&apos;s firewall must allow connections from Dremio Cloud&apos;s IP addresses. Configure this in the Synapse workspace&apos;s networking settings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-azure-synapse-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Azure Synapse to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Synapse Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar and select &lt;strong&gt;Microsoft Azure Synapse Analytics&lt;/strong&gt; from the database source types. Alternatively, navigate to &lt;strong&gt;Databases&lt;/strong&gt; and click &lt;strong&gt;Add database&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier for this source (e.g., &lt;code&gt;synapse-analytics&lt;/code&gt; or &lt;code&gt;azure-sales-warehouse&lt;/code&gt;). This name appears in your SQL queries as the source prefix. Cannot include &lt;code&gt;/&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, or &lt;code&gt;]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Your Synapse SQL endpoint (e.g., &lt;code&gt;myworkspace.sql.azuresynapse.net&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;1433&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The SQL pool name you want to connect to.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Master Credentials:&lt;/strong&gt; Enter the SQL authentication username and password with &lt;code&gt;SELECT&lt;/code&gt; permissions on the schemas and tables you want to query.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Resource URL:&lt;/strong&gt; Store the password in AWS Secrets Manager and provide the ARN. Dremio fetches the password at connection time for centralized credential management.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Synapse&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Idle connection pool size&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL/TLS encryption&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection Refresh and Metadata&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflection Refresh:&lt;/strong&gt; How often Dremio re-queries Synapse to update cached materializations. For dashboards with hourly data, set to 1-4 hours. For stable reporting data, daily or weekly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata Refresh:&lt;/strong&gt; How often Dremio checks for new tables or schema changes. Default 1 hour for discovery, 1 hour for details.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict which Dremio users or roles can access this Synapse source. Click &lt;strong&gt;Save&lt;/strong&gt; to create the connection.&lt;/p&gt;
&lt;h2&gt;Query Azure Synapse Data from Dremio&lt;/h2&gt;
&lt;p&gt;Once connected, browse your Synapse schemas and tables in the SQL Runner:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT region, product_line, SUM(revenue) AS total_revenue, COUNT(order_id) AS order_count
FROM &amp;quot;synapse-analytics&amp;quot;.dbo.sales_summary
WHERE fiscal_year = 2024 AND region IN (&apos;EMEA&apos;, &apos;APAC&apos;, &apos;Americas&apos;)
GROUP BY region, product_line
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the &lt;code&gt;WHERE&lt;/code&gt; clause and aggregation to Synapse — only the summarized result crosses the network.&lt;/p&gt;
&lt;h2&gt;Federate Azure Synapse with Other Sources&lt;/h2&gt;
&lt;p&gt;The real power emerges when you combine Synapse data with non-Azure sources:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Synapse sales data with AWS-hosted CRM and S3 marketing data
SELECT
  syn.region,
  syn.product_line,
  syn.total_revenue AS synapse_revenue,
  pg.customer_count,
  s3.marketing_spend,
  ROUND(syn.total_revenue / NULLIF(s3.marketing_spend, 0), 2) AS revenue_per_marketing_dollar
FROM (
  SELECT region, product_line, SUM(revenue) AS total_revenue
  FROM &amp;quot;synapse-analytics&amp;quot;.dbo.sales_summary
  WHERE fiscal_year = 2024
  GROUP BY region, product_line
) syn
LEFT JOIN (
  SELECT region, COUNT(DISTINCT customer_id) AS customer_count
  FROM &amp;quot;postgres-crm&amp;quot;.public.customers
  GROUP BY region
) pg ON syn.region = pg.region
LEFT JOIN &amp;quot;s3-marketing&amp;quot;.campaigns.regional_spend s3
  ON syn.region = s3.region
ORDER BY revenue_per_marketing_dollar DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Three sources across two clouds (Azure Synapse, AWS-hosted PostgreSQL, S3), one query, no ETL pipelines.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer Over Synapse Data&lt;/h2&gt;
&lt;p&gt;Create views that translate technical Synapse schemas into business-friendly analytics:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.regional_performance AS
SELECT
  s.region,
  s.product_line,
  SUM(s.revenue) AS total_revenue,
  SUM(s.cost) AS total_cost,
  SUM(s.revenue) - SUM(s.cost) AS gross_profit,
  ROUND((SUM(s.revenue) - SUM(s.cost)) / NULLIF(SUM(s.revenue), 0) * 100, 1) AS profit_margin_pct,
  CASE
    WHEN SUM(s.revenue) &amp;gt; 1000000 THEN &apos;Major Market&apos;
    WHEN SUM(s.revenue) &amp;gt; 250000 THEN &apos;Growth Market&apos;
    ELSE &apos;Emerging Market&apos;
  END AS market_tier
FROM &amp;quot;synapse-analytics&amp;quot;.dbo.sales_summary s
WHERE s.fiscal_year = 2024
GROUP BY s.region, s.product_line;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) on this view, go to the &lt;strong&gt;Details&lt;/strong&gt; tab, and click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Dremio&apos;s generative AI samples the view schema and data to produce descriptions that help analysts and AI tools understand the dataset.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Synapse Data&lt;/h2&gt;
&lt;p&gt;Dremio provides three AI capabilities that transform how you work with Synapse data:&lt;/p&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The built-in AI Agent lets users ask questions about your Synapse data in plain English. Instead of writing SQL, a business user can ask &amp;quot;What&apos;s our profit margin by region?&amp;quot; and the AI Agent generates the correct SQL based on the semantic layer (wikis, labels, view definitions) you&apos;ve built.&lt;/p&gt;
&lt;p&gt;The AI Agent reads the wiki descriptions you attached to your views to understand what columns mean in business terms. This is why the semantic layer matters — better metadata produces more accurate AI-generated queries. For example, if your &lt;code&gt;regional_performance&lt;/code&gt; view has a wiki that says &amp;quot;profit_margin_pct represents the gross profit margin after cost of goods sold,&amp;quot; the Agent uses that context to correctly answer &amp;quot;Which regions are most profitable?&amp;quot;&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; extends AI capabilities beyond Dremio&apos;s own interface. It&apos;s an open-source project that enables AI chat clients like Claude and ChatGPT to securely interact with your Dremio data using natural language.&lt;/p&gt;
&lt;p&gt;The Dremio-hosted MCP Server provides OAuth support, which enforces user identity, authentication, and authorization on every interaction. Once connected, you can use natural language in Claude or ChatGPT to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Explore your Synapse data schemas and tables&lt;/li&gt;
&lt;li&gt;Run analytical queries and get results&lt;/li&gt;
&lt;li&gt;Create visualizations from query results&lt;/li&gt;
&lt;li&gt;Build and save views&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Setup is straightforward:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure the redirect URLs for your AI chat client&lt;/li&gt;
&lt;li&gt;Connect using the MCP endpoint: &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt; (US) or &lt;code&gt;mcp.eu.dremio.cloud/mcp/{project_id}&lt;/code&gt; (EU)&lt;/li&gt;
&lt;/ol&gt;
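&lt;p&gt;The endpoint pattern in step 3 is easy to get wrong by hand, so here is a small sketch that assembles it. The &lt;code&gt;mcp_endpoint&lt;/code&gt; helper name and the &lt;code&gt;https://&lt;/code&gt; scheme are illustrative assumptions — only the US/EU host names come from the article above:&lt;/p&gt;

```python
def mcp_endpoint(project_id, region="us"):
    """Assemble the Dremio-hosted MCP endpoint URL for a project.

    Illustrative helper: the https:// scheme is an assumption; the
    us/eu host names are the ones listed in step 3 above.
    """
    hosts = {"us": "mcp.dremio.cloud", "eu": "mcp.eu.dremio.cloud"}
    if region not in hosts:
        raise ValueError("unknown region: " + repr(region))
    return "https://" + hosts[region] + "/mcp/" + project_id

url = mcp_endpoint("0a1b2c3d-example-project")
# https://mcp.dremio.cloud/mcp/0a1b2c3d-example-project
```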
&lt;p&gt;This means a marketing manager can ask Claude &amp;quot;Show me our top 5 regions by profit margin from the Synapse sales data&amp;quot; and get accurate, governed results — without knowing SQL or having direct Synapse access.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Dremio provides built-in AI SQL functions that you can use directly in queries against any connected data, including Synapse:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify products based on their Synapse metadata
SELECT
  product_line,
  total_revenue,
  AI_CLASSIFY(
    &apos;Based on this revenue and growth pattern, classify the product health&apos;,
    product_line || &apos;: $&apos; || CAST(total_revenue AS VARCHAR) || &apos; revenue&apos;,
    ARRAY[&apos;Thriving&apos;, &apos;Stable&apos;, &apos;Declining&apos;, &apos;At Risk&apos;]
  ) AS product_health
FROM &amp;quot;synapse-analytics&amp;quot;.dbo.product_summary;

-- Generate summaries from Synapse data
SELECT
  region,
  AI_GENERATE(
    &apos;Write a one-sentence business summary for this regional performance&apos;,
    &apos;Region: &apos; || region || &apos;, Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Growth: &apos; || CAST(yoy_growth AS VARCHAR) || &apos;%&apos;
  ) AS executive_summary
FROM &amp;quot;synapse-analytics&amp;quot;.dbo.regional_metrics;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These functions run LLM inference directly in your SQL queries, turning raw Synapse data into AI-enriched insights.&lt;/p&gt;
&lt;h2&gt;Accelerate Synapse Queries with Reflections&lt;/h2&gt;
&lt;p&gt;For queries that run repeatedly (dashboard refreshes, scheduled reports):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build a view over your Synapse data (like &lt;code&gt;regional_performance&lt;/code&gt; above).&lt;/li&gt;
&lt;li&gt;In the Catalog, select the view and create a &lt;strong&gt;Reflection&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose the columns and aggregations to include.&lt;/li&gt;
&lt;li&gt;Set the refresh interval (how often Dremio re-queries Synapse to update the Reflection).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After the Reflection is built, Dremio&apos;s query optimizer automatically routes matching queries to the Reflection. Your BI tools (Power BI, Tableau) connected via Arrow Flight or ODBC get sub-second responses from the Reflection instead of waiting for Synapse to process the query. The acceleration is completely transparent — users write the same SQL and see the same data, just faster.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Synapse vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Synapse:&lt;/strong&gt; Data actively consumed by Azure-native tools (Power BI with DirectQuery, Azure Machine Learning), data with complex Synapse-specific transformations, data shared through Azure Data Share.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical archive data that&apos;s rarely updated, large analytical datasets that would benefit from automated compaction and manifest optimization, datasets that need time travel (query as of any past timestamp), data that other teams access through non-Azure tools.&lt;/p&gt;
&lt;p&gt;For data that stays in Synapse, create manual Reflections with refresh schedules matching your data freshness requirements. For migrated Iceberg data, Dremio&apos;s Open Catalog provides automated compaction, time travel, and Autonomous Reflections.&lt;/p&gt;
&lt;h2&gt;Governance Across Azure Synapse and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) extends Synapse&apos;s Azure AD-based security to every connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask revenue, cost, and margin data from specific roles. A marketing analyst sees conversion counts but not financial details.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional managers see only their region&apos;s data automatically across all sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies to Synapse, PostgreSQL, S3, BigQuery, and all other sources — no per-service security configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across every access path: SQL Runner, BI tools, AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio native connector — ideal for Azure-centric organizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Synapse data from their IDE. Ask Copilot &amp;quot;Show me regional profit margins from Azure Synapse&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Azure Synapse users can extend their warehouse beyond the Azure ecosystem, reduce compute costs with Reflections, and enable AI-powered analytics across all their data sources — all through Dremio Cloud.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-azure-synapse-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Azure Synapse workspace alongside your other data sources.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Snowflake to Dremio Cloud: Federate, Govern, and Accelerate Beyond Snowflake</title><link>https://iceberglakehouse.com/posts/2026-03-connector-snowflake/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-snowflake/</guid><description>
Snowflake is a popular cloud data warehouse known for its separation of storage and compute, near-zero maintenance, and broad ecosystem. Many organiz...</description><pubDate>Sun, 01 Mar 2026 19:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Snowflake is a popular cloud data warehouse known for its separation of storage and compute, near-zero maintenance, and broad ecosystem. Many organizations have made Snowflake their primary analytics platform. But as data ecosystems mature, limitations emerge: Snowflake credits are consumed on every query, connecting Snowflake data to non-Snowflake sources requires data sharing agreements or ETL, and running all workloads in Snowflake means paying Snowflake prices for everything — including repetitive dashboard queries and ad-hoc exploration.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Snowflake as a federated data source. You can query Snowflake tables directly, join them with PostgreSQL, S3, MongoDB, BigQuery, and any other connected source in a single SQL query, and accelerate repeated queries with Reflections so they don&apos;t burn Snowflake credits on every execution.&lt;/p&gt;
&lt;p&gt;Snowflake&apos;s native Iceberg Tables feature allows managing Iceberg-formatted data within Snowflake. However, this still keeps your compute costs within Snowflake&apos;s pricing model. By combining Dremio Cloud with Snowflake (and potentially Snowflake&apos;s Open Catalog for shared Iceberg access), organizations can use Snowflake for data engineering while leveraging Dremio for cost-optimized analytical serving. This hybrid approach gives you Snowflake&apos;s data engineering strengths without paying Snowflake credit rates for every analytical query.&lt;/p&gt;
&lt;p&gt;The cost concern is real: organizations regularly report that 40-60% of their Snowflake spend comes from dashboards, scheduled reports, and ad-hoc queries — workloads that are fundamentally repetitive and ideal for Reflection-based caching.&lt;/p&gt;
&lt;h2&gt;Why Snowflake Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Reduce Snowflake Credit Consumption&lt;/h3&gt;
&lt;p&gt;Every query in Snowflake consumes credits based on the warehouse size and query runtime. Dashboard queries that run every 15 minutes, analytics training sessions, ad-hoc data exploration by 50 analysts, and nightly scheduled reports all consume credits.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s Reflections create pre-computed materializations of frequently executed queries. After the initial run, matching queries are served from Dremio&apos;s cache instead of Snowflake. For organizations spending over $100K/year on Snowflake compute, routing read-heavy analytical and dashboard workloads through Dremio can reduce credit consumption by 30-70% on those workloads.&lt;/p&gt;
&lt;h3&gt;Federation Beyond Snowflake&lt;/h3&gt;
&lt;p&gt;Snowflake&apos;s data sharing works between Snowflake accounts. But what about your PostgreSQL application database, your S3 data lake, your MongoDB user profiles, or your on-premises Oracle ERP? Joining these with Snowflake data requires ETL pipelines — extracting from each source, transforming, and loading into Snowflake. Dremio queries each source in place and joins the results in its own engine. No data movement, no Snowflake ingestion costs.&lt;/p&gt;
&lt;h3&gt;Unified Governance&lt;/h3&gt;
&lt;p&gt;Snowflake has robust access controls within Snowflake. But governing data across Snowflake, PostgreSQL, S3, and MongoDB requires separate policies in each system. Dremio&apos;s Fine-Grained Access Control applies consistent column masking and row-level filtering across all connected sources from a single interface.&lt;/p&gt;
&lt;h3&gt;AI Analytics Across All Sources&lt;/h3&gt;
&lt;p&gt;Snowflake has AI/ML features within its ecosystem (Cortex). Dremio adds AI capabilities that span your entire data estate, not just Snowflake — including an AI Agent for natural language queries, an MCP Server for external AI tools, and SQL-level AI functions.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snowflake account URL&lt;/strong&gt; (e.g., &lt;code&gt;myaccount.snowflakecomputing.com&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; (or OAuth/key pair authentication)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Warehouse name&lt;/strong&gt; — the compute resource Snowflake uses for queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; — the Snowflake database you want to connect to&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; from Dremio Cloud to your Snowflake instance&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-snowflake-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Snowflake to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Snowflake Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the left sidebar and select &lt;strong&gt;Snowflake&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;snowflake-warehouse&lt;/code&gt; or &lt;code&gt;analytics-snowflake&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Account URL:&lt;/strong&gt; Your Snowflake account URL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Warehouse:&lt;/strong&gt; The Snowflake virtual warehouse to use for queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The Snowflake database to connect to.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Choose from Master Credentials (username/password), OAuth, or key pair authentication.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Settings&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom Snowflake connection parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh&lt;/h3&gt;
&lt;p&gt;Configure how often Reflections refresh and how often Dremio checks for schema changes. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Snowflake Data from Dremio&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  product_category,
  SUM(sales_amount) AS total_sales,
  COUNT(DISTINCT customer_id) AS unique_buyers,
  ROUND(SUM(sales_amount) / COUNT(DISTINCT customer_id), 2) AS avg_spend_per_customer
FROM &amp;quot;snowflake-warehouse&amp;quot;.PUBLIC.SALES_FACT
WHERE sale_date &amp;gt;= &apos;2024-01-01&apos; AND sale_date &amp;lt; &apos;2024-07-01&apos;
GROUP BY product_category
ORDER BY total_sales DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate Snowflake with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Snowflake sales with PostgreSQL reviews and S3 return data
SELECT
  sf.product_category,
  sf.total_sales,
  sf.unique_buyers,
  pg.avg_review_score,
  pg.review_count,
  s3.return_rate,
  ROUND(sf.total_sales * (1 - s3.return_rate), 2) AS net_revenue
FROM (
  SELECT product_category, SUM(sales_amount) AS total_sales, COUNT(DISTINCT customer_id) AS unique_buyers
  FROM &amp;quot;snowflake-warehouse&amp;quot;.PUBLIC.SALES_FACT
  WHERE sale_date &amp;gt;= &apos;2024-01-01&apos;
  GROUP BY product_category
) sf
LEFT JOIN &amp;quot;postgres-reviews&amp;quot;.public.product_reviews pg ON sf.product_category = pg.category
LEFT JOIN &amp;quot;s3-analytics&amp;quot;.returns.category_return_rates s3 ON sf.product_category = s3.category
ORDER BY net_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.product_health AS
SELECT
  sf.product_category,
  SUM(sf.sales_amount) AS total_revenue,
  COUNT(DISTINCT sf.customer_id) AS unique_customers,
  ROUND(SUM(sf.sales_amount) / COUNT(DISTINCT sf.customer_id), 2) AS customer_value,
  CASE
    WHEN SUM(sf.sales_amount) &amp;gt; 1000000 THEN &apos;Category Leader&apos;
    WHEN SUM(sf.sales_amount) &amp;gt; 250000 THEN &apos;Growth Category&apos;
    ELSE &apos;Emerging&apos;
  END AS category_tier
FROM &amp;quot;snowflake-warehouse&amp;quot;.PUBLIC.SALES_FACT sf
GROUP BY sf.product_category;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) in the Catalog, then &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt; to create AI-readable business context.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Snowflake Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Users ask questions in plain English: &amp;quot;Which product categories are growing fastest?&amp;quot; The AI Agent reads your wiki descriptions and generates accurate SQL. The semantic layer you&apos;ve built is the foundation — better descriptions mean better AI responses.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects external AI tools (Claude, ChatGPT) to your Dremio data with OAuth authentication:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A product manager can ask ChatGPT &amp;quot;What are our top 5 product categories by net revenue from Snowflake?&amp;quot; and get governed, accurate results.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate product insights with AI
SELECT
  product_category,
  total_revenue,
  customer_value,
  AI_GENERATE(
    &apos;Write a one-sentence product strategy recommendation&apos;,
    &apos;Category: &apos; || product_category || &apos;, Revenue: $&apos; || CAST(total_revenue AS VARCHAR) || &apos;, Customer Value: $&apos; || CAST(customer_value AS VARCHAR) || &apos;, Tier: &apos; || category_tier
  ) AS strategy_recommendation
FROM analytics.gold.product_health;

-- Classify product categories
SELECT
  product_category,
  AI_CLASSIFY(
    &apos;Based on these metrics, classify the investment priority&apos;,
    &apos;Revenue: $&apos; || CAST(total_revenue AS VARCHAR) || &apos;, Customers: &apos; || CAST(unique_customers AS VARCHAR),
    ARRAY[&apos;High Priority&apos;, &apos;Medium Priority&apos;, &apos;Low Priority&apos;, &apos;Divest&apos;]
  ) AS investment_priority
FROM analytics.gold.product_health;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Accelerate with Reflections&lt;/h2&gt;
&lt;p&gt;Create Reflections on frequently queried Snowflake views to offload repeated queries from Snowflake credits:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, navigate to the view you want to accelerate&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full dataset cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed SUM/COUNT/AVG)&lt;/li&gt;
&lt;li&gt;Select the columns and aggregations to include&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — balance between data freshness and Snowflake credit consumption&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After creation, Dremio&apos;s query optimizer automatically routes matching queries to the Reflection. Dashboard queries and scheduled reports hit the cache instead of consuming Snowflake credits. BI tools connected via Arrow Flight get sub-second response times.&lt;/p&gt;
&lt;h3&gt;Example: Dashboard Acceleration&lt;/h3&gt;
&lt;p&gt;A Tableau dashboard that refreshes every 15 minutes queries &lt;code&gt;product_health&lt;/code&gt;. Without Reflections, that&apos;s 96 Snowflake queries per day. With a Reflection that refreshes every 2 hours, Dremio serves 84 of those queries from cache — an 87.5% reduction in Snowflake credit consumption for that dashboard alone. Multiply that across 50 dashboards and the savings become significant.&lt;/p&gt;
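&lt;p&gt;The arithmetic behind that estimate is easy to reproduce. A back-of-the-envelope sketch, assuming each dashboard refresh maps to one query and only Reflection rebuilds hit Snowflake:&lt;/p&gt;

```python
def reflection_offload(refresh_minutes, reflection_hours, day_hours=24):
    """Fraction of daily dashboard queries served from the Reflection cache."""
    queries_per_day = day_hours * 60 // refresh_minutes   # dashboard refreshes
    snowflake_hits = day_hours // reflection_hours        # Reflection rebuilds
    cached = queries_per_day - snowflake_hits
    return queries_per_day, cached, cached / queries_per_day

total, cached, rate = reflection_offload(refresh_minutes=15, reflection_hours=2)
# total=96, cached=84, rate=0.875 — the 87.5% reduction cited above
```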
&lt;h2&gt;Governance Across Snowflake and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) provides governance capabilities that work across Snowflake and every other connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive customer data (PII, financial details) from specific user roles. A marketing analyst sees &lt;code&gt;customer_name&lt;/code&gt; but not &lt;code&gt;social_security_number&lt;/code&gt;. An auditor sees both.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Automatically filter data based on the querying user&apos;s role. A regional manager sees only their region&apos;s data across all sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; The same governance rules apply whether data comes from Snowflake, PostgreSQL, S3, or any other source — no per-source policy management.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across all access methods: SQL Runner, BI tools, AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer compared to JDBC/ODBC. After building views over Snowflake data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use Dremio&apos;s ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC with Dremio&apos;s driver&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and semantic layer context.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Snowflake data from their IDE. Ask Copilot &amp;quot;Show me product health metrics from Snowflake&amp;quot; and it generates SQL using your semantic layer — without switching to the Dremio console or Snowflake&apos;s Worksheets.&lt;/p&gt;
&lt;h2&gt;External Queries&lt;/h2&gt;
&lt;p&gt;For Snowflake-specific functions not natively supported in Dremio&apos;s SQL, use external queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(
  &amp;quot;snowflake-warehouse&amp;quot;.EXTERNAL_QUERY(
    &apos;SELECT APPROX_COUNT_DISTINCT(customer_id), MEDIAN(sales_amount) FROM PUBLIC.SALES_FACT WHERE sale_date &amp;gt;= &apos;&apos;2024-01-01&apos;&apos;&apos;
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;External queries pass raw SQL to Snowflake for execution, returning results through Dremio. This is useful for Snowflake features like &lt;code&gt;APPROX_COUNT_DISTINCT&lt;/code&gt;, the &lt;code&gt;QUALIFY&lt;/code&gt; clause, or Snowflake-specific window functions.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Snowflake vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Snowflake:&lt;/strong&gt; Data consumed by Snowflake-native tools (Snowpipe, Streams, Tasks), data shared through Snowflake Data Sharing, workloads with Snowflake-specific features (materialized views, dynamic tables), datasets actively managed by Snowflake-based ETL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical data that rarely changes, archival tables, datasets consumed primarily through non-Snowflake tools, workloads where Snowflake credit costs exceed the analytical value delivered. Migrated Iceberg tables benefit from Dremio&apos;s automatic compaction, time travel, Autonomous Reflections, and zero per-query storage costs.&lt;/p&gt;
&lt;p&gt;For data that stays in Snowflake, create manual Reflections to reduce credit consumption. For migrated Iceberg data, Dremio handles optimization automatically.&lt;/p&gt;
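&lt;p&gt;As a sketch of what that manual Reflection looks like in SQL (the view and column names here are illustrative; adjust them to your own semantic layer), Dremio&apos;s Reflection DDL can pre-aggregate the measures your dashboards hit most:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Aggregation Reflection on a view backed by the Snowflake source.
-- Matching dashboard queries are served from the Reflection,
-- so the Snowflake warehouse stays suspended.
ALTER DATASET analytics.gold.sales_summary
CREATE AGGREGATE REFLECTION sales_by_month_agg
USING
  DIMENSIONS (sale_month, region)
  MEASURES (sales_amount (SUM, COUNT), customer_id (COUNT));
&lt;/code&gt;&lt;/pre&gt;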
&lt;h2&gt;Snowflake Credit Optimization with Dremio&lt;/h2&gt;
&lt;h3&gt;Credit Consumption by Warehouse Size&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Warehouse Size&lt;/th&gt;
&lt;th&gt;Credits/Hour&lt;/th&gt;
&lt;th&gt;Dremio Reflection Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;X-Small&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Reflections serve cached queries — warehouse suspends faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Same pattern — faster auto-suspend reduces credit burn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dashboard workloads offloaded — downsize to Small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Interactive + scheduled workloads offloaded — significant savings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X-Large&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;Heavy analytical workloads cached — potential 50%+ reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Quantifying Credit Savings&lt;/h3&gt;
&lt;p&gt;Example calculation for a medium-sized analytics team:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Without Dremio:&lt;/strong&gt; 50 analysts + 20 dashboards consume ~$15,000/month in Snowflake credits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;With Dremio Reflections:&lt;/strong&gt; Dashboard queries (60% of query volume, roughly 40% of credit spend) served from cache → ~$6,000/month savings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Net impact:&lt;/strong&gt; $9,000/month Snowflake bill + Dremio costs, typically netting 20-40% total savings&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Snowflake Data Cloud Integration&lt;/h3&gt;
&lt;p&gt;Dremio doesn&apos;t replace Snowflake&apos;s Data Cloud capabilities — it complements them:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Sharing:&lt;/strong&gt; Continue sharing datasets via Snowflake Data Sharing with other Snowflake accounts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Marketplace:&lt;/strong&gt; Access Snowflake Marketplace datasets alongside your own, but federate them with non-Snowflake sources through Dremio&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowpark:&lt;/strong&gt; Continue using Snowpark for Python/Java/Scala processing within Snowflake&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio&apos;s role:&lt;/strong&gt; Federation with non-Snowflake data, AI analytics, Reflection-based BI serving, and unified governance&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Snowflake users can reduce credit consumption, federate beyond Snowflake&apos;s ecosystem, and add AI analytics — all through Dremio Cloud. The combination of Reflections (offloading repetitive dashboard and report queries), federation (joining Snowflake with PostgreSQL, S3, MongoDB, and other sources without ETL), and AI capabilities (Agent, MCP Server, SQL Functions) makes Dremio a natural complement to any Snowflake deployment.&lt;/p&gt;
&lt;p&gt;Start by connecting Snowflake to Dremio Cloud, creating Reflections on your most-queried views, and monitoring the reduction in Snowflake credit consumption. Most organizations see measurable savings within the first week as dashboard queries shift to Dremio&apos;s Reflection cache. The setup takes minutes and the ROI is immediate.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-snowflake-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Snowflake warehouse.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Google BigQuery to Dremio Cloud: Cross-Cloud Analytics Without Data Movement</title><link>https://iceberglakehouse.com/posts/2026-03-connector-google-bigquery/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-google-bigquery/</guid><description>
Google BigQuery is Google Cloud&apos;s serverless data warehouse. If your organization uses Google Cloud Platform, BigQuery is where your analytics data, ...</description><pubDate>Sun, 01 Mar 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Google BigQuery is Google Cloud&apos;s serverless data warehouse. If your organization uses Google Cloud Platform, BigQuery is where your analytics data, marketing attribution, Google Analytics exports, and machine learning model outputs live. BigQuery is powerful within Google&apos;s ecosystem, but it creates challenges when your data spans multiple clouds or when costs grow with usage.&lt;/p&gt;
&lt;p&gt;BigQuery&apos;s on-demand pricing charges per terabyte scanned. For organizations with large datasets queried frequently — especially by dashboards that refresh automatically — this can result in monthly bills that grow unpredictably. And connecting BigQuery data to non-Google tools and other cloud providers requires data exports, cross-cloud networking, or third-party ETL platforms.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to BigQuery and queries it alongside data from AWS, Azure, on-premises databases, and any other connected source. You get multi-cloud federation without data movement, AI-powered analytics, and cost optimization through Reflections.&lt;/p&gt;
&lt;p&gt;Data gravity is a real challenge for BigQuery users. Once data lands in BigQuery, Google&apos;s ecosystem encourages keeping everything there — Looker for BI, Vertex AI for ML, Cloud Dataflow for processing. But most enterprises aren&apos;t all-Google. They have data in AWS RDS, Azure SQL, S3 data lakes, and on-premises systems. Moving all that data into BigQuery is expensive (ingestion costs, ongoing storage) and creates vendor lock-in. Dremio&apos;s federation approach queries each source in place, avoiding the data gravity trap while still giving you unified analytics across your entire data estate.&lt;/p&gt;
&lt;h2&gt;Why BigQuery Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Control BigQuery Costs with Reflections&lt;/h3&gt;
&lt;p&gt;BigQuery&apos;s on-demand pricing charges per terabyte scanned, regardless of whether you&apos;ve run the same query before. A dashboard that refreshes every 15 minutes, querying the same 500GB table, generates substantial costs. Dremio&apos;s Reflections solve this: Dremio maintains a pre-computed materialization of your view, refreshed on the schedule you choose. Queries that match the Reflection pattern are served from that materialization — no BigQuery scan, no per-TB charge.&lt;/p&gt;
&lt;p&gt;For organizations with heavy dashboard and reporting workloads, this can reduce BigQuery costs by 50-80% on those specific query patterns.&lt;/p&gt;
&lt;h3&gt;Multi-Cloud Analytics&lt;/h3&gt;
&lt;p&gt;Your Google Analytics data is in BigQuery, your application database is in PostgreSQL (running on AWS RDS), your product catalog is in SQL Server (on Azure), and your raw event logs are in Amazon S3. Without a federation layer, joining these datasets requires building ETL pipelines for each source-destination pair. Dremio eliminates this: connect all four as sources and write a single SQL query that joins across them.&lt;/p&gt;
&lt;h3&gt;Unified Governance Across Clouds&lt;/h3&gt;
&lt;p&gt;BigQuery has IAM policies and column-level security within Google Cloud. But those policies don&apos;t extend to your PostgreSQL database, S3 data lake, or Snowflake warehouse. Dremio&apos;s Fine-Grained Access Control (FGAC) applies consistent row-level security and column masking across BigQuery and every other connected source. One governance policy, everywhere.&lt;/p&gt;
&lt;h3&gt;The Semantic Layer for AI&lt;/h3&gt;
&lt;p&gt;Raw BigQuery tables have technical column names and fragmented schemas. Dremio lets you create views that consolidate and rename these into business-friendly structures, then attach wiki descriptions and labels. This semantic layer makes your BigQuery data queryable by AI tools — both Dremio&apos;s built-in AI Agent and external AI clients through the MCP Server.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Cloud project ID&lt;/strong&gt; — the GCP project containing your BigQuery datasets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service Account JSON key&lt;/strong&gt; — a GCP service account with the BigQuery Data Viewer role (or custom role with &lt;code&gt;bigquery.tables.getData&lt;/code&gt;, &lt;code&gt;bigquery.jobs.create&lt;/code&gt; permissions)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — Dremio Cloud connects to Google Cloud APIs over HTTPS&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-google-bigquery-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect BigQuery to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the BigQuery Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the left sidebar and select &lt;strong&gt;Google BigQuery&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;bigquery-marketing&lt;/code&gt; or &lt;code&gt;gcp-analytics&lt;/code&gt;). This appears in SQL queries as the source prefix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Project ID:&lt;/strong&gt; Your Google Cloud project ID (e.g., &lt;code&gt;my-company-analytics-123456&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service Account Key:&lt;/strong&gt; Upload or paste the JSON key file for your service account.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Configure Advanced Settings&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching Enabled&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cache BigQuery metadata locally for faster schema browsing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Billing Project&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specify which GCP project is billed for queries (important for cross-project access)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom parameters for the BigQuery connection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;4. Set Reflection and Metadata Refresh&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflection Refresh:&lt;/strong&gt; How often Dremio re-queries BigQuery to update cached Reflections. Balance between data freshness and BigQuery scan costs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata Refresh:&lt;/strong&gt; How often Dremio checks for new datasets or schema changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict access, then click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query BigQuery Data from Dremio&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query BigQuery marketing data
SELECT
  campaign_name,
  SUM(clicks) AS total_clicks,
  SUM(conversions) AS total_conversions,
  ROUND(SUM(conversions) * 100.0 / NULLIF(SUM(clicks), 0), 2) AS conversion_rate
FROM &amp;quot;bigquery-marketing&amp;quot;.analytics.campaign_metrics
WHERE date &amp;gt;= &apos;2024-01-01&apos; AND date &amp;lt; &apos;2024-07-01&apos;
GROUP BY campaign_name
ORDER BY total_conversions DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate BigQuery with Other Clouds&lt;/h2&gt;
&lt;p&gt;Join BigQuery marketing data with AWS-hosted application data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  bq.campaign_name,
  SUM(bq.total_clicks) AS total_clicks,
  SUM(bq.total_conversions) AS total_conversions,
  SUM(pg.order_total) AS attributed_revenue,
  ROUND(SUM(pg.order_total) / NULLIF(SUM(bq.total_conversions), 0), 2) AS revenue_per_conversion
FROM (
  SELECT campaign_name, user_id, SUM(clicks) AS total_clicks, SUM(conversions) AS total_conversions
  FROM &amp;quot;bigquery-marketing&amp;quot;.analytics.campaign_clicks
  WHERE date &amp;gt;= &apos;2024-01-01&apos;
  GROUP BY campaign_name, user_id
) bq
JOIN (
  -- pre-aggregate orders per customer so the join stays one row per user
  SELECT customer_id, SUM(order_total) AS order_total
  FROM &amp;quot;postgres-orders&amp;quot;.public.orders
  WHERE order_date &amp;gt;= &apos;2024-01-01&apos;
  GROUP BY customer_id
) pg
  ON bq.user_id = pg.customer_id
GROUP BY bq.campaign_name
ORDER BY attributed_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two clouds, one query, zero ETL.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.campaign_performance AS
SELECT
  bq.campaign_name,
  SUM(bq.clicks) AS total_clicks,
  SUM(bq.conversions) AS total_conversions,
  ROUND(SUM(bq.conversions) * 100.0 / NULLIF(SUM(bq.clicks), 0), 2) AS conversion_rate_pct,
  SUM(bq.cost) AS total_ad_spend,
  CASE
    WHEN SUM(bq.conversions) * 100.0 / NULLIF(SUM(bq.clicks), 0) &amp;gt; 5 THEN &apos;High Performer&apos;
    WHEN SUM(bq.conversions) * 100.0 / NULLIF(SUM(bq.clicks), 0) &amp;gt; 2 THEN &apos;Average&apos;
    ELSE &apos;Underperforming&apos;
  END AS campaign_grade
FROM &amp;quot;bigquery-marketing&amp;quot;.analytics.campaign_metrics bq
GROUP BY bq.campaign_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon), and &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt; to create business context for AI features.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on BigQuery Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The built-in AI Agent lets users ask questions in plain English: &amp;quot;Which campaigns had the highest conversion rate this quarter?&amp;quot; The Agent reads your wiki descriptions to understand what &amp;quot;conversion rate&amp;quot; and &amp;quot;high performer&amp;quot; mean, then generates accurate SQL.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude, ChatGPT, and other AI clients to your Dremio data. Setup:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A marketing executive can ask Claude &amp;quot;Compare our Q1 campaign performance against Q2 using the BigQuery data&amp;quot; and get governed, accurate results — no SQL required.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against BigQuery data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify campaign performance with AI
SELECT
  campaign_name,
  total_clicks,
  conversion_rate_pct,
  AI_CLASSIFY(
    &apos;Based on these marketing metrics, recommend a budget action&apos;,
    &apos;Campaign: &apos; || campaign_name || &apos;, Clicks: &apos; || CAST(total_clicks AS VARCHAR) || &apos;, Conversion Rate: &apos; || CAST(conversion_rate_pct AS VARCHAR) || &apos;%&apos;,
    ARRAY[&apos;Increase Budget&apos;, &apos;Maintain Budget&apos;, &apos;Decrease Budget&apos;, &apos;Pause Campaign&apos;]
  ) AS budget_recommendation
FROM analytics.gold.campaign_performance;

-- Generate executive summaries
SELECT
  campaign_name,
  AI_GENERATE(
    &apos;Write a brief performance summary for this marketing campaign&apos;,
    &apos;Campaign: &apos; || campaign_name || &apos;, Clicks: &apos; || CAST(total_clicks AS VARCHAR) || &apos;, Conversions: &apos; || CAST(total_conversions AS VARCHAR) || &apos;, Spend: $&apos; || CAST(total_ad_spend AS VARCHAR)
  ) AS performance_summary
FROM analytics.gold.campaign_performance
WHERE campaign_grade = &apos;High Performer&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; categorizes data with AI. &lt;code&gt;AI_GENERATE&lt;/code&gt; produces narrative text. Both run inside your SQL query.&lt;/p&gt;
&lt;h2&gt;Accelerate with Reflections&lt;/h2&gt;
&lt;p&gt;For dashboard queries that run repeatedly against BigQuery:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build a view over your BigQuery data&lt;/li&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt; and click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed metrics)&lt;/li&gt;
&lt;li&gt;Select columns and set the refresh interval&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Subsequent matching queries hit the Reflection instead of scanning BigQuery. This is particularly valuable for BigQuery&apos;s on-demand pricing, where every scan costs money. A dashboard with 10 widgets refreshing every 15 minutes would generate 960 BigQuery scans per day; with Reflections refreshing hourly, Dremio serves 936 of those from cache.&lt;/p&gt;
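&lt;p&gt;The same Reflection can be defined in SQL rather than the UI. A minimal sketch using the &lt;code&gt;campaign_performance&lt;/code&gt; view created earlier (per Dremio&apos;s Reflection DDL):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Raw Reflection: materializes the full view output so dashboard
-- widgets stop triggering per-TB BigQuery scans on every refresh
ALTER DATASET analytics.gold.campaign_performance
CREATE RAW REFLECTION campaign_performance_raw
USING DISPLAY (campaign_name, total_clicks, total_conversions,
               conversion_rate_pct, total_ad_spend, campaign_grade);
&lt;/code&gt;&lt;/pre&gt;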
&lt;h2&gt;Governance Across BigQuery and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) provides governance that works across BigQuery and every other source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask ad spend or conversion data from specific roles. A content creator sees campaign impressions but not revenue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional marketers see only campaigns in their territory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same rules apply to BigQuery, PostgreSQL, S3, and all connected sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools, AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
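&lt;p&gt;In SQL, these policies are UDFs attached to tables or views. A hedged sketch (the table, role, and function names below are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Row-level filter: regional marketers see only their territory;
-- members of marketing_admin see everything
CREATE FUNCTION region_filter (region VARCHAR)
RETURNS BOOLEAN
RETURN is_member(&apos;marketing_admin&apos;) OR is_member(CONCAT(&apos;region_&apos;, region));

ALTER TABLE marketing.campaigns
ADD ROW ACCESS POLICY region_filter(region);

-- Column mask: only the finance role sees ad spend
CREATE FUNCTION mask_spend (spend DOUBLE)
RETURNS DOUBLE
RETURN CASE WHEN is_member(&apos;finance&apos;) THEN spend ELSE NULL END;

ALTER TABLE marketing.campaigns
MODIFY COLUMN campaign_spend SET MASKING POLICY mask_spend(campaign_spend);
&lt;/code&gt;&lt;/pre&gt;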
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Ideal for Google Cloud environments — connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query BigQuery data from their IDE. Ask Copilot &amp;quot;Show me campaign conversion rates from BigQuery&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in BigQuery vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in BigQuery:&lt;/strong&gt; Data consumed by Google-native tools (Looker, Google Data Studio, Vertex AI), data pipelines managed by Cloud Dataflow or Dataproc, datasets with BigQuery ML models, data shared via BigQuery analytics hub.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical archive data, datasets queried by non-Google tools, data that benefits from Iceberg&apos;s time travel and automated compaction, workloads where BigQuery per-TB costs exceed value. Migrated Iceberg tables get Dremio&apos;s automatic maintenance and Autonomous Reflections.&lt;/p&gt;
&lt;p&gt;For data staying in BigQuery, create manual Reflections to eliminate per-TB scan costs for repeated queries.&lt;/p&gt;
&lt;h2&gt;BigQuery Cost Optimization with Dremio&lt;/h2&gt;
&lt;h3&gt;BigQuery Pricing Models&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;How It&apos;s Priced&lt;/th&gt;
&lt;th&gt;Dremio&apos;s Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On-Demand&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$6.25 per TB scanned&lt;/td&gt;
&lt;td&gt;Reflections eliminate repeat scans — 50-80% cost reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Editions (Standard/Enterprise/Enterprise Plus)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slot reservations (autoscaling)&lt;/td&gt;
&lt;td&gt;Reflections reduce slot utilization, enabling lower commitments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flat Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fixed slot reservations&lt;/td&gt;
&lt;td&gt;Reflections free up slots for other workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Google Analytics 4 (GA4) Integration&lt;/h3&gt;
&lt;p&gt;BigQuery is the default export destination for Google Analytics 4 data. GA4 exports create daily event tables (&lt;code&gt;events_YYYYMMDD&lt;/code&gt;) with nested schemas. Dremio handles this pattern:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query GA4 events from BigQuery through Dremio
SELECT
  event_name,
  COUNT(*) AS event_count,
  COUNT(DISTINCT user_pseudo_id) AS unique_users,
  DATE_TRUNC(&apos;day&apos;, TO_TIMESTAMP(event_timestamp / 1000000)) AS event_day -- GA4 event_timestamp is microseconds since epoch
FROM &amp;quot;bigquery-analytics&amp;quot;.analytics_12345678.events_*
WHERE event_name IN (&apos;page_view&apos;, &apos;purchase&apos;, &apos;add_to_cart&apos;)
GROUP BY 1, 4
ORDER BY event_day DESC, event_count DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By creating Reflections on GA4 views, you can serve frequently refreshed marketing dashboards without accumulating BigQuery scan costs.&lt;/p&gt;
&lt;h3&gt;Multi-Cloud Analytics Strategy&lt;/h3&gt;
&lt;p&gt;For organizations with data across Google Cloud, AWS, and Azure:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;BigQuery&lt;/strong&gt; holds your Google-native data (GA4, Google Ads, Cloud Storage exports)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;S3&lt;/strong&gt; holds your AWS data lake (application logs, IoT telemetry)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure Storage&lt;/strong&gt; holds your Microsoft ecosystem data (Power Platform exports, Azure services)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL/MySQL&lt;/strong&gt; hold operational application data&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dremio federates across all four clouds, applies unified governance, and serves all BI tools from a single connection. This eliminates the need for cross-cloud ETL pipelines.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;BigQuery users can break out of Google Cloud&apos;s walled garden, reduce per-TB scan costs with Reflections, and enable AI analytics across their entire data estate. Whether you&apos;re running a single BigQuery project or managing data across dozens of GCP projects alongside AWS and Azure resources, Dremio provides the federation layer that makes multi-cloud analytics practical.&lt;/p&gt;
&lt;p&gt;The combination of Reflections (eliminating repetitive per-TB charges), federation (joining BigQuery with non-Google sources without ETL), and AI capabilities (Agent, MCP Server, SQL Functions) transforms BigQuery from an isolated Google Cloud analytics tool into a connected node in your broader data ecosystem. Your marketing team asks the AI Agent questions about campaign performance and gets accurate answers drawn from BigQuery data enriched with context from your semantic layer.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-google-bigquery-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your BigQuery projects.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Amazon Redshift to Dremio Cloud: Extend Your Warehouse with Federation and AI Analytics</title><link>https://iceberglakehouse.com/posts/2026-03-connector-amazon-redshift/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-amazon-redshift/</guid><description>
Amazon Redshift is AWS&apos;s managed data warehouse, designed for petabyte-scale analytics. If your organization chose Redshift for analytical workloads,...</description><pubDate>Sun, 01 Mar 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Amazon Redshift is AWS&apos;s managed data warehouse, designed for petabyte-scale analytics. If your organization chose Redshift for analytical workloads, you&apos;ve built data pipelines, ETL jobs, and dashboards around it. But as data ecosystems grow, Redshift&apos;s limitations become painfully clear: connecting data outside Redshift requires ETL or Redshift Spectrum (additional cost per TB scanned), sharing Redshift data with non-AWS tools means exporting to S3, and Redshift&apos;s concurrency limits constrain how many dashboards and users can query simultaneously.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Redshift and queries it alongside every other data source in your organization. Instead of moving all your data into Redshift, or exporting Redshift data out, Dremio federates across sources and accelerates repeated queries with Reflections so your Redshift cluster handles less load.&lt;/p&gt;
&lt;p&gt;Redshift&apos;s concurrency scaling feature helps handle burst query volumes, but it bills the additional cluster time per second. By routing repeated dashboard queries through Dremio Reflections, you reduce the need for concurrency scaling entirely — cached results are served without any Redshift cluster involvement. This difference is particularly impactful for organizations running dozens of auto-refreshing dashboards.&lt;/p&gt;
&lt;h3&gt;Redshift Data Sharing vs. Dremio Federation&lt;/h3&gt;
&lt;p&gt;Redshift Data Sharing allows sharing data between Redshift clusters. But it only works within the Redshift ecosystem — you can&apos;t share Redshift data with Snowflake, BigQuery, or PostgreSQL through Data Sharing. Dremio&apos;s federation provides a broader solution: join Redshift data with any connected source. Data Sharing works for Redshift-to-Redshift use cases; Dremio handles everything else.&lt;/p&gt;
&lt;h3&gt;Redshift Serverless Consideration&lt;/h3&gt;
&lt;p&gt;With Redshift Serverless, you pay per RPU-second consumed. Every query, including repeated dashboard queries, consumes RPUs. Dremio Reflections eliminate RPU consumption for cached queries — a direct and measurable cost reduction. For Serverless users, the ROI from Reflections is immediately visible in the AWS billing dashboard.&lt;/p&gt;
&lt;p&gt;Redshift&apos;s RA3 instances introduced compute-storage separation using Managed Storage backed by S3. While this improved scalability, all queries still consume RA3 compute resources. Dremio provides a complementary compute layer: Reflections handle repetitive analytical workloads while RA3 focuses on the data transformations and ingestion pipelines that require Redshift&apos;s native capabilities. This architectural separation — Redshift for data engineering, Dremio for analytics serving — maximizes the value of both platforms.&lt;/p&gt;
&lt;h2&gt;Why Redshift Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Extend Redshift Without Spectrum Costs&lt;/h3&gt;
&lt;p&gt;Redshift Spectrum charges per TB scanned against S3. Dremio&apos;s federation queries S3 data directly through its own engine without per-TB charges. You still get SQL joins between Redshift and S3 data — Dremio handles the federation transparently.&lt;/p&gt;
&lt;h3&gt;Reduce Redshift Cluster Costs&lt;/h3&gt;
&lt;p&gt;Redshift pricing scales with cluster size (RA3, DC2, or Serverless credits). Analytical dashboards that run the same queries repeatedly consume cluster resources on every refresh. Dremio&apos;s Reflections serve cached results for matching queries, offloading that load from Redshift. For organizations with heavy dashboard workloads, this can reduce the Redshift cluster size needed.&lt;/p&gt;
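&lt;p&gt;A sketch of that offload pattern (view and column names are illustrative): build a Dremio view over the Redshift source, then attach an Aggregation Reflection so refresh traffic never reaches the cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.monthly_sales AS
SELECT
  DATE_TRUNC(&apos;month&apos;, sale_date) AS sale_month,
  product_category,
  revenue,
  customer_id
FROM &amp;quot;redshift-warehouse&amp;quot;.public.sales;

-- Dashboards grouping by month/category now hit the Reflection,
-- not the Redshift cluster
ALTER DATASET analytics.gold.monthly_sales
CREATE AGGREGATE REFLECTION monthly_sales_agg
USING
  DIMENSIONS (sale_month, product_category)
  MEASURES (revenue (SUM), customer_id (COUNT));
&lt;/code&gt;&lt;/pre&gt;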
&lt;h3&gt;Multi-Warehouse Federation&lt;/h3&gt;
&lt;p&gt;Your Redshift warehouse holds sales data, but your Snowflake instance has marketing data, your BigQuery project has Google Analytics data, and your PostgreSQL database has CRM data. Dremio federates across all four in a single query.&lt;/p&gt;
&lt;h3&gt;External Queries&lt;/h3&gt;
&lt;p&gt;Dremio supports external queries against Redshift, allowing you to run Redshift-native SQL (including Redshift-specific functions like &lt;code&gt;APPROXIMATE COUNT(DISTINCT)&lt;/code&gt;, window functions, and late binding views) through Dremio when needed.&lt;/p&gt;
&lt;h3&gt;AI Analytics&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s semantic layer, AI Agent, MCP Server, and AI SQL Functions add natural language querying and AI enrichment to Redshift data without building a separate BI layer.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Redshift cluster endpoint&lt;/strong&gt; (hostname) — from the Redshift console&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; — default &lt;code&gt;5439&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; — your Redshift database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; — Redshift database user with SELECT permissions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — Redshift cluster must be publicly accessible, or configure VPC peering with Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-amazon-redshift-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Redshift to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;Amazon Redshift&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;redshift-warehouse&lt;/code&gt; or &lt;code&gt;sales-analytics&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Your Redshift cluster endpoint (e.g., &lt;code&gt;mycluster.xxxx.us-east-1.redshift.amazonaws.com&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;5439&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; Your Redshift database name.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Master Credentials (username/password) or Secret Resource URL (AWS Secrets Manager).&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Redshift&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL/TLS&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query Redshift Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  date_trunc(&apos;month&apos;, sale_date) AS month,
  product_category,
  SUM(revenue) AS monthly_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers,
  ROUND(SUM(revenue) / COUNT(DISTINCT customer_id), 2) AS revenue_per_customer
FROM &amp;quot;redshift-warehouse&amp;quot;.public.sales
WHERE sale_date &amp;gt;= &apos;2024-01-01&apos;
GROUP BY 1, 2
ORDER BY 1, monthly_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;External Queries&lt;/h2&gt;
&lt;p&gt;Run Redshift-native SQL through Dremio when you need Redshift-specific functions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(
  &amp;quot;redshift-warehouse&amp;quot;.EXTERNAL_QUERY(
    &apos;SELECT TOP 100 querytxt, DATEDIFF(ms, starttime, endtime) AS elapsed_ms, starttime FROM stl_query WHERE starttime &amp;gt; GETDATE() - 7 ORDER BY elapsed_ms DESC&apos;
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate Redshift with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Redshift sales with PostgreSQL CRM and S3 marketing data
SELECT
  c.customer_name,
  c.segment,
  SUM(s.revenue) AS total_revenue,
  COUNT(s.sale_id) AS total_sales,
  m.campaign_name,
  m.attribution_channel,
  ROUND(SUM(s.revenue) / NULLIF(m.campaign_spend, 0), 2) AS roas
FROM &amp;quot;postgres-crm&amp;quot;.public.customers c
JOIN &amp;quot;redshift-warehouse&amp;quot;.public.sales s ON c.customer_id = s.customer_id
LEFT JOIN &amp;quot;s3-marketing&amp;quot;.attribution.customer_campaigns m ON c.customer_id = m.customer_id
WHERE s.sale_date &amp;gt;= &apos;2024-01-01&apos;
GROUP BY c.customer_name, c.segment, m.campaign_name, m.attribution_channel, m.campaign_spend
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.sales_performance AS
SELECT
  s.product_category,
  date_trunc(&apos;month&apos;, s.sale_date) AS month,
  SUM(s.revenue) AS revenue,
  COUNT(*) AS transactions,
  COUNT(DISTINCT s.customer_id) AS unique_buyers,
  ROUND(SUM(s.revenue) / COUNT(*), 2) AS avg_transaction_value,
  CASE
    WHEN SUM(s.revenue) &amp;gt; 500000 THEN &apos;Top Performer&apos;
    WHEN SUM(s.revenue) &amp;gt; 100000 THEN &apos;Solid&apos;
    ELSE &apos;Emerging&apos;
  END AS performance_tier
FROM &amp;quot;redshift-warehouse&amp;quot;.public.sales s
GROUP BY s.product_category, date_trunc(&apos;month&apos;, s.sale_date);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Redshift Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask &amp;quot;What were our top performing product categories last quarter?&amp;quot; and generates accurate SQL from your semantic layer. The wiki descriptions attached to views tell the Agent what &amp;quot;top performing&amp;quot; means in your data context.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your Redshift data through Dremio:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A VP of Sales asks Claude &amp;quot;Compare our Q1 revenue per customer across product categories using the Redshift data&amp;quot; and gets a governed, accurate benchmark without SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate strategic recommendations from sales data
SELECT
  product_category,
  revenue,
  performance_tier,
  AI_GENERATE(
    &apos;Write a strategic recommendation for this product category&apos;,
    &apos;Category: &apos; || product_category || &apos;, Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Tier: &apos; || performance_tier || &apos;, Avg Transaction: $&apos; || CAST(avg_transaction_value AS VARCHAR)
  ) AS strategic_recommendation
FROM analytics.gold.sales_performance
WHERE month = DATE_TRUNC(&apos;month&apos;, CURRENT_DATE - INTERVAL &apos;1&apos; MONTH);

-- Classify product categories for budget allocation
SELECT
  product_category,
  AI_CLASSIFY(
    &apos;Based on this sales performance, classify the marketing budget priority&apos;,
    &apos;Revenue: $&apos; || CAST(revenue AS VARCHAR) || &apos;, Customers: &apos; || CAST(unique_buyers AS VARCHAR) || &apos;, Avg Transaction: $&apos; || CAST(avg_transaction_value AS VARCHAR),
    ARRAY[&apos;Increase Investment&apos;, &apos;Maintain Investment&apos;, &apos;Optimize Spend&apos;, &apos;Reduce Budget&apos;]
  ) AS budget_recommendation
FROM analytics.gold.sales_performance
WHERE month &amp;gt;= DATE_TRUNC(&apos;month&apos;, CURRENT_DATE - INTERVAL &apos;3&apos; MONTH);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Accelerate with Reflections&lt;/h2&gt;
&lt;p&gt;Create Reflections on Redshift views for dashboard acceleration:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full dataset cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed SUM/COUNT/AVG)&lt;/li&gt;
&lt;li&gt;Select columns and aggregations&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — balance freshness against Redshift cluster load&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools connected via Arrow Flight get sub-second responses from Reflections instead of waiting for Redshift cluster processing. A Tableau dashboard refreshing every 15 minutes generates zero Redshift cluster load after the Reflection is built.&lt;/p&gt;
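&lt;p&gt;The steps above can also be expressed in SQL. A minimal sketch, assuming the &lt;code&gt;analytics.gold.sales_performance&lt;/code&gt; view from earlier and a hypothetical reflection name (the exact clause set can vary by Dremio version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Pre-aggregate monthly sales so dashboard queries never touch Redshift
ALTER DATASET analytics.gold.sales_performance
CREATE AGGREGATE REFLECTION agg_sales_by_month
USING
  DIMENSIONS (product_category, month, performance_tier)
  MEASURES (revenue (SUM), transactions (SUM), unique_buyers (SUM));
&lt;/code&gt;&lt;/pre&gt;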
&lt;h2&gt;Governance Across Redshift and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance capabilities that work uniformly across Redshift and every other connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask sensitive revenue data or PII from specific roles. A marketing analyst sees conversion rates but not individual customer records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data visibility based on user roles. Regional managers see only their region&apos;s sales data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; The same governance applies to Redshift, PostgreSQL, S3, and all other sources — no per-database security configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across all access methods: SQL Runner, BI tools, AI Agent, MCP Server, and Arrow Flight clients.&lt;/p&gt;
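&lt;p&gt;As a sketch of how such a policy is wired up in Dremio SQL — the function body and role name here are illustrative assumptions, not a prescribed configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Masking UDF: only members of the finance role see raw revenue values
CREATE FUNCTION mask_revenue (revenue DOUBLE)
RETURNS DOUBLE
RETURN SELECT CASE WHEN is_member(&apos;finance&apos;) THEN revenue ELSE NULL END;

-- Attach the masking policy to the view column
ALTER VIEW analytics.gold.sales_performance
MODIFY COLUMN revenue SET MASKING POLICY mask_revenue (revenue);
&lt;/code&gt;&lt;/pre&gt;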
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC for BI tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use Dremio&apos;s ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use &lt;code&gt;dbt-dremio&lt;/code&gt; for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer — whether the underlying data comes from Redshift or any other source.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration enables developers to query Redshift data from their IDE. Ask Copilot &amp;quot;Show me sales performance by category from Redshift&amp;quot; and it generates SQL using your semantic layer, eliminating context switching between tools.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Redshift vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Redshift:&lt;/strong&gt; Data actively used by Redshift-native tools and materializations, workloads with existing Redshift-based ETL pipelines, datasets managed by Redshift&apos;s automatic table optimization (sort keys, distribution styles).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical data and archives, datasets consumed primarily by non-Redshift tools, data where Redshift cluster costs exceed analytical value. Migrated Iceberg tables benefit from Dremio&apos;s automatic compaction, time travel, Autonomous Reflections, and zero per-query storage costs.&lt;/p&gt;
&lt;p&gt;For data that stays in Redshift, create manual Reflections to reduce cluster load. For migrated Iceberg data, Dremio handles optimization automatically.&lt;/p&gt;
&lt;h2&gt;Redshift Cost Optimization with Dremio&lt;/h2&gt;
&lt;h3&gt;Redshift Pricing Models&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;How It&apos;s Priced&lt;/th&gt;
&lt;th&gt;Dremio&apos;s Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RA3 Provisioned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-node-hour + managed storage&lt;/td&gt;
&lt;td&gt;Reflections reduce node utilization, enabling cluster downsizing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DC2 Provisioned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-node-hour, SSD storage included&lt;/td&gt;
&lt;td&gt;Same as RA3 — lower utilization means fewer nodes needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serverless&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per RPU-hour (compute consumed)&lt;/td&gt;
&lt;td&gt;Reflections eliminate RPU consumption for cached queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spectrum&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per TB scanned in S3&lt;/td&gt;
&lt;td&gt;Dremio queries S3 directly without per-TB charges&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Quantifying Savings&lt;/h3&gt;
&lt;p&gt;A typical dashboard workload might include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;20 production dashboards, each refreshing every 15 minutes&lt;/li&gt;
&lt;li&gt;50+ ad-hoc queries per day from analysts&lt;/li&gt;
&lt;li&gt;Weekly scheduled reports generating 100+ queries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With Dremio Reflections, only the Reflection refresh queries hit Redshift. If Reflections refresh hourly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dashboard queries drop from 1,920/day to 24/day (hourly Reflection refresh × 24 hours) — a &lt;strong&gt;98.7% reduction&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Ad-hoc queries matching Reflection patterns are served from cache — zero Redshift load&lt;/li&gt;
&lt;li&gt;Scheduled reports matching Reflections run instantly&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Migration Strategy&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Assess:&lt;/strong&gt; Identify Redshift tables by query frequency and size. High-frequency, read-heavy tables are prime candidates for Reflections.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accelerate:&lt;/strong&gt; Create Reflections on the 10-20 most-queried views. Monitor Redshift cluster utilization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Right-size:&lt;/strong&gt; As utilization drops, reduce Redshift node count or switch from Provisioned to Serverless.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migrate:&lt;/strong&gt; Move historical and archival data from Redshift to Iceberg tables. Use &lt;code&gt;CREATE TABLE ... AS SELECT&lt;/code&gt; in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimize:&lt;/strong&gt; Continue moving more tables as Redshift costs decrease and Dremio handles more workloads.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Redshift users can extend their warehouse with federation, reduce cluster costs with Reflections, add AI analytics, and apply unified governance across their entire data estate. Whether you run RA3 or DC2 Provisioned nodes or Serverless, Reflections cut compute costs by caching repetitive queries. Start by connecting your cluster and creating Reflections on your most-queried views.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-amazon-redshift-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Redshift cluster.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Azure Storage to Dremio Cloud: Query Your Microsoft Data Lake with SQL and AI</title><link>https://iceberglakehouse.com/posts/2026-03-connector-azure-storage/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-azure-storage/</guid><description>
Azure Storage is Microsoft&apos;s cloud storage platform, spanning Blob Storage, Azure Data Lake Storage Gen2 (ADLS Gen2), and Azure Files. If your organi...</description><pubDate>Sun, 01 Mar 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Azure Storage is Microsoft&apos;s cloud storage platform, spanning Blob Storage, Azure Data Lake Storage Gen2 (ADLS Gen2), and Azure Files. If your organization uses Microsoft Azure, your data lake almost certainly lives in Azure Storage — Parquet files from Azure Data Factory pipelines, CSV exports from Azure SQL Database, JSON event streams from Azure Event Hubs, and raw data from Azure IoT Hub all land in Azure Storage containers.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to Azure Storage and lets you query these files in place using standard SQL. You don&apos;t need Azure Synapse Analytics (DWU-based pricing), Azure Databricks (DBU costs), or HDInsight (cluster management) to run analytical queries against your data lake. Dremio reads the data, accelerates repeated queries with Reflections, and federates Azure Storage with every other source in your data ecosystem.&lt;/p&gt;
&lt;p&gt;Many Azure customers face a fragmented analytics experience: Synapse for warehouse workloads, Databricks for data engineering, Power BI for visualization, and Azure Data Explorer for log analytics — each with its own pricing model, access control, and query interface. Dremio consolidates the analytical layer by querying Azure Storage and other Azure (or non-Azure) services from a single SQL engine with unified governance and AI capabilities.&lt;/p&gt;
&lt;p&gt;Dremio reads Parquet, CSV, JSON, Delta Lake, and Apache Iceberg table formats from Azure Blob Storage and ADLS Gen2 containers. It pushes projection and filtering into its vectorized query engine and caches frequently accessed data on local NVMe drives (Columnar Cloud Cache, or C3) for near-instantaneous repeat queries.&lt;/p&gt;
&lt;h2&gt;Why Azure Storage Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;SQL Without Azure Synapse Costs&lt;/h3&gt;
&lt;p&gt;Azure Synapse serverless SQL charges per terabyte of data processed. For large datasets queried frequently — dashboard refreshes, ad-hoc exploration, scheduled reports — costs accumulate quickly. Dremio&apos;s Reflections eliminate repeat scans by caching pre-computed results. C3 caching further reduces Azure Storage API calls for frequently accessed files. Your first query scans Azure Storage; subsequent matching queries hit Dremio&apos;s cache.&lt;/p&gt;
&lt;h3&gt;Federation Beyond Azure&lt;/h3&gt;
&lt;p&gt;Your Azure data lake holds event data and ETL outputs, but your operational database is in PostgreSQL on AWS, your marketing data is in Google BigQuery, and your CRM is in Salesforce (exported to S3). Dremio federates across all three cloud providers in a single SQL query — no ADF (Azure Data Factory) pipelines needed.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg Table Management&lt;/h3&gt;
&lt;p&gt;Create Iceberg tables backed by Azure Storage (or Dremio-managed storage) with full DML support (INSERT, UPDATE, DELETE, MERGE). Dremio automatically handles compaction, manifest rewriting, clustering, and vacuuming. No manual &lt;code&gt;OPTIMIZE&lt;/code&gt; jobs, no maintenance scripts.&lt;/p&gt;
&lt;h3&gt;AI on Azure Data&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions make your Azure data queryable by non-technical users and external AI tools. Build a semantic layer over your Azure files, and let AI do the heavy lifting.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Azure Storage Account&lt;/strong&gt; with Blob Storage or ADLS Gen2 enabled&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; Azure Active Directory OAuth 2.0, Shared Access Key, or Shared Access Signature (SAS) Token&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Container names&lt;/strong&gt; you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; from Dremio Cloud to Azure Storage endpoints&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-azure-storage-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Azure Storage to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;Azure Storage&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;azure-datalake&lt;/code&gt; or &lt;code&gt;adls-analytics&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Account:&lt;/strong&gt; Your Azure Storage account name.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Choose from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Azure AD (OAuth 2.0):&lt;/strong&gt; Most secure, uses service principal or managed identity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared Access Key:&lt;/strong&gt; Full access to the storage account. Simpler but less granular.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SAS Token:&lt;/strong&gt; Scoped, time-limited access.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Root Path&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starting container/path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/&lt;/code&gt; (all containers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CTAS Format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Default CREATE TABLE format&lt;/td&gt;
&lt;td&gt;Iceberg recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable HTTPS&lt;/td&gt;
&lt;td&gt;On&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable partition column inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extract partition keys from folder structures&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable file status check&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verify file existence before reads&lt;/td&gt;
&lt;td&gt;On&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query Azure Storage Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query Parquet files directly
SELECT transaction_id, customer_id, amount, transaction_date
FROM &amp;quot;azure-datalake&amp;quot;.sales.&amp;quot;transactions.parquet&amp;quot;
WHERE transaction_date &amp;gt;= &apos;2024-01-01&apos; AND amount &amp;gt; 100
ORDER BY amount DESC;

-- Query partitioned data (Hive-style partitions)
SELECT region, product_category, SUM(revenue) AS total_revenue
FROM &amp;quot;azure-datalake&amp;quot;.sales.transactions
WHERE year = &apos;2024&apos; AND quarter = &apos;Q1&apos;
GROUP BY region, product_category
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate with Other Clouds&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Azure data with AWS and Google Cloud sources
SELECT
  c.customer_name,
  c.segment,
  SUM(a.amount) AS azure_revenue,
  COUNT(s.event_id) AS aws_events,
  bq.campaign_clicks
FROM &amp;quot;postgres-crm&amp;quot;.public.customers c
LEFT JOIN &amp;quot;azure-datalake&amp;quot;.sales.transactions a ON c.customer_id = a.customer_id
LEFT JOIN &amp;quot;s3-events&amp;quot;.analytics.user_events s ON c.customer_id = s.user_id
LEFT JOIN &amp;quot;bigquery-marketing&amp;quot;.analytics.customer_clicks bq ON c.customer_id = bq.user_id
GROUP BY c.customer_name, c.segment, bq.campaign_clicks
ORDER BY azure_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Three clouds (AWS, Azure, and Google Cloud), four sources, one query.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_transactions AS
SELECT
  a.customer_id,
  a.transaction_date,
  a.amount,
  CASE
    WHEN a.amount &amp;gt; 1000 THEN &apos;High Value&apos;
    WHEN a.amount &amp;gt; 100 THEN &apos;Standard&apos;
    ELSE &apos;Micro&apos;
  END AS transaction_tier,
  DATE_TRUNC(&apos;month&apos;, a.transaction_date) AS transaction_month
FROM &amp;quot;azure-datalake&amp;quot;.sales.transactions a
WHERE a.transaction_date &amp;gt;= &apos;2024-01-01&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on Azure Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Ask questions in plain English: &amp;quot;What&apos;s our total revenue from high-value transactions this quarter?&amp;quot; The AI Agent reads your wiki descriptions and generates accurate SQL against your Azure data.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your Azure data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;An operations team member can ask Claude &amp;quot;Show me a summary of our Azure sales data by region this month&amp;quot; — no SQL required.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify transactions with AI
SELECT
  transaction_id,
  amount,
  AI_CLASSIFY(
    &apos;Based on this transaction, classify the likely purchase category&apos;,
    &apos;Amount: $&apos; || CAST(amount AS VARCHAR) || &apos;, Date: &apos; || CAST(transaction_date AS VARCHAR),
    ARRAY[&apos;Subscription&apos;, &apos;One-Time Purchase&apos;, &apos;Refund&apos;, &apos;Upgrade&apos;]
  ) AS inferred_category
FROM &amp;quot;azure-datalake&amp;quot;.sales.transactions
WHERE transaction_date = CURRENT_DATE;

-- Generate data quality summaries
SELECT
  transaction_month,
  COUNT(*) AS total_transactions,
  AI_GENERATE(
    &apos;Write a one-sentence summary of data quality for this month&apos;,
    &apos;Transactions: &apos; || CAST(COUNT(*) AS VARCHAR) || &apos;, Avg Amount: $&apos; || CAST(ROUND(AVG(amount), 2) AS VARCHAR) || &apos;, Nulls: &apos; || CAST(SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS VARCHAR)
  ) AS quality_summary
FROM analytics.gold.customer_transactions
GROUP BY transaction_month;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Create Iceberg Tables from Azure Data&lt;/h2&gt;
&lt;p&gt;Promote raw Azure files into managed Iceberg tables with full ACID transaction support:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE analytics.bronze.azure_events AS
SELECT event_type, user_id, CAST(event_timestamp AS TIMESTAMP) AS event_time, payload
FROM &amp;quot;azure-datalake&amp;quot;.events.&amp;quot;raw_events.parquet&amp;quot;
WHERE event_type IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Iceberg tables benefit from automatic compaction, time travel, results caching, and Autonomous Reflections. You can also use time travel to query historical states:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query the table as it existed at a past point in time
SELECT * FROM analytics.bronze.azure_events
AT TIMESTAMP &apos;2024-06-01 00:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Governance on Azure Data&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance capabilities that Azure Storage doesn&apos;t provide natively:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask PII fields (email, IP address, user ID) from specific roles. Marketing analysts see aggregated metrics but not individual user data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Automatically filter data by the querying user&apos;s role. A regional manager sees only their region&apos;s Azure data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies whether data comes from Azure Storage, PostgreSQL, BigQuery, or any other connected source.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (via Arrow Flight/ODBC), AI Agent queries, and MCP Server interactions.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Arrow Flight connector provides 10-100x faster data transfer than JDBC/ODBC. After building views over your Azure data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Use Dremio&apos;s native connector or ODBC driver — ideal for Azure-centric organizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Use the Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; Use &lt;code&gt;pyarrow.flight&lt;/code&gt; for high-speed data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; Use &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for SQL-based transformations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Azure data directly from their IDE. Ask Copilot &amp;quot;Show me daily transaction trends from Azure storage&amp;quot; and it generates SQL using your semantic layer — without leaving your development environment.&lt;/p&gt;
&lt;h2&gt;Reflections and C3 Caching&lt;/h2&gt;
&lt;p&gt;For frequently queried Azure Storage data, create Reflections to pre-compute results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the Catalog&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab and create a Raw or Aggregation Reflection&lt;/li&gt;
&lt;li&gt;Select columns and set the refresh interval&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;C3 (Columnar Cloud Cache) automatically caches frequently accessed file data on local NVMe drives for sub-second access. You don&apos;t configure C3 manually — it works transparently.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Azure Storage vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep as raw files:&lt;/strong&gt; Data landing zones for Azure Data Factory, files consumed by Azure-native services (Databricks, Synapse, Azure ML), raw data in formats required by other tools.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg tables:&lt;/strong&gt; Analytical datasets consumed by SQL queries, data that benefits from ACID transactions and time travel, historical data needing snapshot management, datasets consumed by BI tools and AI agents.&lt;/p&gt;
&lt;p&gt;For raw Azure files, query through the connector and create manual Reflections. For Iceberg tables (either in Dremio&apos;s Open Catalog or external catalogs), Dremio provides automated compaction, Autonomous Reflections, and zero-maintenance performance optimization.&lt;/p&gt;
&lt;h2&gt;Azure Storage Tiers and Dremio Performance&lt;/h2&gt;
&lt;p&gt;Azure Storage offers multiple access tiers that affect query performance:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Access Latency&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Dremio Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Highest storage, lowest access&lt;/td&gt;
&lt;td&gt;Active analytics data — best performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Lower storage, higher access&lt;/td&gt;
&lt;td&gt;Infrequent queries — still fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cold&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Milliseconds&lt;/td&gt;
&lt;td&gt;Even lower storage, higher access&lt;/td&gt;
&lt;td&gt;Archival analytics — acceptable latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Archive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hours (rehydrate required)&lt;/td&gt;
&lt;td&gt;Lowest storage, highest access&lt;/td&gt;
&lt;td&gt;Not suitable for Dremio queries — rehydrate first&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For optimal Dremio performance, keep analytical data in Hot or Cool tiers. Use Azure lifecycle management policies to automatically transition data between tiers based on last access time.&lt;/p&gt;
&lt;h2&gt;ADLS Gen2 vs. Blob Storage&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Azure Storage connector supports both Azure Data Lake Storage Gen2 (ADLS Gen2) and Azure Blob Storage:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ADLS Gen2&lt;/strong&gt; is the recommended option for analytical workloads:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hierarchical namespace enables true directory operations (faster metadata operations)&lt;/li&gt;
&lt;li&gt;Fine-grained POSIX-like permissions for directory and file-level access&lt;/li&gt;
&lt;li&gt;Optimized for large-scale analytics workloads&lt;/li&gt;
&lt;li&gt;Required for Iceberg table creation and Azure Synapse integration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Azure Blob Storage&lt;/strong&gt; works for read-only access to existing file datasets but lacks hierarchical namespace features.&lt;/p&gt;
&lt;p&gt;When creating your Azure Storage source in Dremio, specify the storage account and container. For ADLS Gen2 accounts, Dremio automatically uses the &lt;code&gt;abfss://&lt;/code&gt; protocol for optimized access.&lt;/p&gt;
&lt;h2&gt;Azure-Specific Integration Patterns&lt;/h2&gt;
&lt;h3&gt;Azure Data Factory + Dremio&lt;/h3&gt;
&lt;p&gt;Azure Data Factory (ADF) lands data into Azure Storage containers. Dremio queries this data in place:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;ADF pipelines extract from Azure SQL, Cosmos DB, or external APIs&lt;/li&gt;
&lt;li&gt;ADF writes Parquet files to ADLS Gen2 containers&lt;/li&gt;
&lt;li&gt;Dremio queries the Parquet files via the Azure Storage connector&lt;/li&gt;
&lt;li&gt;Dremio creates Iceberg tables from the Parquet data for optimized analytics&lt;/li&gt;
&lt;/ol&gt;
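&lt;p&gt;Step 4 might look like the following sketch (the source name &lt;code&gt;adls-datalake&lt;/code&gt; and paths are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Promote ADF-landed Parquet into a managed Iceberg table
-- (source, schema, and column names are illustrative)
CREATE TABLE analytics.bronze.orders AS
SELECT order_id, customer_id, CAST(order_date AS DATE) AS order_date, amount
FROM &amp;quot;adls-datalake&amp;quot;.landing.&amp;quot;orders.parquet&amp;quot;;
&lt;/code&gt;&lt;/pre&gt;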
&lt;h3&gt;Azure Synapse + Dremio + Azure Storage&lt;/h3&gt;
&lt;p&gt;Connect both Azure Synapse and Azure Storage to Dremio Cloud. Dremio federates data across both:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Synapse contains summarized, modeled data&lt;/li&gt;
&lt;li&gt;Azure Storage contains raw files and Iceberg tables&lt;/li&gt;
&lt;li&gt;Dremio joins both sources in a single query&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This eliminates the need to load all Azure Storage data into Synapse, reducing Synapse DWU consumption and costs.&lt;/p&gt;
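&lt;p&gt;A federated query under this pattern might look like this sketch (source and table names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join modeled Synapse data with raw events in Azure Storage
-- (source names &amp;quot;synapse-dw&amp;quot; and &amp;quot;adls-datalake&amp;quot; are illustrative)
SELECT
  s.fiscal_month,
  s.modeled_revenue,
  SUM(r.amount) AS raw_event_revenue
FROM &amp;quot;synapse-dw&amp;quot;.dbo.monthly_revenue s
LEFT JOIN &amp;quot;adls-datalake&amp;quot;.events.purchases r
  ON s.fiscal_month = DATE_TRUNC(&apos;month&apos;, r.event_date)
GROUP BY s.fiscal_month, s.modeled_revenue;
&lt;/code&gt;&lt;/pre&gt;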
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Azure Storage users can query their cloud data lake with SQL, federate with other sources, build a semantic layer, and enable AI analytics — all without data movement or ETL pipelines.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-azure-storage-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Azure Storage accounts alongside your other data sources.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Amazon S3 to Dremio Cloud: Query Your Data Lake with SQL, Federation, and AI</title><link>https://iceberglakehouse.com/posts/2026-03-connector-amazon-s3/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-amazon-s3/</guid><description>
Amazon S3 is the default landing zone for data in the cloud. Log files, Parquet datasets, CSV exports, JSON events, IoT telemetry, and raw data dumps...</description><pubDate>Sun, 01 Mar 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Amazon S3 is the default landing zone for data in the cloud. Log files, Parquet datasets, CSV exports, JSON events, IoT telemetry, and raw data dumps — it all ends up in S3 buckets. But S3 is storage, not an analytics engine. You can&apos;t run SQL against S3 natively. To query it, you need Amazon Athena (per-TB pricing), AWS Glue ETL jobs (cluster management), or a data warehouse that imports the data. All add cost, complexity, and latency.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to S3 and lets you query files in place using standard SQL. Dremio reads Parquet, CSV, JSON, Delta Lake, and Apache Iceberg table formats. It pushes projection and filter operations into its vectorized query engine and caches frequently accessed data on local NVMe drives (Columnar Cloud Cache, or C3) for near-instantaneous repeat queries.&lt;/p&gt;
&lt;p&gt;For organizations with hundreds or thousands of S3 buckets accumulated over years, data lake sprawl is a major challenge. Data lands in S3 from application logs, CDC pipelines, third-party integrations, and manual uploads — often without consistent naming conventions, schemas, or documentation. Dremio provides the organizational layer: connect S3 buckets, create views that standardize column names and types, build a semantic layer with wiki descriptions, and expose clean datasets to analysts and AI tools. This turns an unstructured &amp;quot;data swamp&amp;quot; into a governed, queryable data lake.&lt;/p&gt;
&lt;h2&gt;Why S3 Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;SQL on Your Data Lake Without Athena Costs&lt;/h3&gt;
&lt;p&gt;Athena charges per terabyte of data scanned. For large datasets queried frequently — dashboards refreshing every 15 minutes, analysts exploring data, scheduled reports — costs grow unpredictably. Dremio&apos;s Reflections pre-compute results so repeated queries don&apos;t re-scan S3. C3 caching further reduces S3 GET requests. You pay for Dremio compute time, not per-TB scanned.&lt;/p&gt;
&lt;h3&gt;Format Flexibility&lt;/h3&gt;
&lt;p&gt;Dremio reads Parquet, CSV, JSON, Avro, Delta Lake, and Apache Iceberg from S3. You don&apos;t need to convert everything to one format before querying. Mixed-format data lakes work out of the box.&lt;/p&gt;
&lt;h3&gt;Federation with Databases and Warehouses&lt;/h3&gt;
&lt;p&gt;Your event data is in S3, but your customer data is in PostgreSQL, your financial data is in Snowflake, and your marketing data is in BigQuery. Dremio joins across all of them in a single SQL query without copying data between systems.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg Table Management&lt;/h3&gt;
&lt;p&gt;Create Iceberg tables in Dremio&apos;s Open Catalog (backed by S3 or Dremio-managed storage) with full DML support. Dremio automatically handles compaction (merging small files), manifest rewriting, clustering, and vacuuming — no manual &lt;code&gt;OPTIMIZE&lt;/code&gt; jobs needed.&lt;/p&gt;
&lt;h3&gt;AI on S3 Data&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions make your raw S3 files queryable by business users and external AI tools — no data engineering required.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWS Account&lt;/strong&gt; with S3 access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IAM Role or Access Key/Secret Key&lt;/strong&gt; with &lt;code&gt;s3:GetObject&lt;/code&gt;, &lt;code&gt;s3:ListBucket&lt;/code&gt;, and &lt;code&gt;s3:GetBucketLocation&lt;/code&gt; permissions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bucket names&lt;/strong&gt; or specific paths you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-amazon-s3-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect S3 to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;Amazon S3&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;s3-datalake&lt;/code&gt; or &lt;code&gt;event-logs&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; IAM Role ARN (recommended) or Access Key/Secret Key.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Root Path&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Starting path in the bucket&lt;/td&gt;
&lt;td&gt;Restrict to subfolder: &lt;code&gt;/data/analytics/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Allowlisted Buckets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limit which buckets appear&lt;/td&gt;
&lt;td&gt;Multi-bucket accounts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable partition column inference&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extract partition keys from folders&lt;/td&gt;
&lt;td&gt;Hive-style partitioned data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default CTAS Format&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CREATE TABLE format&lt;/td&gt;
&lt;td&gt;Iceberg recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL&lt;/td&gt;
&lt;td&gt;Always recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requester Pays&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;For requester-pays buckets&lt;/td&gt;
&lt;td&gt;Cross-account access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable compatibility mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3-compatible storage&lt;/td&gt;
&lt;td&gt;MinIO, R2, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom settings&lt;/td&gt;
&lt;td&gt;&lt;code&gt;fs.s3a.endpoint&lt;/code&gt; for non-AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;4. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query S3 Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query Parquet files
SELECT event_type, user_id, event_timestamp, page_url
FROM &amp;quot;s3-datalake&amp;quot;.events.&amp;quot;user_events.parquet&amp;quot;
WHERE event_type = &apos;purchase&apos; AND event_timestamp &amp;gt; &apos;2024-01-01&apos;;

-- Query partitioned data (e.g., year=2024/month=01/)
SELECT region, product_category, SUM(revenue) AS total_revenue
FROM &amp;quot;s3-datalake&amp;quot;.sales.transactions
WHERE &amp;quot;year&amp;quot; = 2024 AND &amp;quot;month&amp;quot; = 1 -- partition filter: assuming inferred numeric partition columns, only matching folders are scanned
GROUP BY region, product_category
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate S3 with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  c.customer_name,
  c.segment,
  COUNT(e.event_id) AS s3_events,
  SUM(CASE WHEN e.event_type = &apos;purchase&apos; THEN e.revenue ELSE 0 END) AS s3_revenue,
  pg.lifetime_value AS crm_lifetime_value
FROM &amp;quot;postgres-crm&amp;quot;.public.customers c
LEFT JOIN &amp;quot;s3-datalake&amp;quot;.events.user_events e ON c.customer_id = e.user_id
LEFT JOIN &amp;quot;postgres-crm&amp;quot;.public.customer_metrics pg ON c.customer_id = pg.customer_id
GROUP BY c.customer_name, c.segment, pg.lifetime_value
ORDER BY s3_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Create Iceberg Tables from S3 Data&lt;/h2&gt;
&lt;p&gt;Promote raw S3 files into managed Iceberg tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE analytics.bronze.clean_events AS
SELECT event_type, user_id, CAST(event_timestamp AS TIMESTAMP) AS event_time, page_url, revenue
FROM &amp;quot;s3-datalake&amp;quot;.events.&amp;quot;user_events.parquet&amp;quot;
WHERE event_type IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The Iceberg table benefits from automatic compaction, time travel, results caching, and Autonomous Reflections.&lt;/p&gt;
&lt;h2&gt;S3-Compatible Storage&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s S3 connector works with S3-compatible storage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MinIO:&lt;/strong&gt; Enable compatibility mode, set &lt;code&gt;fs.s3a.endpoint&lt;/code&gt; to your MinIO endpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloudflare R2:&lt;/strong&gt; Same pattern, with R2&apos;s S3-compatible endpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DigitalOcean Spaces:&lt;/strong&gt; Compatibility mode + custom endpoint.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon FSx for NetApp ONTAP:&lt;/strong&gt; Set the S3 Access Point alias as the root path, ensure IAM permissions include FSx-specific actions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.event_metrics AS
SELECT
  DATE_TRUNC(&apos;day&apos;, CAST(event_timestamp AS TIMESTAMP)) AS event_date,
  event_type,
  COUNT(*) AS event_count,
  COUNT(DISTINCT user_id) AS unique_users,
  SUM(revenue) AS daily_revenue,
  CASE
    WHEN COUNT(*) &amp;gt; 10000 THEN &apos;High Activity&apos;
    WHEN COUNT(*) &amp;gt; 1000 THEN &apos;Normal Activity&apos;
    ELSE &apos;Low Activity&apos;
  END AS activity_level
FROM &amp;quot;s3-datalake&amp;quot;.events.user_events
GROUP BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on S3 Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;Ask &amp;quot;What&apos;s our daily purchase revenue trend this month?&amp;quot; and the AI Agent generates SQL from your semantic layer. The wiki descriptions guide the Agent&apos;s understanding of event types and metrics.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your S3 data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A product analyst asks Claude &amp;quot;Analyze user engagement patterns from S3 event data this week&amp;quot; and gets governed results.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify events with AI
SELECT
  event_type,
  event_count,
  AI_CLASSIFY(
    &apos;Based on this event pattern, classify the business impact&apos;,
    &apos;Event: &apos; || event_type || &apos;, Count: &apos; || CAST(event_count AS VARCHAR) || &apos;, Revenue: $&apos; || CAST(daily_revenue AS VARCHAR),
    ARRAY[&apos;Revenue Driver&apos;, &apos;Engagement Signal&apos;, &apos;Support Indicator&apos;, &apos;Churn Signal&apos;]
  ) AS business_impact
FROM analytics.gold.event_metrics
WHERE event_date = CURRENT_DATE - INTERVAL &apos;1&apos; DAY;

-- Process unstructured data from S3
SELECT
  file[&apos;path&apos;] AS file_path,
  AI_GENERATE(
    &apos;Extract key information from this document&apos;,
    (&apos;Summarize the main topics in this file&apos;, file)
    WITH SCHEMA ROW(summary VARCHAR, category VARCHAR)
  ) AS extracted_info
FROM TABLE(LIST_FILES(&apos;@&amp;quot;s3-datalake&amp;quot;/documents/&apos;))
WHERE file[&apos;path&apos;] LIKE &apos;%.pdf&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_GENERATE&lt;/code&gt; with file references can process unstructured documents (PDFs, images) stored in S3 directly in SQL queries.&lt;/p&gt;
&lt;h2&gt;Reflections and C3 Caching&lt;/h2&gt;
&lt;p&gt;For frequently queried S3 data, Dremio provides two layers of acceleration:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reflections&lt;/strong&gt; pre-compute query results. Create them on your semantic layer views:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, navigate to the view&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose Raw or Aggregation Reflections&lt;/li&gt;
&lt;li&gt;Select columns and set the refresh interval&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;C3 (Columnar Cloud Cache)&lt;/strong&gt; automatically caches frequently accessed file data on local NVMe drives. C3 works transparently — no configuration needed. When Dremio reads S3 files, it caches the columnar data locally. Subsequent reads of the same files come from NVMe instead of S3, eliminating S3 GET request costs and latency.&lt;/p&gt;
&lt;p&gt;Together, Reflections and C3 mean that frequently executed queries against S3 data run in milliseconds, not seconds.&lt;/p&gt;
&lt;h2&gt;Governance on S3 Data&lt;/h2&gt;
&lt;p&gt;S3 has bucket-level IAM policies, but no column-level masking or row-level filtering. Dremio&apos;s Fine-Grained Access Control (FGAC) adds these capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask PII fields (email, IP, user ID) from specific roles. Data engineers see everything; marketing analysts see aggregated metrics only.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by user role. Regional analysts see only their region&apos;s events.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across S3, PostgreSQL, Snowflake, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
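&lt;p&gt;Column masking in Dremio is typically expressed as a UDF applied as a masking policy. A sketch (function, role, and column names are illustrative; check your Dremio version for exact syntax):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Show raw emails only to the data engineering role
-- (role name &apos;data-engineers&apos; is illustrative)
CREATE FUNCTION protect_email (email VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE
  WHEN is_member(&apos;data-engineers&apos;) THEN email
  ELSE &apos;***masked***&apos;
END;

ALTER TABLE &amp;quot;s3-datalake&amp;quot;.events.user_events
MODIFY COLUMN email SET MASKING POLICY protect_email (email);
&lt;/code&gt;&lt;/pre&gt;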
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio supports Arrow Flight (10-100x faster data transfer than JDBC/ODBC) alongside standard ODBC/JDBC drivers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot lets developers query S3 data from their IDE. Ask Copilot &amp;quot;Show me purchase event trends from S3 data this week&amp;quot; and get SQL generated using your semantic layer.&lt;/p&gt;
&lt;h2&gt;S3 Data Organization Best Practices&lt;/h2&gt;
&lt;p&gt;How you organize data in S3 directly impacts Dremio&apos;s query performance:&lt;/p&gt;
&lt;h3&gt;Partition Strategy&lt;/h3&gt;
&lt;p&gt;Hive-style partitions (&lt;code&gt;year=2024/month=01/day=15/&lt;/code&gt;) enable Dremio to skip irrelevant partitions during query planning. The right partition key depends on your query patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time-based queries:&lt;/strong&gt; Partition by &lt;code&gt;year/month/day&lt;/code&gt; or &lt;code&gt;year/month&lt;/code&gt;. Dremio reads only the partitions matching your &lt;code&gt;WHERE&lt;/code&gt; clause.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regional queries:&lt;/strong&gt; Partition by &lt;code&gt;region/date&lt;/code&gt; for multi-region datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mixed access:&lt;/strong&gt; Partition by the most common filter column first (e.g., &lt;code&gt;region/year/month&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Avoid over-partitioning (too many small files per partition) or under-partitioning (too few partitions with huge files). Aim for partition sizes between 128 MB and 1 GB.&lt;/p&gt;
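&lt;p&gt;With Hive-style partitions and partition column inference enabled, a filtered query reads only the matching folders. A sketch (dataset path and partition column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Only the year=2024/month=06 folders are scanned during planning
SELECT event_type, COUNT(*) AS events
FROM &amp;quot;s3-datalake&amp;quot;.events.clickstream
WHERE &amp;quot;year&amp;quot; = 2024 AND &amp;quot;month&amp;quot; = 6
GROUP BY event_type;
&lt;/code&gt;&lt;/pre&gt;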
&lt;h3&gt;File Format Recommendations&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Dremio Support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parquet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured analytics data&lt;/td&gt;
&lt;td&gt;Full support, columnar optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ACID transactions, time travel&lt;/td&gt;
&lt;td&gt;Full read/write support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Databricks ecosystem compatibility&lt;/td&gt;
&lt;td&gt;Read support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semi-structured event data&lt;/td&gt;
&lt;td&gt;Full support, schema inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CSV&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Legacy data imports&lt;/td&gt;
&lt;td&gt;Full support, limited performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Avro&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schema-evolved event streams&lt;/td&gt;
&lt;td&gt;Read support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For analytical workloads, convert CSV and JSON files to Parquet or Iceberg for 10-50x better query performance. Dremio can perform this conversion:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE analytics.bronze.events_optimized AS
SELECT * FROM &amp;quot;s3-datalake&amp;quot;.raw.&amp;quot;events.csv&amp;quot;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates an Iceberg table from CSV data, giving you columnar storage, automatic compaction, and time travel.&lt;/p&gt;
&lt;h3&gt;Data Lake Layers&lt;/h3&gt;
&lt;p&gt;Organize your S3 bucket with a medallion architecture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;raw/&lt;/code&gt;&lt;/strong&gt; — Landing zone for incoming data (CSV, JSON, Parquet from external sources)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;bronze/&lt;/code&gt;&lt;/strong&gt; — Cleaned, typed versions of raw data (Iceberg tables)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;silver/&lt;/code&gt;&lt;/strong&gt; — Joined, deduplicated, enriched datasets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;gold/&lt;/code&gt;&lt;/strong&gt; — Business-ready views and aggregations for the semantic layer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio&apos;s SQL engine handles the transformations between layers using &lt;code&gt;CREATE TABLE AS SELECT&lt;/code&gt; and &lt;code&gt;MERGE&lt;/code&gt; statements — no external ETL tools needed.&lt;/p&gt;
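&lt;p&gt;A bronze-to-silver incremental load with &lt;code&gt;MERGE&lt;/code&gt; might look like this sketch (table and column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Upsert new bronze events into the deduplicated silver table
MERGE INTO analytics.silver.events t
USING analytics.bronze.clean_events s
ON t.user_id = s.user_id AND t.event_time = s.event_time
WHEN MATCHED THEN UPDATE SET revenue = s.revenue
WHEN NOT MATCHED THEN
  -- values listed in the silver table&apos;s column order
  INSERT VALUES (s.event_type, s.user_id, s.event_time, s.page_url, s.revenue);
&lt;/code&gt;&lt;/pre&gt;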
&lt;h2&gt;When to Use S3 vs. Other Storage&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Use S3 when:&lt;/strong&gt; Your data originates in AWS, you need cost-effective long-term storage, you want to use Apache Iceberg tables, your data is in file formats (Parquet, JSON, CSV).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use managed databases when:&lt;/strong&gt; Your data requires real-time OLTP operations, your applications need row-level transactions, your data model is heavily relational.&lt;/p&gt;
&lt;p&gt;Dremio federates across both — S3 for your data lake and databases for operational data, in a single query.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Amazon S3 is the most common data lake storage layer. Dremio Cloud turns it into a queryable, federated, AI-ready analytics platform without Athena costs or data warehouse ETL. Whether your S3 data is in Parquet, CSV, JSON, or Iceberg format, Dremio reads it directly and makes it available for SQL queries, cross-source joins, and AI-powered analytics.&lt;/p&gt;
&lt;p&gt;Start by connecting your primary S3 bucket to Dremio Cloud. Create views that standardize your data into business-friendly structures, add wiki descriptions for the AI Agent, and build Reflections on frequently accessed datasets. Within hours, your S3 data lake transforms from raw file storage into a governed, AI-ready analytical platform. No infrastructure to manage and no data to move.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-amazon-s3-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your S3 buckets.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect SAP HANA to Dremio Cloud: Unlock Analytics Beyond the SAP Ecosystem</title><link>https://iceberglakehouse.com/posts/2026-03-connector-sap-hana/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-sap-hana/</guid><description>
SAP HANA is the in-memory database platform that powers SAP S/4HANA, SAP BW/4HANA, and custom enterprise applications across finance, manufacturing, ...</description><pubDate>Sun, 01 Mar 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;SAP HANA is the in-memory database platform that powers SAP S/4HANA, SAP BW/4HANA, and custom enterprise applications across finance, manufacturing, logistics, and supply chain. It&apos;s fast for SAP-native analytics — real-time financial reporting, material requirements planning, and production analytics run directly on HANA&apos;s in-memory columnar engine. But SAP HANA exists in a walled garden.&lt;/p&gt;
&lt;p&gt;Connecting HANA data to non-SAP tools requires SAP Data Intelligence, SAP Business Technology Platform (BTP), or custom ABAP extractors — all of which add significant cost and complexity. Sharing HANA data with teams that don&apos;t use SAP tools (marketing running Tableau, data science using Python, operations using Power BI) means building export pipelines that duplicate data, add latency, and create governance gaps.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to SAP HANA and queries it alongside your other data sources with standard SQL. No SAP-specific middleware. No data extraction. Your HANA data stays in place and joins with S3, PostgreSQL, BigQuery, Snowflake, or any other connected source in a single SQL query.&lt;/p&gt;
&lt;h2&gt;Why SAP HANA Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Break Out of the SAP Ecosystem&lt;/h3&gt;
&lt;p&gt;SAP Analytics Cloud and SAP BusinessObjects work well with HANA, but connecting HANA data to Tableau, Power BI, Looker, or Python-based analytics requires additional middleware, gateway servers, or data export. Dremio provides a vendor-neutral SQL layer that connects HANA to any BI tool via Arrow Flight (high-performance columnar data transfer) or standard ODBC/JDBC.&lt;/p&gt;
&lt;h3&gt;Cross-Platform Analytics&lt;/h3&gt;
&lt;p&gt;Your SAP data covers finance (GL accounts, AP/AR, cost centers) and supply chain (material masters, purchase orders, production orders). But your CRM data is in Salesforce (exported to S3), your support ticket data is in PostgreSQL, and your marketing attribution data is in Google BigQuery. Without a federation layer, combining these with SAP data requires building custom pipelines for each source. Dremio federates across all sources in a single query.&lt;/p&gt;
&lt;h3&gt;Reduce HANA Memory Pressure&lt;/h3&gt;
&lt;p&gt;SAP HANA licenses are tied to memory allocation — the more memory provisioned, the higher the license cost. Running analytical workloads in HANA consumes memory resources that compete with transactional OLTP operations. Dremio&apos;s Reflections offload repeated analytical queries from HANA&apos;s engine, reducing memory pressure and potentially allowing you to right-size your HANA memory allocation.&lt;/p&gt;
&lt;h3&gt;AI Analytics on SAP Data&lt;/h3&gt;
&lt;p&gt;SAP&apos;s AI capabilities (SAP Joule, embedded analytics) are tightly coupled to SAP applications. Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions provide AI analytics that span SAP and non-SAP data sources — enabling cross-functional insights that SAP&apos;s tools can&apos;t deliver alone.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SAP HANA hostname or IP address&lt;/strong&gt; — the HANA server&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; — typically &lt;code&gt;3NN15&lt;/code&gt;, where NN is the instance number (e.g., &lt;code&gt;30015&lt;/code&gt; for instance 00); multi-tenant (MDC) systems assign tenant databases their own SQL ports&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; (required for multi-tenant HANA systems)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; — HANA user with &lt;code&gt;SELECT&lt;/code&gt; privileges on the schemas and tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — HANA port must be reachable from Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-sap-hana-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect SAP HANA to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;SAP HANA&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection Details&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;sap-hana&lt;/code&gt; or &lt;code&gt;erp-analytics&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; HANA server hostname or IP.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;30015&lt;/code&gt; for single-tenant.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; Required for multi-tenant HANA systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Master Credentials (username/password) or Secret Resource URL (AWS Secrets Manager).&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from HANA&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enable SSL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encrypt the connection&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh, then Save&lt;/h3&gt;
&lt;h2&gt;Query SAP HANA Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query material inventory data
SELECT material_id, material_desc, plant, stock_quantity, unit_of_measure
FROM &amp;quot;sap-hana&amp;quot;.SAPABAP1.MARD
WHERE plant = &apos;1000&apos; AND stock_quantity &amp;gt; 100
ORDER BY stock_quantity DESC;

-- Financial reporting: GL Account balances
SELECT
  gl_account,
  company_code,
  fiscal_year,
  SUM(debit_amount) AS total_debits,
  SUM(credit_amount) AS total_credits,
  SUM(debit_amount) - SUM(credit_amount) AS net_balance
FROM &amp;quot;sap-hana&amp;quot;.SAPABAP1.BSEG
WHERE fiscal_year = &apos;2024&apos;
GROUP BY gl_account, company_code, fiscal_year
ORDER BY net_balance DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate SAP with Non-SAP Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join SAP material data with external supplier and demand data
SELECT
  m.material_desc,
  m.stock_quantity,
  m.plant,
  s.supplier_name,
  s.lead_time_days,
  s.unit_cost,
  d.forecasted_demand_30d,
  CASE
    WHEN m.stock_quantity &amp;lt; d.forecasted_demand_30d * 0.5 THEN &apos;Critical - Reorder Now&apos;
    WHEN m.stock_quantity &amp;lt; d.forecasted_demand_30d THEN &apos;Watch - Order Soon&apos;
    ELSE &apos;Adequate&apos;
  END AS inventory_status
FROM &amp;quot;sap-hana&amp;quot;.SAPABAP1.MARD m
JOIN &amp;quot;postgres-procurement&amp;quot;.public.suppliers s ON m.material_id = s.material_id
LEFT JOIN &amp;quot;s3-forecasting&amp;quot;.demand.material_forecasts d ON m.material_id = d.material_id AND m.plant = d.plant
WHERE s.lead_time_days &amp;lt; 14
ORDER BY s.unit_cost ASC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;SAP handles material masters, PostgreSQL has supplier details, S3 has demand forecasts — Dremio joins them all.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.inventory_health AS
SELECT
  m.material_id,
  m.material_desc,
  m.plant,
  m.stock_quantity,
  CASE
    WHEN m.stock_quantity = 0 THEN &apos;Out of Stock&apos;
    WHEN m.stock_quantity &amp;lt; 50 THEN &apos;Low Stock&apos;
    WHEN m.stock_quantity &amp;lt; 200 THEN &apos;Adequate&apos;
    ELSE &apos;Overstocked&apos;
  END AS stock_status,
  ROUND(m.stock_quantity * s.unit_cost, 2) AS inventory_value_usd
FROM &amp;quot;sap-hana&amp;quot;.SAPABAP1.MARD m
LEFT JOIN &amp;quot;postgres-procurement&amp;quot;.public.suppliers s ON m.material_id = s.material_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Create descriptions like &amp;quot;inventory_health: One row per material-plant combination showing current stock levels, status classification, and estimated inventory value in USD.&amp;quot;&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on SAP Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets users ask questions about SAP data in plain English: &amp;quot;Which materials are low stock at plant 1000?&amp;quot; or &amp;quot;What&apos;s the total inventory value across all plants?&amp;quot; The Agent reads your wiki descriptions, understands SAP terminology through the semantic layer, and generates accurate SQL.&lt;/p&gt;
&lt;p&gt;This is transformative for SAP environments where only specialists know the table structures (MARD, BSEG, VBRK) and field names. The semantic layer translates SAP&apos;s technical schema into business language.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude and ChatGPT to your SAP data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A supply chain manager asks Claude &amp;quot;Show me all critical reorder items combining SAP inventory with supplier lead times&amp;quot; and gets actionable results without knowing SAP table names.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify inventory risk with AI
SELECT
  material_desc,
  stock_quantity,
  stock_status,
  AI_CLASSIFY(
    &apos;Based on inventory levels and value, recommend a procurement action&apos;,
    &apos;Material: &apos; || material_desc || &apos;, Stock: &apos; || CAST(stock_quantity AS VARCHAR) || &apos;, Status: &apos; || stock_status || &apos;, Value: $&apos; || CAST(inventory_value_usd AS VARCHAR),
    ARRAY[&apos;Rush Order&apos;, &apos;Standard Reorder&apos;, &apos;Monitor&apos;, &apos;Liquidate Excess&apos;]
  ) AS procurement_action
FROM analytics.gold.inventory_health
WHERE stock_status IN (&apos;Out of Stock&apos;, &apos;Low Stock&apos;, &apos;Overstocked&apos;);

-- Generate supplier evaluation summaries
SELECT
  s.supplier_name,
  AI_GENERATE(
    &apos;Write a one-sentence supplier performance summary&apos;,
    &apos;Supplier: &apos; || s.supplier_name || &apos;, Lead Time: &apos; || CAST(s.lead_time_days AS VARCHAR) || &apos; days, Unit Cost: $&apos; || CAST(s.unit_cost AS VARCHAR) || &apos;, Materials Supplied: &apos; || CAST(COUNT(m.material_id) AS VARCHAR)
  ) AS performance_summary
FROM &amp;quot;postgres-procurement&amp;quot;.public.suppliers s
JOIN &amp;quot;sap-hana&amp;quot;.SAPABAP1.MARD m ON s.material_id = m.material_id
GROUP BY s.supplier_name, s.lead_time_days, s.unit_cost;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for SAP Analytics&lt;/h2&gt;
&lt;p&gt;SAP HANA is expensive to query for analytical workloads. Create Reflections on your semantic layer views to cache results:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; (full cache) or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt; (pre-computed metrics)&lt;/li&gt;
&lt;li&gt;Select columns and aggregations&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — for SAP data that changes throughout the day, hourly is typical; for period-end data, daily or weekly&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dashboard queries from Tableau or Power BI hit the Reflection instead of HANA, reducing memory consumption and license pressure. A financial reporting dashboard that queries HANA 96 times per day (15-minute refresh) with a Reflection refreshing every 2 hours consumes HANA resources only 12 times per day — an 87.5% reduction.&lt;/p&gt;
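&lt;p&gt;The arithmetic behind that estimate is easy to verify. A quick sketch (the refresh intervals are this example&apos;s, not product defaults):&lt;/p&gt;

```python
MINUTES_PER_DAY = 24 * 60

# A dashboard on a 15-minute refresh hits HANA 96 times a day.
dashboard_queries_per_day = MINUTES_PER_DAY // 15

# With a Reflection refreshed every 2 hours, HANA is touched only by
# the 12 daily refresh jobs; dashboard reads hit the Reflection.
reflection_refreshes_per_day = 24 // 2

reduction = 1 - reflection_refreshes_per_day / dashboard_queries_per_day
print(f"{dashboard_queries_per_day} queries/day -> "
      f"{reflection_refreshes_per_day} refreshes/day, a {reduction:.1%} reduction")
```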
&lt;h2&gt;Governance on SAP Data&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) adds governance that SAP&apos;s built-in security doesn&apos;t extend to non-SAP tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask salary data, cost center details, or GL account balances from specific roles. A supply chain analyst sees material inventory but not financial data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by company code, plant, or region based on the querying user&apos;s role. A plant manager sees only their plant&apos;s data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across SAP HANA, PostgreSQL, S3, BigQuery, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server — ensuring consistent access control regardless of how data is queried.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Arrow Flight connector provides 10-100x faster data transfer than JDBC/ODBC. For SAP data, this eliminates the need for SAP BusinessObjects or SAP Analytics Cloud:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector — replaces SAP-specific Tableau drivers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC or native connector — no SAP Gateway needed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for data science on SAP data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations on SAP data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query SAP data from their IDE. Ask Copilot &amp;quot;Show me low stock materials at plant 1000 from SAP&amp;quot; and get SQL generated using your semantic layer — without knowing SAP table names like MARD or BSEG.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in HANA vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in HANA:&lt;/strong&gt; Transactional data actively used by SAP applications (OLTP), data with SAP-specific processing (ABAP reports, CDS views, BW extractors), master data referenced by SAP transactions, data subject to SAP transport management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical analytical data (closed fiscal periods, prior-year orders), datasets consumed by non-SAP tools, data where HANA memory cost exceeds analytical value, data needed for AI/ML workloads outside of SAP, archived transaction data that rarely changes.&lt;/p&gt;
&lt;p&gt;For data staying in HANA, create manual Reflections to offload analytical queries. For migrated Iceberg data, Dremio provides automatic compaction, time travel, Autonomous Reflections, and zero per-query license costs.&lt;/p&gt;
&lt;h2&gt;SAP Landscape Integration&lt;/h2&gt;
&lt;p&gt;SAP HANA rarely exists in isolation. Dremio helps connect the SAP landscape with non-SAP analytics:&lt;/p&gt;
&lt;h3&gt;SAP S/4HANA Integration&lt;/h3&gt;
&lt;p&gt;S/4HANA stores business-critical data in HANA tables. Dremio connects to the underlying HANA database and reads these tables directly, bypassing the need for SAP BTP, SAP Analytics Cloud, or custom OData/RFC extractors. This gives analysts SQL access to S/4HANA data — sales orders, material documents, financial postings — alongside non-SAP sources.&lt;/p&gt;
&lt;h3&gt;SAP BW/4HANA Bridge&lt;/h3&gt;
&lt;p&gt;SAP BW/4HANA creates InfoProviders and ADSO tables in HANA. Dremio can query these underlying HANA tables, exposing BW-managed data to non-SAP BI tools. This is valuable for organizations consolidating from SAP Analytics Cloud and BW to a unified BI strategy.&lt;/p&gt;
&lt;h3&gt;Common SAP + Non-SAP Analytics Patterns&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SAP Data (HANA)&lt;/th&gt;
&lt;th&gt;Non-SAP Data&lt;/th&gt;
&lt;th&gt;Analytics Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sales orders (VBAK/VBAP)&lt;/td&gt;
&lt;td&gt;CRM opportunities (PostgreSQL)&lt;/td&gt;
&lt;td&gt;Pipeline-to-revenue tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Material documents (MSEG)&lt;/td&gt;
&lt;td&gt;IoT sensor data (S3)&lt;/td&gt;
&lt;td&gt;Predictive maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Financial postings (BSEG)&lt;/td&gt;
&lt;td&gt;External market data (BigQuery)&lt;/td&gt;
&lt;td&gt;Financial benchmarking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Employee master (PA0001)&lt;/td&gt;
&lt;td&gt;Recruitment data (MongoDB)&lt;/td&gt;
&lt;td&gt;Workforce analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dremio&apos;s federation engine joins SAP tables with non-SAP sources without extracting SAP data to external systems — maintaining SAP as the system of record.&lt;/p&gt;
&lt;h3&gt;SAP HANA Licensing Considerations&lt;/h3&gt;
&lt;p&gt;SAP HANA licensing is based on memory allocation (RAM). Every analytical query consumes HANA memory resources. Dremio&apos;s Reflections offload analytical workloads from HANA, potentially allowing organizations to reduce HANA memory allocations and associated licensing costs.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;SAP HANA users can extend their SAP analytics beyond the SAP ecosystem — connect HANA, join it with every other source, and enable AI-driven analytics without SAP-specific middleware or additional SAP licenses. Start with Reflections to offload analytical queries from HANA&apos;s in-memory engine, then build a semantic layer for AI Agent access.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-sap-hana-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your SAP HANA databases.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect IBM Db2 to Dremio Cloud: Modernize Mainframe Analytics with Federation and AI</title><link>https://iceberglakehouse.com/posts/2026-03-connector-ibm-db2/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-ibm-db2/</guid><description>
IBM Db2 is the relational database that powers critical applications across banking, insurance, government, healthcare, and manufacturing. For organi...</description><pubDate>Sun, 01 Mar 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;IBM Db2 is the relational database that powers critical applications across banking, insurance, government, healthcare, and manufacturing. For organizations running Db2 — particularly on IBM Z (mainframes) or IBM i — the database holds decades of transactional data: account balances, policy records, claim histories, manufacturing workflows, and government records. This data is enormously valuable for analytics but notoriously difficult to access outside the Db2/IBM ecosystem.&lt;/p&gt;
&lt;p&gt;Traditional approaches to Db2 analytics involve CDC tools (IBM InfoSphere DataStage, Attunity), batch exports, or data replication to a separate analytics warehouse. These approaches are expensive, complex, and create stale copies of data that diverge from the source of truth.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to Db2 (Linux, UNIX, and Windows editions) and queries it alongside modern cloud sources in real time. No CDC infrastructure. No batch exports. Your Db2 data stays in place and joins with S3, PostgreSQL, Snowflake, and any other connected source in a single SQL query.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Dremio&apos;s Db2 connector supports Db2 for LUW (Linux, UNIX, and Windows). Db2 for z/OS and Db2 for i are not directly supported. If your Db2 instance runs on z/OS or IBM i, you may need to set up a Db2 Connect gateway or replicate to a Db2 LUW instance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Why Db2 Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Access Db2 Data Without IBM Middleware&lt;/h3&gt;
&lt;p&gt;Accessing Db2 analytically typically requires IBM DataStage, IBM Cognos, or custom JDBC applications. These tools are expensive, require specialized skills, and create vendor lock-in. Dremio provides a vendor-neutral SQL layer that connects Db2 to any BI tool (Tableau, Power BI, Looker) via Arrow Flight or ODBC — no IBM middleware needed.&lt;/p&gt;
&lt;h3&gt;Federate Mainframe Data with Cloud Sources&lt;/h3&gt;
&lt;p&gt;Your core banking transactions are in Db2, but your digital banking data is in PostgreSQL on AWS, your customer support data is in MongoDB, and your regulatory data is in S3. Without a federation layer, building a 360-degree customer view requires extracting data from each source into a common warehouse. Dremio queries each in place and joins them at query time.&lt;/p&gt;
&lt;h3&gt;Incremental Modernization&lt;/h3&gt;
&lt;p&gt;Migrating off Db2 is a multi-year, high-risk project that many organizations cannot undertake. Dremio lets you modernize incrementally: start by querying Db2 through Dremio alongside cloud sources, then gradually migrate specific datasets to Iceberg tables. The migration happens over time, with Db2 continuing to serve critical transactional workloads throughout.&lt;/p&gt;
&lt;h3&gt;Cost Reduction&lt;/h3&gt;
&lt;p&gt;IBM mainframe MIPS pricing means every Db2 query consumes expensive compute capacity. Dremio&apos;s Reflections cache analytical results so repeated queries don&apos;t consume Db2 MIPS. This can meaningfully reduce mainframe compute costs for organizations with heavy analytical workloads against Db2.&lt;/p&gt;
&lt;h3&gt;AI on Legacy Data&lt;/h3&gt;
&lt;p&gt;Db2 holds decades of institutional data — customer histories, transaction patterns, risk assessments. Dremio&apos;s AI capabilities make this data accessible to non-technical users and external AI tools, unlocking insights trapped in mainframe systems.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Db2 LUW hostname or IP address&lt;/strong&gt; — the Db2 server&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; — default &lt;code&gt;50000&lt;/code&gt; for Db2 LUW&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; — the Db2 database you want to connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; — Db2 user with SELECT privileges on the schemas/tables to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — port 50000 reachable from Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-ibm-db2-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Db2 to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; and select &lt;strong&gt;IBM Db2&lt;/strong&gt; from the database source types.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;db2-banking&lt;/code&gt; or &lt;code&gt;mainframe-core&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Db2 server hostname or IP address.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;50000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The Db2 database name.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Choose either &lt;strong&gt;Master Credentials&lt;/strong&gt; (username and password) or a &lt;strong&gt;Secret Resource URL&lt;/strong&gt; (for example, AWS Secrets Manager).&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from Db2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL/TLS&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Configure Reflection Refresh and Metadata, Then Save&lt;/h3&gt;
&lt;h2&gt;Query Db2 Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query core banking accounts
SELECT
  account_id,
  customer_id,
  account_type,
  current_balance,
  last_transaction_date
FROM &amp;quot;db2-banking&amp;quot;.BANK.ACCOUNTS
WHERE account_type = &apos;SAVINGS&apos; AND current_balance &amp;gt; 10000
ORDER BY current_balance DESC;

-- Transaction analysis
SELECT
  account_type,
  DATE_TRUNC(&apos;month&apos;, transaction_date) AS month,
  COUNT(*) AS transaction_count,
  SUM(transaction_amount) AS total_amount,
  AVG(transaction_amount) AS avg_amount
FROM &amp;quot;db2-banking&amp;quot;.BANK.TRANSACTIONS
WHERE transaction_date &amp;gt;= &apos;2024-01-01&apos;
GROUP BY account_type, DATE_TRUNC(&apos;month&apos;, transaction_date)
ORDER BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate Db2 with Cloud Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join Db2 core banking with PostgreSQL digital banking and S3 support data
SELECT
  a.account_id,
  a.current_balance,
  pg.last_login_date,
  pg.mobile_transactions_30d,
  s3.support_tickets_open,
  CASE
    WHEN a.current_balance &amp;gt; 100000 AND pg.mobile_transactions_30d &amp;gt; 10 THEN &apos;High Value - Digitally Active&apos;
    WHEN a.current_balance &amp;gt; 100000 THEN &apos;High Value - Branch Preferred&apos;
    WHEN pg.mobile_transactions_30d &amp;gt; 20 THEN &apos;Digital Native&apos;
    ELSE &apos;Standard&apos;
  END AS customer_segment
FROM &amp;quot;db2-banking&amp;quot;.BANK.ACCOUNTS a
LEFT JOIN &amp;quot;postgres-digital&amp;quot;.public.customer_activity pg ON a.customer_id = pg.customer_id
LEFT JOIN &amp;quot;s3-support&amp;quot;.tickets.customer_tickets s3 ON a.customer_id = s3.customer_id
ORDER BY a.current_balance DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Mainframe banking data joins with cloud application data in a single query — no CDC, no data extraction.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_banking360 AS
SELECT
  a.customer_id,
  a.account_type,
  a.current_balance,
  pg.customer_name,
  pg.email,
  CASE
    WHEN a.current_balance &amp;gt; 250000 THEN &apos;Private Banking&apos;
    WHEN a.current_balance &amp;gt; 50000 THEN &apos;Premium&apos;
    WHEN a.current_balance &amp;gt; 10000 THEN &apos;Standard&apos;
    ELSE &apos;Basic&apos;
  END AS service_tier,
  DATEDIFF(DAY, a.last_transaction_date, CURRENT_DATE) AS days_since_last_transaction
FROM &amp;quot;db2-banking&amp;quot;.BANK.ACCOUNTS a
LEFT JOIN &amp;quot;postgres-digital&amp;quot;.public.customers pg ON a.customer_id = pg.customer_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Create descriptions like: &amp;quot;customer_banking360: Combines mainframe core banking data with digital channel activity to provide a complete customer view for relationship management.&amp;quot;&lt;/p&gt;
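&lt;p&gt;Because the tier thresholds live in a single view, they can also be mirrored and unit-tested outside SQL. A minimal Python sketch of the same CASE logic (the function name and thresholds mirror the view above; nothing here is a Dremio API):&lt;/p&gt;

```python
def service_tier(current_balance: float) -> str:
    """Mirror of the CASE expression in the customer_banking360 view."""
    # Thresholds are strict: exactly 250,000 falls into Premium.
    if current_balance > 250_000:
        return "Private Banking"
    if current_balance > 50_000:
        return "Premium"
    if current_balance > 10_000:
        return "Standard"
    return "Basic"

print(service_tier(300_000))  # "Private Banking"
print(service_tier(60_000))   # "Premium"
print(service_tier(10_000))   # "Basic": boundary value is not strictly greater
```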
&lt;h2&gt;AI-Powered Analytics on Db2 Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent transforms access to mainframe data. Instead of needing a Db2 DBA to write queries against complex schemas, a relationship manager asks &amp;quot;Show me all Private Banking customers who haven&apos;t transacted in 30 days&amp;quot; and gets accurate results from the semantic layer. The Agent reads your wiki descriptions to understand what &amp;quot;Private Banking&amp;quot; (balance &amp;gt; $250K) and &amp;quot;days_since_last_transaction&amp;quot; mean.&lt;/p&gt;
&lt;p&gt;This democratizes access to decades of mainframe data that was previously accessible only through COBOL reports or specialized IBM tools.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your Db2 data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A compliance officer asks Claude &amp;quot;Show me all accounts with balances over $100K and no transactions in 60 days for our dormancy review&amp;quot; and gets a governed, accurate report from Db2 — without knowing Db2 table structures.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify account risk with AI
SELECT
  customer_id,
  service_tier,
  current_balance,
  days_since_last_transaction,
  AI_CLASSIFY(
    &apos;Based on these banking patterns, classify the account dormancy risk&apos;,
    &apos;Tier: &apos; || service_tier || &apos;, Balance: $&apos; || CAST(current_balance AS VARCHAR) || &apos;, Days Inactive: &apos; || CAST(days_since_last_transaction AS VARCHAR),
    ARRAY[&apos;Active&apos;, &apos;At Risk&apos;, &apos;Potentially Dormant&apos;, &apos;Dormant&apos;]
  ) AS dormancy_risk
FROM analytics.gold.customer_banking360
WHERE days_since_last_transaction &amp;gt; 30;

-- Generate relationship manager talking points
SELECT
  customer_name,
  service_tier,
  AI_GENERATE(
    &apos;Write a one-sentence talking point for a relationship manager reaching out to this customer&apos;,
    &apos;Customer: &apos; || customer_name || &apos;, Tier: &apos; || service_tier || &apos;, Balance: $&apos; || CAST(current_balance AS VARCHAR) || &apos;, Inactive Days: &apos; || CAST(days_since_last_transaction AS VARCHAR)
  ) AS outreach_talking_point
FROM analytics.gold.customer_banking360
WHERE service_tier = &apos;Private Banking&apos; AND days_since_last_transaction &amp;gt; 14;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for Mainframe Cost Reduction&lt;/h2&gt;
&lt;p&gt;Every query against Db2 on a mainframe consumes MIPS. Create Reflections to cache frequently accessed analytics:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — hourly for active accounts, daily for historical analysis&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Dashboard and reporting queries hit Reflections instead of Db2, significantly reducing mainframe compute consumption. Once the Reflection is built, a compliance dashboard that refreshes every 15 minutes consumes Db2 MIPS only during scheduled Reflection refreshes rather than on every dashboard load.&lt;/p&gt;
&lt;h2&gt;Governance on Db2 Data&lt;/h2&gt;
&lt;p&gt;Banking, insurance, and government organizations have strict data governance requirements. Dremio&apos;s Fine-Grained Access Control (FGAC) adds a governance layer that Db2&apos;s built-in security cannot provide for non-IBM tools, applied uniformly across every connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask account balances, SSNs, and transaction amounts from specific roles. A marketing analyst sees customer segments but not financial data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Branch-level access control — a branch manager sees only their branch&apos;s accounts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across Db2, PostgreSQL, S3, and all other connected sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server — meeting regulatory requirements for data access control.&lt;/p&gt;
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access to mainframe data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector — no IBM middleware&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic access to Db2 data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations on Db2 data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Db2 data from their IDE. Ask Copilot &amp;quot;Show me dormant high-value accounts from Db2&amp;quot; and get SQL generated using your semantic layer — without knowing Db2 table schemas or COBOL naming conventions.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Db2 vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Db2:&lt;/strong&gt; Active transactional data for applications, data with COBOL program dependencies, regulatory data that must maintain system of record status, data subject to mainframe-specific compliance requirements.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical transaction archives (closed accounts, prior fiscal years), data consumed by non-mainframe tools, datasets where mainframe MIPS cost exceeds analytical value, archived data for long-term retention.&lt;/p&gt;
&lt;p&gt;For data staying in Db2, create manual Reflections to reduce MIPS consumption. For migrated Iceberg data, Dremio provides automatic compaction, time travel, Autonomous Reflections, and dramatically lower storage costs.&lt;/p&gt;
&lt;h2&gt;Db2 Character Encoding and Data Types&lt;/h2&gt;
&lt;p&gt;Db2 uses EBCDIC encoding on mainframes and ASCII/UTF-8 on LUW platforms. When connecting through Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;EBCDIC to UTF-8:&lt;/strong&gt; Db2 for LUW handles character conversion automatically — Dremio receives standard Unicode data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GRAPHIC/VARGRAPHIC:&lt;/strong&gt; Double-byte character columns map to VARCHAR in Dremio&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DECIMAL/NUMERIC:&lt;/strong&gt; Db2&apos;s fixed-point types map to Dremio&apos;s DECIMAL with matching precision/scale&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DATE/TIME/TIMESTAMP:&lt;/strong&gt; Standard mapping — Db2 timestamps map to Dremio TIMESTAMP&lt;/li&gt;
&lt;/ul&gt;
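&lt;p&gt;These mappings can be captured in a small lookup for tooling that inspects schemas. A hedged Python sketch (the dictionary reflects only the mappings listed above and is not an official or exhaustive mapping table):&lt;/p&gt;

```python
# Illustrative lookup for the Db2-to-Dremio type mappings described above.
DB2_TO_DREMIO = {
    "GRAPHIC": "VARCHAR",
    "VARGRAPHIC": "VARCHAR",
    "DECIMAL": "DECIMAL",
    "NUMERIC": "DECIMAL",
    "DATE": "DATE",
    "TIME": "TIME",
    "TIMESTAMP": "TIMESTAMP",
}

def dremio_type(db2_type: str) -> str:
    # Fall back to the Db2 type name for anything not covered here.
    return DB2_TO_DREMIO.get(db2_type.upper(), db2_type.upper())

print(dremio_type("VARGRAPHIC"))  # "VARCHAR"
print(dremio_type("numeric"))     # "DECIMAL"
```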
&lt;h2&gt;Regulatory Compliance Patterns&lt;/h2&gt;
&lt;p&gt;Banking, insurance, and government organizations have strict data retention and access requirements. Dremio addresses these:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Dremio Feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data residency&lt;/td&gt;
&lt;td&gt;Query data in place — no cross-border data movement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access auditing&lt;/td&gt;
&lt;td&gt;Query logs track who queried what data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Column-level security&lt;/td&gt;
&lt;td&gt;FGAC column masking hides sensitive fields&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Row-level security&lt;/td&gt;
&lt;td&gt;FGAC row filtering restricts data by user role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data retention&lt;/td&gt;
&lt;td&gt;Time travel on Iceberg tables provides point-in-time access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Mainframe Modernization Roadmap&lt;/h2&gt;
&lt;p&gt;Use Dremio as the bridge in a multi-year mainframe modernization:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Phase 1 (Months 1-3):&lt;/strong&gt; Connect Db2 to Dremio Cloud. Create Reflections to offload analytical queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 2 (Months 4-6):&lt;/strong&gt; Build a semantic layer over Db2 data. Enable AI Agent and MCP Server for business users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 3 (Months 7-12):&lt;/strong&gt; Identify high-value datasets for migration to Iceberg. Use &lt;code&gt;CREATE TABLE AS SELECT&lt;/code&gt; to migrate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Phase 4 (Year 2+):&lt;/strong&gt; Gradually migrate remaining datasets as mainframe contracts renew. Db2 focus narrows to core OLTP.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Throughout the process, users experience no disruption — they continue using the same semantic layer views. Only the underlying data sources change.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Db2 users can modernize analytics without migrating off the mainframe — federate, govern, accelerate, and AI-enable decades of institutional data through Dremio Cloud. Start with Reflections to offload analytical queries from Db2, then progressively build a semantic layer that makes legacy data accessible to modern AI tools and business users.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-ibm-db2-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Db2 databases.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Microsoft SQL Server to Dremio Cloud: Federate Enterprise Data Without ETL</title><link>https://iceberglakehouse.com/posts/2026-03-connector-microsoft-sql-server/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-microsoft-sql-server/</guid><description>
Microsoft SQL Server is one of the most widely deployed enterprise databases in the world. ERP systems, CRM platforms, financial applications, and cu...</description><pubDate>Sun, 01 Mar 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Microsoft SQL Server is one of the most widely deployed enterprise databases in the world. ERP systems, CRM platforms, financial applications, and custom business applications run on SQL Server across on-premises data centers and Azure cloud deployments. But connecting SQL Server data to a modern analytics platform typically requires building ETL pipelines, managing SSIS packages, or purchasing additional SQL Server Enterprise licenses for analytics workloads.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to SQL Server and queries it alongside S3, PostgreSQL, Snowflake, BigQuery, MongoDB, and every other connected source in a single SQL query. You don&apos;t need to extract data from SQL Server, build staging tables, or manage nightly ETL jobs. Dremio reads SQL Server in place, applies governance, and accelerates repeated queries with Reflections.&lt;/p&gt;
&lt;p&gt;SQL Server licensing is notoriously expensive — Enterprise edition costs tens of thousands of dollars per core. Running analytical queries directly against production SQL Server instances consumes CPU capacity that&apos;s licensed for transactional workloads. Dremio&apos;s Reflections cache analytical results, offloading read-heavy queries from SQL Server and potentially allowing organizations to reduce their SQL Server core count or downgrade from Enterprise to Standard edition.&lt;/p&gt;
&lt;h2&gt;Why SQL Server Users Need Dremio&lt;/h2&gt;
&lt;h3&gt;Escape Linked Server Limitations&lt;/h3&gt;
&lt;p&gt;SQL Server&apos;s linked servers provide basic federation, but they&apos;re limited: poor cross-platform support (try linking to MongoDB or BigQuery), no query optimization across links, no governance layer, and performance degrades with large result sets. Dremio&apos;s federation engine is purpose-built for cross-source queries — it pushes predicates to each source, optimizes join strategies, and handles large-scale data movement efficiently.&lt;/p&gt;
&lt;h3&gt;Reduce SQL Server License Costs&lt;/h3&gt;
&lt;p&gt;SQL Server Enterprise licensing is expensive — especially when analytical workloads compete with transactional OLTP operations for CPU and memory. Dremio&apos;s Reflections offload repeated analytical queries from SQL Server: dashboard refreshes, scheduled reports, and ad-hoc exploration hit cached Reflections instead of SQL Server. This can reduce the SQL Server resources dedicated to analytics, potentially allowing you to downgrade from Enterprise to Standard edition or reduce core counts.&lt;/p&gt;
&lt;h3&gt;Multi-Cloud, Multi-Database Analytics&lt;/h3&gt;
&lt;p&gt;Your SQL Server holds ERP data, but your data lake is on S3, your marketing data is in Google BigQuery, and your modern applications use PostgreSQL. Without Dremio, combining these requires SSIS packages, Azure Data Factory, or custom ETL for each source. Dremio queries all of them in a single SQL statement.&lt;/p&gt;
&lt;h3&gt;Unified Governance Beyond Windows&lt;/h3&gt;
&lt;p&gt;SQL Server has Windows Authentication and SQL Logins, but these don&apos;t apply to your S3 data lake, BigQuery, or PostgreSQL. Dremio&apos;s Fine-Grained Access Control applies column masking and row-level filtering consistently across SQL Server and every other connected source.&lt;/p&gt;
&lt;h3&gt;AI Analytics on Enterprise Data&lt;/h3&gt;
&lt;p&gt;SQL Server stores decades of business data — financial records, customer histories, inventory movements. Dremio&apos;s AI Agent, MCP Server, and AI SQL Functions make that historical data queryable by natural language and enrichable by AI, unlocking insights that would otherwise require a data analyst with deep institutional knowledge.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SQL Server hostname or IP address&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port&lt;/strong&gt; — default &lt;code&gt;1433&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; (SQL Authentication) — user needs SELECT permissions on target schemas and tables&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — port 1433 must be reachable from Dremio Cloud. For on-premises SQL Server, configure VPN or firewall rules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-sqlserver-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect SQL Server to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Source&lt;/h3&gt;
&lt;p&gt;Click &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the Dremio console and select &lt;strong&gt;Microsoft SQL Server&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure Connection&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Descriptive identifier (e.g., &lt;code&gt;sqlserver-erp&lt;/code&gt; or &lt;code&gt;production-db&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; SQL Server hostname or IP address.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;1433&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; The database name to connect to.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Enter SQL Authentication credentials (username/password) or use a Secret Resource URL for centralized credential management via AWS Secrets Manager.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch from SQL Server&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection pool management&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable SSL/TLS&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSL Verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verify SSL server certificate&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hostname in Certificate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expected hostname in SSL certificate&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Configure Reflection Refresh and Metadata, Then Save&lt;/h3&gt;
&lt;h2&gt;Query SQL Server Data&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query ERP inventory data
SELECT
  product_id,
  product_name,
  warehouse_location,
  quantity_on_hand,
  reorder_point
FROM &amp;quot;sqlserver-erp&amp;quot;.dbo.products
WHERE quantity_on_hand &amp;lt; reorder_point
ORDER BY quantity_on_hand ASC;

-- Financial reporting
SELECT
  department_code,
  account_category,
  fiscal_quarter,
  SUM(actual_amount) AS actual_spend,
  SUM(budget_amount) AS budgeted,
  ROUND((SUM(actual_amount) - SUM(budget_amount)) / NULLIF(SUM(budget_amount), 0) * 100, 1) AS variance_pct
FROM &amp;quot;sqlserver-erp&amp;quot;.finance.budget_actuals
WHERE fiscal_year = 2024
GROUP BY department_code, account_category, fiscal_quarter
ORDER BY ABS(SUM(actual_amount) - SUM(budget_amount)) DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Federate SQL Server with Other Sources&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Join SQL Server ERP with PostgreSQL CRM and S3 marketing data
SELECT
  ss.product_name,
  ss.quantity_on_hand,
  pg.total_orders,
  pg.avg_order_value,
  s3.click_through_rate,
  CASE
    WHEN pg.total_orders &amp;gt; 100 AND ss.quantity_on_hand &amp;lt; 50 THEN &apos;Reorder - High Demand&apos;
    WHEN pg.total_orders &amp;lt; 10 AND ss.quantity_on_hand &amp;gt; 500 THEN &apos;Overstock - Reduce&apos;
    ELSE &apos;Normal&apos;
  END AS inventory_action
FROM &amp;quot;sqlserver-erp&amp;quot;.dbo.products ss
LEFT JOIN (
  SELECT product_id, COUNT(*) AS total_orders, AVG(order_value) AS avg_order_value
  FROM &amp;quot;postgres-crm&amp;quot;.public.orders
  WHERE order_date &amp;gt;= &apos;2024-01-01&apos;
  GROUP BY product_id
) pg ON ss.product_id = pg.product_id
LEFT JOIN &amp;quot;s3-marketing&amp;quot;.analytics.product_clicks s3 ON ss.product_id = s3.product_id
WHERE ss.quantity_on_hand &amp;lt; ss.reorder_point OR pg.total_orders &amp;gt; 100
ORDER BY pg.total_orders DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Build a Semantic Layer&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.inventory_management AS
SELECT
  p.product_id,
  p.product_name,
  p.warehouse_location,
  p.quantity_on_hand,
  p.reorder_point,
  CASE
    WHEN p.quantity_on_hand = 0 THEN &apos;Out of Stock&apos;
    WHEN p.quantity_on_hand &amp;lt; p.reorder_point * 0.5 THEN &apos;Critical&apos;
    WHEN p.quantity_on_hand &amp;lt; p.reorder_point THEN &apos;Low&apos;
    ELSE &apos;Adequate&apos;
  END AS stock_status,
  ROUND(p.quantity_on_hand * p.unit_cost, 2) AS inventory_value
FROM &amp;quot;sqlserver-erp&amp;quot;.dbo.products p;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; → &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. Create descriptions like: &amp;quot;inventory_management: One row per product showing current stock levels, stock status classification, and estimated inventory value. Use this view to monitor reorder needs.&amp;quot;&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on SQL Server Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets operations managers ask &amp;quot;Which products are critically low at the Chicago warehouse?&amp;quot; without writing SQL. The Agent reads your wiki descriptions, understands &amp;quot;Critical&amp;quot; means stock below 50% of reorder point, and generates accurate queries. This is transformative for SQL Server environments where tribal knowledge about table schemas and column meanings lives in senior employees&apos; heads.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;Connect &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Claude or ChatGPT&lt;/a&gt; to your SQL Server data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth app in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client&lt;/li&gt;
&lt;li&gt;Connect via &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A warehouse manager asks Claude &amp;quot;Show me all products that need reordering, sorted by how critical the shortage is&amp;quot; and gets actionable results from the semantic layer over SQL Server — no SQL, no SSMS.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Generate reorder recommendations with AI
SELECT
  product_name,
  stock_status,
  quantity_on_hand,
  reorder_point,
  AI_GENERATE(
    &apos;Write a one-sentence reorder recommendation based on inventory status&apos;,
    &apos;Product: &apos; || product_name || &apos;, Stock: &apos; || CAST(quantity_on_hand AS VARCHAR) || &apos;, Reorder Point: &apos; || CAST(reorder_point AS VARCHAR) || &apos;, Status: &apos; || stock_status
  ) AS reorder_recommendation
FROM analytics.gold.inventory_management
WHERE stock_status IN (&apos;Critical&apos;, &apos;Out of Stock&apos;);

-- Classify financial variances
SELECT
  department_code,
  variance_pct,
  AI_CLASSIFY(
    &apos;Based on the budget variance, classify the financial risk level&apos;,
    &apos;Department: &apos; || department_code || &apos;, Variance: &apos; || CAST(variance_pct AS VARCHAR) || &apos;%&apos;,
    ARRAY[&apos;On Track&apos;, &apos;Minor Variance&apos;, &apos;Significant Overspend&apos;, &apos;Critical Overspend&apos;]
  ) AS financial_risk
FROM (
  SELECT department_code, ROUND((SUM(actual_amount) - SUM(budget_amount)) / NULLIF(SUM(budget_amount), 0) * 100, 1) AS variance_pct
  FROM &amp;quot;sqlserver-erp&amp;quot;.finance.budget_actuals
  WHERE fiscal_year = 2024
  GROUP BY department_code
) AS dept_variance;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;SQL Server Enterprise is licensed per core. Offloading analytical queries to Reflections reduces compute pressure on SQL Server cores:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — for ERP data updated throughout the day, hourly; for financial data, match to reporting cycles&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools get sub-second response times from Reflections. SQL Server focuses on transactional OLTP workloads. A financial dashboard refreshing every 15 minutes generates zero SQL Server load after the Reflection is built.&lt;/p&gt;
&lt;h2&gt;Governance Across SQL Server and Other Sources&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s Fine-Grained Access Control (FGAC) extends SQL Server security to every connected source:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask financial data, salary details, or PII from specific roles. A warehouse manager sees inventory levels but not cost data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Regional managers see only their region&apos;s data. Department heads see only their department.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across SQL Server, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
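&lt;p&gt;As a concrete sketch, Dremio implements masking policies as SQL UDFs attached to columns. The function, group, and column names below are illustrative; check the Fine-Grained Access Control documentation for your edition before relying on this form:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Illustrative masking policy: only the finance group sees unit cost
CREATE FUNCTION protect_cost (cost DOUBLE)
RETURNS DOUBLE
RETURN SELECT CASE WHEN is_member(&apos;finance&apos;) THEN cost ELSE NULL END;

ALTER TABLE &amp;quot;sqlserver-erp&amp;quot;.dbo.products
MODIFY COLUMN unit_cost SET MASKING POLICY protect_cost (unit_cost);
&lt;/code&gt;&lt;/pre&gt;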
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight delivers SQL Server query results to client tools 10-100x faster than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio native connector — ideal for Microsoft-centric organizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; adapter for transformation workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query SQL Server data from their IDE. Ask Copilot &amp;quot;Show me products below reorder point at the Chicago warehouse&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in SQL Server vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in SQL Server:&lt;/strong&gt; Transactional data for active applications, data with stored procedures and triggers, operational systems that depend on SQL Server features (SSRS, SSIS, linked servers).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical records and archives, reporting data, data consumed by non-SQL-Server tools, datasets where SQL Server per-core licensing cost exceeds analytical value. Migrated Iceberg tables get Dremio&apos;s automatic compaction, time travel, and Autonomous Reflections.&lt;/p&gt;
&lt;p&gt;For data staying in SQL Server, create manual Reflections. For migrated Iceberg data, Dremio handles optimization automatically.&lt;/p&gt;
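&lt;p&gt;Migration itself can be as simple as a CTAS statement in Dremio. A minimal sketch, assuming an &lt;code&gt;analytics.archive&lt;/code&gt; folder already exists in your catalog:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Illustrative: copy closed fiscal years from SQL Server into an Iceberg table
CREATE TABLE analytics.archive.budget_actuals_history AS
SELECT *
FROM &amp;quot;sqlserver-erp&amp;quot;.finance.budget_actuals
WHERE fiscal_year &amp;lt; 2024;
&lt;/code&gt;&lt;/pre&gt;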
&lt;h2&gt;Query Pushdown to SQL Server&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s federation engine optimizes cross-source queries by pushing operations to SQL Server whenever possible:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Filter pushdown:&lt;/strong&gt; &lt;code&gt;WHERE&lt;/code&gt; clauses are pushed to SQL Server, so only matching rows are transferred&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Projection pushdown:&lt;/strong&gt; Only the columns referenced in your query are requested from SQL Server&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregate pushdown:&lt;/strong&gt; &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt;, and &lt;code&gt;AVG&lt;/code&gt; can be executed inside SQL Server when the query plan allows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This minimizes data transfer between SQL Server and Dremio, reducing network traffic and improving query performance.&lt;/p&gt;
&lt;h2&gt;ERP Integration Patterns&lt;/h2&gt;
&lt;p&gt;SQL Server frequently powers ERP systems (Microsoft Dynamics, custom internal ERPs). Dremio enables analytics that combine ERP data with external sources:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SQL Server (ERP)&lt;/th&gt;
&lt;th&gt;External Source&lt;/th&gt;
&lt;th&gt;Analytics Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Inventory levels&lt;/td&gt;
&lt;td&gt;S3 demand forecasts&lt;/td&gt;
&lt;td&gt;Automated reorder predictions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Purchase orders&lt;/td&gt;
&lt;td&gt;PostgreSQL supplier data&lt;/td&gt;
&lt;td&gt;Supplier performance scoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Financial actuals&lt;/td&gt;
&lt;td&gt;BigQuery market data&lt;/td&gt;
&lt;td&gt;Revenue benchmarking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer accounts&lt;/td&gt;
&lt;td&gt;MongoDB support tickets&lt;/td&gt;
&lt;td&gt;Churn risk assessment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These cross-source analytics are impractical with SQL Server alone; traditionally they require SQL Server Integration Services (SSIS) ETL pipelines for each external source. Dremio eliminates that requirement.&lt;/p&gt;
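&lt;p&gt;For example, the churn-risk pattern from the table might look like the following — the &lt;code&gt;mongo-support&lt;/code&gt; source and its table and column names are hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Illustrative: flag accounts with several open support tickets
SELECT
  c.customer_id,
  c.account_status,
  COUNT(t.ticket_id) AS open_tickets
FROM &amp;quot;sqlserver-erp&amp;quot;.dbo.customers c
LEFT JOIN &amp;quot;mongo-support&amp;quot;.helpdesk.tickets t
  ON c.customer_id = t.customer_id AND t.status = &apos;Open&apos;
GROUP BY c.customer_id, c.account_status
HAVING COUNT(t.ticket_id) &amp;gt;= 3;
&lt;/code&gt;&lt;/pre&gt;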
&lt;h2&gt;SQL Server Always Encrypted and SSL&lt;/h2&gt;
&lt;p&gt;Dremio supports SSL/TLS connections to SQL Server. For databases using Always Encrypted columns, be aware that Dremio reads the encrypted values — decryption requires the Column Master Key, which is managed by the application. For analytical workloads, consider creating views on the SQL Server side that expose non-encrypted analytical summaries.&lt;/p&gt;
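&lt;p&gt;Such a view is created on the SQL Server side with T-SQL; the table and column names here are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Run in SQL Server (not Dremio): expose only non-encrypted columns
CREATE VIEW dbo.customer_summary AS
SELECT customer_id, region, account_status, created_date
FROM dbo.customers;  -- omits Always Encrypted columns such as card_number
&lt;/code&gt;&lt;/pre&gt;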
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;SQL Server users can federate enterprise data, reduce license costs, deploy AI analytics, and apply unified governance across their entire data estate. Start by connecting your primary SQL Server instance to Dremio Cloud. Create Reflections on your most-queried reporting tables to offload analytical queries from SQL Server immediately, reducing CPU load and freeing licensed cores for transactional workloads.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-sqlserver-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your SQL Server instances.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Extract Structured Data from Text with Dremio&apos;s AI_GENERATE Function</title><link>https://iceberglakehouse.com/posts/2026-03-ai-ai-generate/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-ai-ai-generate/</guid><description>
Unstructured text is the most underused data in most organizations. Customer emails sit in inboxes. Contract notes live in text fields. Meeting summa...</description><pubDate>Sun, 01 Mar 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Unstructured text is the most underused data in most organizations. Customer emails sit in inboxes. Contract notes live in text fields. Meeting summaries exist as free-text columns in CRM systems. The information is there, but it&apos;s locked inside prose that SQL can&apos;t filter, join, or aggregate.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;code&gt;AI_GENERATE&lt;/code&gt; function breaks that lock. It sends unstructured text to an LLM and returns structured rows with typed columns. You define the output schema directly in SQL, and the LLM extracts the fields you specify. An email becomes a row with &lt;code&gt;sender&lt;/code&gt;, &lt;code&gt;subject&lt;/code&gt;, &lt;code&gt;priority&lt;/code&gt;, and &lt;code&gt;action_items&lt;/code&gt; columns. A contract note becomes a row with &lt;code&gt;party_name&lt;/code&gt;, &lt;code&gt;contract_value&lt;/code&gt;, &lt;code&gt;start_date&lt;/code&gt;, and &lt;code&gt;terms&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This tutorial builds a complete document processing pipeline in a fresh Dremio Cloud account. You&apos;ll create sample email and contract data, build a medallion architecture, and use &lt;code&gt;AI_GENERATE&lt;/code&gt; to extract structured fields from free text. A separate section covers using &lt;code&gt;AI_GENERATE&lt;/code&gt; with &lt;code&gt;LIST_FILES&lt;/code&gt; to process unstructured files (PDFs, text files) stored in object storage.&lt;/p&gt;
&lt;h2&gt;What You&apos;ll Build&lt;/h2&gt;
&lt;p&gt;By the end of this tutorial, you&apos;ll have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A dataset with 50+ raw emails and 25+ contract notes containing free-text descriptions&lt;/li&gt;
&lt;li&gt;Bronze views that standardize raw data&lt;/li&gt;
&lt;li&gt;Silver views that join emails with contract information&lt;/li&gt;
&lt;li&gt;Gold views that use &lt;code&gt;AI_GENERATE&lt;/code&gt; with &lt;code&gt;WITH SCHEMA&lt;/code&gt; to extract structured fields from text&lt;/li&gt;
&lt;li&gt;Materialized Iceberg tables that persist extracted data for downstream analytics&lt;/li&gt;
&lt;li&gt;An understanding of how to combine &lt;code&gt;AI_GENERATE&lt;/code&gt; with &lt;code&gt;LIST_FILES&lt;/code&gt; for file-based extraction&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-generate-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI enabled&lt;/strong&gt; — go to Admin → Project Settings → Preferences → AI section and enable AI features&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Provider configured&lt;/strong&gt; — Dremio provides a hosted LLM by default, or connect your own (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Tables in the built-in Open Catalog use &lt;code&gt;folder.subfolder.table_name&lt;/code&gt; without a catalog prefix. External sources use &lt;code&gt;source_name.schema.table_name&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Understanding AI_GENERATE&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;AI_GENERATE&lt;/code&gt; is the most powerful of Dremio&apos;s AI SQL functions because it returns structured data from unstructured input. The function signature:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;AI_GENERATE(
  [model_name VARCHAR,]
  prompt VARCHAR,
  target_data VARCHAR
  [WITH SCHEMA (field_name DATA_TYPE, ...)]
) → ROW | VARCHAR
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;model_name&lt;/strong&gt; (optional) — specify a model like &lt;code&gt;&apos;openai.gpt-4o&apos;&lt;/code&gt;. Format is &lt;code&gt;modelProvider.modelName&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prompt&lt;/strong&gt; — the extraction instruction telling the LLM what fields to find in the target data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;target_data&lt;/strong&gt; — the unstructured text column to process. This is usually a column from your table containing emails, notes, descriptions, or document content.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WITH SCHEMA&lt;/strong&gt; (optional but recommended) — defines the output structure as a ROW type with named, typed columns. Without it, &lt;code&gt;AI_GENERATE&lt;/code&gt; returns a &lt;code&gt;VARCHAR&lt;/code&gt; (plain text). With it, you get a &lt;code&gt;ROW&lt;/code&gt; that you can expand using dot notation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;WITH SCHEMA&lt;/code&gt; clause is what makes &lt;code&gt;AI_GENERATE&lt;/code&gt; different from &lt;code&gt;AI_COMPLETE&lt;/code&gt;. Instead of getting free-form text back, you get a typed row where each field is a column you defined, ready for filtering, joining, and aggregating.&lt;/p&gt;
&lt;h3&gt;ROW Type Output&lt;/h3&gt;
&lt;p&gt;When you use &lt;code&gt;WITH SCHEMA&lt;/code&gt;, the result is a &lt;code&gt;ROW&lt;/code&gt; type. Access individual fields with dot notation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  result.sender,
  result.priority,
  result.action_items
FROM (
  SELECT AI_GENERATE(
    &apos;Extract key information from this email&apos;,
    email_body
    WITH SCHEMA (sender VARCHAR, priority VARCHAR, action_items VARCHAR)
  ) AS result
  FROM emails
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 1: Create Your Folder Structure&lt;/h2&gt;
&lt;p&gt;Open the &lt;strong&gt;SQL Runner&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE FOLDER IF NOT EXISTS aigenerateexp;
CREATE FOLDER IF NOT EXISTS aigenerateexp.document_data;
CREATE FOLDER IF NOT EXISTS aigenerateexp.bronze;
CREATE FOLDER IF NOT EXISTS aigenerateexp.silver;
CREATE FOLDER IF NOT EXISTS aigenerateexp.gold;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 2: Seed Your Sample Data&lt;/h2&gt;
&lt;h3&gt;Raw Emails Table&lt;/h3&gt;
&lt;p&gt;This table simulates emails stored in a CRM system. Each email has a free-text body that contains multiple pieces of information: who sent it, what they&apos;re asking about, how urgent it is, and what action is needed. Extracting these fields manually would require a human to read each email. &lt;code&gt;AI_GENERATE&lt;/code&gt; automates this.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aigenerateexp.document_data.raw_emails (
  email_id INT,
  received_date DATE,
  email_body VARCHAR
);

INSERT INTO aigenerateexp.document_data.raw_emails VALUES
(1, &apos;2025-09-01&apos;, &apos;Hi team, this is Sarah Chen from Acme Corp. We need to urgently discuss the renewal of our enterprise license. Our current contract expires on October 15th and we want to add 200 additional seats. Can someone from your licensing team contact me by end of week? My direct line is 555-0142. Thanks, Sarah&apos;),
(2, &apos;2025-09-02&apos;, &apos;To whom it may concern, I am writing to report a critical production outage affecting our CloudSync deployment. All file synchronization stopped at 3:47 AM EST this morning. Over 500 users are impacted. We need immediate escalation to your Level 3 support team. This is a P1 issue per our SLA terms. Regards, James Rodriguez, VP of IT, Global Industries&apos;),
(3, &apos;2025-09-03&apos;, &apos;Hello, my name is Emily Watson and I am the procurement manager at TechStart Inc. We are evaluating DataVault Enterprise for our compliance requirements. Could you send me pricing information for a 3-year commitment with 150 users? Also interested in the SOC 2 audit documentation. Our budget review is scheduled for next month so no rush. Best, Emily&apos;),
(4, &apos;2025-09-05&apos;, &apos;URGENT: Our QuickReport installation has been down for 6 hours. Dashboard presentations to the board of directors are in 2 hours. We need the reporting engine restored immediately. Client: MegaCorp Financial. Contact: David Kim, CFO. Phone: 555-0198. This is affecting our quarterly earnings presentation.&apos;),
(5, &apos;2025-09-06&apos;, &apos;Hi there, I wanted to share some positive feedback. Your DevPipeline product has reduced our deployment time from 45 minutes to under 3 minutes. Our engineering team of 80 developers is very happy with the migration. We are considering expanding to our European offices next quarter. Great product! - Michael Brown, CTO, CloudNine Software&apos;),
(6, &apos;2025-09-08&apos;, &apos;Dear Support, we recently purchased MailForge for our marketing team but are having trouble with the SMTP relay configuration. Emails are being flagged as spam by Gmail and Outlook recipients. Our deliverability rate dropped from 98% to 62% after switching to MailForge. This is not urgent but needs resolution by end of month before our holiday campaign launches. Sincerely, Lisa Park, Marketing Director, RetailPlus&apos;),
(7, &apos;2025-09-10&apos;, &apos;To the sales team: We are a healthcare organization looking for a HIPAA-compliant backup solution. We evaluated CloudBackup but have concerns about the BAA terms in section 4.2. Can your legal team review our proposed amendments? We handle PHI for approximately 50000 patients. Timeline: need decision by November 1st. Contact: Dr. Anna Kowalski, Chief Medical Information Officer, Metro Health System&apos;),
(8, &apos;2025-09-11&apos;, &apos;I am writing to formally request cancellation of our HelpDesk360 subscription effective immediately. The product has not met our expectations. Response routing is inaccurate, the knowledge base search returns irrelevant results, and we have experienced 3 unplanned outages in the past month. Please process our refund for the remaining 8 months on our annual contract. Robert Taylor, Operations Director, ServiceFirst Ltd&apos;),
(9, &apos;2025-09-12&apos;, &apos;Quick question: does FormBuilder support WCAG 2.1 AA compliance for government forms? We are a state agency and this is a hard requirement for procurement. If yes, can you point me to the VPAT documentation? Thanks, Maria Garcia, Accessibility Coordinator, State of California Department of Technology&apos;),
(10, &apos;2025-09-14&apos;, &apos;Hi, our team has been using TeamBoard for 6 months and we love it. However we really need a way to export Gantt charts to PDF while preserving the formatting. The current export flattens all the dependency lines and makes the chart unreadable. Is this on your roadmap? Our PMO presents these charts to clients weekly. Tom Williams, PMO Lead, ConsultCo&apos;),
(11, &apos;2025-09-15&apos;, &apos;INCIDENT REPORT: At approximately 14:22 UTC our SecureSign production environment began experiencing signature verification failures. Approximately 340 pending documents across 12 customer accounts are affected. Root cause appears to be an expired intermediate SSL certificate in your signing chain. We need immediate remediation. Kevin Thompson, Security Engineer, LegalTech Partners&apos;),
(12, &apos;2025-09-17&apos;, &apos;Dear team, we operate DataStream to process 2TB of Kafka events daily. Starting last week we noticed exactly-once processing guarantees are failing intermittently. Approximately 0.3% of events are being duplicated in our downstream Postgres sink. This is causing financial reconciliation errors in our billing system. Medium priority but needs attention within 2 weeks. Jennifer Lee, Senior Data Engineer, FinServ Analytics&apos;),
(13, &apos;2025-09-18&apos;, &apos;I would like to schedule a product demo of AdOptimizer for our digital marketing agency. We manage ad spend for 45 clients across Google Ads Facebook and LinkedIn totaling approximately 2.5M monthly. Currently using a competitor but unhappy with the attribution modeling accuracy. When is your team available next week? Chris Martinez, Founder, DigitalEdge Agency&apos;),
(14, &apos;2025-09-20&apos;, &apos;Hi, we just completed our evaluation of ContractManager and would like to proceed with a purchase for 75 seats. We need the Salesforce integration enabled from day one. Our legal team processes roughly 200 contracts per month and we are currently tracking everything in spreadsheets. What is the implementation timeline? Rachel Adams, General Counsel, NovaTech Industries&apos;),
(15, &apos;2025-09-22&apos;, &apos;Attention: We detected unauthorized API access attempts against our LogInsight deployment between 2AM and 4AM EST today. The requests originated from IP addresses in a known threat intelligence database. While our firewall blocked the attempts, we want to understand if LogInsight has additional rate limiting or IP blocking capabilities we should enable. Mark Allen, CISO, DataShield Corp&apos;),
(16, &apos;2025-09-24&apos;, &apos;To billing department: Our organization PayFlow account 8847291 shows a currency conversion fee of 2.8% on GBP transactions. Our contract specifies a 1.5% rate for all EUR and GBP conversions. Please correct this billing discrepancy retroactively for September transactions totaling approximately 45000 GBP. Amanda Clark, Treasury Manager, EuroCommerce BV&apos;),
(17, &apos;2025-09-25&apos;, &apos;Hello, we have been running ChatAssist for our e-commerce customer support and the intent classification accuracy is excellent at around 94%. However we need to add support for Portuguese and Thai languages. Our customer base expanded to Brazil and Thailand this quarter. Is the multi-language add-on available for our current plan tier? Steven Moore, VP Customer Experience, GlobalShop&apos;),
(18, &apos;2025-09-27&apos;, &apos;I am the HR director at a 2000-employee manufacturing company. We need SchedulePro to handle complex shift patterns including rotating shifts split shifts and on-call schedules. Our current system cannot handle the overtime calculations required by state-specific labor laws in California New York and Texas. Can SchedulePro handle multi-state labor law compliance? Catherine Hall, HR Director, PrecisionMfg Inc&apos;),
(19, &apos;2025-09-28&apos;, &apos;Feature request: DesignHub needs better support for design tokens and component variables. When we update a color in our design system it should propagate to all linked components across all projects automatically. Currently we have to manually update 200+ components which defeats the purpose of a design system. Otherwise great product. Brian Harris, Design Systems Lead, PixelPerfect Studio&apos;),
(20, &apos;2025-09-30&apos;, &apos;Dear sales, I am reaching out on behalf of a consortium of 12 regional banks looking for a unified API management solution. We collectively process 4.2M API requests daily and need a solution that supports PSD2 compliance including strong customer authentication and secure communication. Can we arrange a meeting with your banking vertical team? Daniel Wilson, Technology Director, Regional Banking Alliance&apos;),
(21, &apos;2025-10-01&apos;, &apos;Hi, quick update on our InventoryTrack implementation. The barcode scanning module is working perfectly in our main warehouse but the multi-warehouse sync is showing a 15-minute delay between facilities. For perishable goods this delay causes stock discrepancies. Can we reduce the sync interval to real-time? Sophia Nguyen, Warehouse Operations Manager, FreshFoods Distribution&apos;),
(22, &apos;2025-10-03&apos;, &apos;To the product team at CloudSync: I have been a loyal customer for 3 years and want to share feedback. The recent UI redesign is excellent but the new settings menu is confusing. I cannot find the bandwidth throttling option which I use daily. Please make frequently used settings more accessible. Otherwise love the product and have recommended it to 5 colleagues. Laura Jackson, IT Consultant&apos;),
(23, &apos;2025-10-05&apos;, &apos;CRITICAL: Our DataVault encryption at rest failed an internal penetration test. The AES-256 implementation is using ECB mode instead of CBC or GCM for blocks larger than 16 bytes. This is a known vulnerability pattern. We need confirmation that this will be patched before our next compliance audit on November 15th. Michelle Lopez, Information Security Analyst, SecureBank NA&apos;),
(24, &apos;2025-10-06&apos;, &apos;Hello, I am a professor at MIT and we use QuickReport for our research data visualization. We are interested in an academic licensing program. Our department has 35 researchers and 120 graduate students who would benefit from the tool. Is there an education discount available? Dr. Jessica Young, Department of Data Science, MIT&apos;),
(25, &apos;2025-10-08&apos;, &apos;Support ticket follow-up: Our MailForge DKIM configuration issue ticket 4421 was marked resolved but we are still failing DMARC checks from Yahoo and AOL. The DKIM record appears correctly in DNS but the selector value does not match what MailForge sends in the email headers. Need this escalated back to engineering. Andrew White, Email Administrator, NewsMedia Group&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Contract Notes Table&lt;/h3&gt;
&lt;p&gt;This table simulates free-text contract summaries written by account managers. Each note contains key contract details buried in natural language that &lt;code&gt;AI_GENERATE&lt;/code&gt; will extract into structured columns.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aigenerateexp.document_data.contract_notes (
  note_id INT,
  account_manager VARCHAR,
  note_date DATE,
  note_text VARCHAR
);

INSERT INTO aigenerateexp.document_data.contract_notes VALUES
(1, &apos;Patricia Moore&apos;, &apos;2025-09-01&apos;, &apos;Closed deal with Acme Corp for CloudSync Pro enterprise license. 500 seats at $22/seat/month for 3-year term. Total contract value $396,000. Includes premium support and 99.99% SLA. Renewal auto-triggers 90 days before expiration. Key contact: Sarah Chen, VP Engineering.&apos;),
(2, &apos;Marcus Johnson&apos;, &apos;2025-09-04&apos;, &apos;MegaCorp Financial signed for QuickReport Premium. 200 users, 2-year commitment at $42/user/month. TCV $201,600. Custom integration with their Bloomberg terminal data feed required. Implementation starts Oct 1st. Executive sponsor: David Kim, CFO.&apos;),
(3, &apos;Patricia Moore&apos;, &apos;2025-09-08&apos;, &apos;Renewal discussion with Global Industries for DataVault Enterprise. Current contract of 350 seats expires Dec 31. They want to expand to 600 seats and add the healthcare compliance module. Proposed pricing: $75/seat/month for 600 seats, 3-year term. TCV $1,620,000. Pending legal review of updated BAA.&apos;),
(4, &apos;Sandra Lee&apos;, &apos;2025-09-12&apos;, &apos;New customer TechStart Inc closed for DataVault Enterprise. 150 seats, 3-year term at $82/seat/month. TCV $442,800. SOC 2 documentation provided. Implementation timeline: 6 weeks starting Oct 15. Procurement contact: Emily Watson.&apos;),
(5, &apos;Marcus Johnson&apos;, &apos;2025-09-15&apos;, &apos;DigitalEdge Agency signed AdOptimizer Enterprise with custom attribution modeling. 10 managed accounts, $1,499/month flat rate, 1-year term with option to renew. TCV $17,988. Agency plans to expand to 45 accounts in Q2 2026. Founder Chris Martinez is very enthusiastic about the attribution improvements.&apos;),
(6, &apos;Sandra Lee&apos;, &apos;2025-09-18&apos;, &apos;NovaTech Industries purchased ContractManager Professional. 75 seats at $62/seat/month, 2-year term. TCV $111,600. Salesforce integration required for day-one launch. Processing 200+ contracts monthly currently using spreadsheets. General Counsel Rachel Adams leading internal rollout.&apos;),
(7, &apos;Patricia Moore&apos;, &apos;2025-09-20&apos;, &apos;Lost deal: ServiceFirst Ltd cancelling HelpDesk360 subscription. 8 months remaining on annual contract at $54/seat for 100 seats. Refund request of $43,200 pending finance approval. Customer cited routing accuracy issues, knowledge base relevance problems, and 3 outages. Risk of negative public review.&apos;),
(8, &apos;Marcus Johnson&apos;, &apos;2025-09-22&apos;, &apos;Expansion deal with CloudNine Software for DevPipeline. Adding 80 European developer seats to existing 80 US seats. European deployment at $72/seat/month, 2-year aligned with US contract end. Additional TCV $138,240. CTO Michael Brown driving the expansion after successful US rollout.&apos;),
(9, &apos;Sandra Lee&apos;, &apos;2025-09-25&apos;, &apos;State of California DPT evaluating FormBuilder for government forms. WCAG 2.1 AA compliance confirmed. Potential 500-seat deployment at government rate of $15/seat/month, 5-year term. TCV $450,000. Requires VPAT documentation submission to procurement. Long sales cycle expected, 6-9 months.&apos;),
(10, &apos;Patricia Moore&apos;, &apos;2025-09-28&apos;, &apos;EuroCommerce BV billing dispute on PayFlow. Customer contract guarantees 1.5% FX rate on EUR/GBP but system charged 2.8% for September. Estimated overcharge: approximately $900 on 45K GBP volume. Finance investigating root cause. Treasury Manager Amanda Clark expects retroactive correction.&apos;),
(11, &apos;Marcus Johnson&apos;, &apos;2025-10-01&apos;, &apos;Regional Banking Alliance consortium deal for APIGateway Pro. 12 banks, centralized deployment, 4.2M daily API calls. PSD2 compliance required. Proposed tiered pricing based on volume: $15,000/month for the consortium. 3-year term. TCV $540,000. Technology Director Daniel Wilson coordinating across all 12 institutions.&apos;),
(12, &apos;Sandra Lee&apos;, &apos;2025-10-03&apos;, &apos;FreshFoods Distribution requesting InventoryTrack real-time sync upgrade. Current standard sync has 15-min delay between 4 warehouses causing perishable goods discrepancies. Upgrade to real-time tier: additional $20/warehouse/month. Annual incremental revenue: $960. Operations Manager Sophia Nguyen is the champion.&apos;),
(13, &apos;Patricia Moore&apos;, &apos;2025-10-05&apos;, &apos;GlobalShop expansion for ChatAssist multi-language support. Adding Portuguese and Thai to existing English and Spanish deployment. Current contract: 300 seats at $68/seat/month. Multi-language add-on: additional $12/seat/month. Added TCV for remaining 18 months: $64,800. VP Customer Experience Steven Moore confirmed budget approval.&apos;),
(14, &apos;Marcus Johnson&apos;, &apos;2025-10-06&apos;, &apos;PrecisionMfg Inc evaluating SchedulePro for 2000 employees across 3 US states. Complex requirements: rotating shifts, split shifts, on-call, multi-state overtime compliance for CA, NY, TX. Enterprise tier at $12/employee/month, 2-year term. TCV $576,000. HR Director Catherine Hall leading evaluation. POC planned for November.&apos;),
(15, &apos;Sandra Lee&apos;, &apos;2025-10-08&apos;, &apos;MIT academic licensing request for QuickReport. 35 researchers plus 120 graduate students. Academic program pricing: 70% discount, $14.99/seat/month for 155 seats. 1-year renewable. TCV $27,881. Dr. Jessica Young in Department of Data Science. Low revenue but high brand visibility in academic publications.&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 3: Build Bronze Views&lt;/h2&gt;
&lt;p&gt;Bronze views cast dates to timestamps and standardize column names.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aigenerateexp.bronze.v_emails AS
SELECT
  email_id,
  CAST(received_date AS TIMESTAMP) AS received_timestamp,
  email_body
FROM aigenerateexp.document_data.raw_emails;

CREATE OR REPLACE VIEW aigenerateexp.bronze.v_contracts AS
SELECT
  note_id,
  account_manager,
  CAST(note_date AS TIMESTAMP) AS note_timestamp,
  note_text
FROM aigenerateexp.document_data.contract_notes;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 4: Build Silver Views&lt;/h2&gt;
&lt;p&gt;This Silver view provides the unified email data that Gold views will process. At this stage, we simply promote the Bronze view for downstream extraction.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aigenerateexp.silver.v_email_pipeline AS
SELECT
  email_id,
  received_timestamp,
  email_body
FROM aigenerateexp.bronze.v_emails;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 5: Build Gold Views with AI_GENERATE&lt;/h2&gt;
&lt;h3&gt;Gold View 1: Email Information Extraction&lt;/h3&gt;
&lt;p&gt;This is the core use case for &lt;code&gt;AI_GENERATE&lt;/code&gt;. Each email contains a sender, their company, the topic, the urgency level, and an action item, but all of this is embedded in free-text prose. The &lt;code&gt;WITH SCHEMA&lt;/code&gt; clause tells the LLM exactly what fields to extract and what types to return.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aigenerateexp.gold.v_email_extracted AS
SELECT
  email_id,
  received_timestamp,
  email_body,
  extracted.sender_name,
  extracted.company,
  extracted.topic,
  extracted.urgency,
  extracted.action_required
FROM (
  SELECT
    email_id,
    received_timestamp,
    email_body,
    AI_GENERATE(
      &apos;Extract the following information from this email. If a field is not present, return N/A.&apos;,
      email_body
      WITH SCHEMA (
        sender_name VARCHAR,
        company VARCHAR,
        topic VARCHAR,
        urgency VARCHAR,
        action_required VARCHAR
      )
    ) AS extracted
  FROM aigenerateexp.silver.v_email_pipeline
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The subquery calls &lt;code&gt;AI_GENERATE&lt;/code&gt; and aliases the result as &lt;code&gt;extracted&lt;/code&gt;. The outer query then expands the ROW using dot notation (&lt;code&gt;extracted.sender_name&lt;/code&gt;, &lt;code&gt;extracted.company&lt;/code&gt;, etc.). Each field becomes a regular column you can filter, group, or join on.&lt;/p&gt;
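&lt;p&gt;Because each extracted field is now an ordinary column, standard SQL works directly against the view. For example, to see which companies generate the most mail at each urgency level:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Standard aggregation over the LLM-extracted columns
SELECT
  urgency,
  company,
  COUNT(*) AS email_count
FROM aigenerateexp.gold.v_email_extracted
GROUP BY urgency, company
ORDER BY email_count DESC;
&lt;/code&gt;&lt;/pre&gt;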
&lt;h3&gt;Gold View 2: Contract Detail Extraction&lt;/h3&gt;
&lt;p&gt;Contract notes contain structured deal information in natural language. &lt;code&gt;AI_GENERATE&lt;/code&gt; extracts the client name, product, seat count, contract value, term, and key contact into individual columns.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aigenerateexp.gold.v_contract_details AS
SELECT
  note_id,
  account_manager,
  note_timestamp,
  note_text,
  details.client_name,
  details.product,
  details.seat_count,
  details.monthly_rate,
  details.contract_term_years,
  details.total_contract_value,
  details.key_contact,
  details.deal_status
FROM (
  SELECT
    note_id,
    account_manager,
    note_timestamp,
    note_text,
    AI_GENERATE(
      &apos;Extract deal information from this contract note. For total_contract_value use only the numeric amount. For deal_status classify as Won, Lost, Pending, or Expansion.&apos;,
      note_text
      WITH SCHEMA (
        client_name VARCHAR,
        product VARCHAR,
        seat_count INT,
        monthly_rate DECIMAL(10,2),
        contract_term_years INT,
        total_contract_value DECIMAL(12,2),
        key_contact VARCHAR,
        deal_status VARCHAR
      )
    ) AS details
  FROM aigenerateexp.bronze.v_contracts
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice the &lt;code&gt;WITH SCHEMA&lt;/code&gt; uses &lt;code&gt;INT&lt;/code&gt; for seat count, &lt;code&gt;DECIMAL&lt;/code&gt; for monetary values, and &lt;code&gt;VARCHAR&lt;/code&gt; for text fields. The LLM converts the free-text values to the types you specify. If a contract note says &amp;quot;75 seats,&amp;quot; the &lt;code&gt;seat_count&lt;/code&gt; column returns the integer &lt;code&gt;75&lt;/code&gt;.&lt;/p&gt;
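&lt;p&gt;Typed output is what makes downstream aggregation possible. A quick sketch of a pipeline summary that relies on &lt;code&gt;total_contract_value&lt;/code&gt; being a &lt;code&gt;DECIMAL&lt;/code&gt; rather than text:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- SUM only works because total_contract_value was extracted
-- as DECIMAL, not VARCHAR
SELECT
  deal_status,
  COUNT(*) AS deals,
  SUM(total_contract_value) AS pipeline_value
FROM aigenerateexp.gold.v_contract_details
GROUP BY deal_status
ORDER BY pipeline_value DESC;
&lt;/code&gt;&lt;/pre&gt;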
&lt;h3&gt;How WITH SCHEMA Changes the Output&lt;/h3&gt;
&lt;p&gt;Without &lt;code&gt;WITH SCHEMA&lt;/code&gt;, &lt;code&gt;AI_GENERATE&lt;/code&gt; returns a &lt;code&gt;VARCHAR&lt;/code&gt; with the LLM&apos;s freeform response. This is harder to work with downstream:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Without WITH SCHEMA: returns plain text
SELECT AI_GENERATE(
  &apos;Extract the sender name and company from this email&apos;,
  email_body
) AS raw_text
FROM aigenerateexp.bronze.v_emails
LIMIT 3;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The raw text might look like &amp;quot;Sender: Sarah Chen, Company: Acme Corp&amp;quot;, but there&apos;s no guarantee of consistent formatting across rows. With &lt;code&gt;WITH SCHEMA&lt;/code&gt;, every row returns the same column structure, making the output predictable and queryable.&lt;/p&gt;
&lt;h2&gt;Persisting Results with CTAS&lt;/h2&gt;
&lt;p&gt;Materialize your extracted data into Iceberg tables to avoid repeated LLM calls:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aigenerateexp.gold.emails_extracted AS
SELECT * FROM aigenerateexp.gold.v_email_extracted;

CREATE TABLE aigenerateexp.gold.contracts_extracted AS
SELECT * FROM aigenerateexp.gold.v_contract_details;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once materialized, you can run standard SQL analytics on the extracted fields without incurring LLM token costs. Refresh the tables when new emails or contracts arrive.&lt;/p&gt;
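&lt;p&gt;One way to refresh incrementally is to insert only rows whose IDs are not yet in the materialized table. The sketch below assumes &lt;code&gt;email_id&lt;/code&gt; values are stable and unique; also note that whether the planner avoids re-running &lt;code&gt;AI_GENERATE&lt;/code&gt; on already-processed rows depends on predicate pushdown, so for large backlogs it is safer to filter before the extraction step:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hypothetical incremental refresh: only extract emails
-- not yet present in the materialized table
INSERT INTO aigenerateexp.gold.emails_extracted
SELECT *
FROM aigenerateexp.gold.v_email_extracted
WHERE email_id NOT IN (
  SELECT email_id FROM aigenerateexp.gold.emails_extracted
);
&lt;/code&gt;&lt;/pre&gt;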
&lt;h2&gt;Step 6: Enable AI-Generated Wikis and Tags&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Admin&lt;/strong&gt; in the left sidebar, then go to &lt;strong&gt;Project Settings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Preferences&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Scroll to the &lt;strong&gt;AI&lt;/strong&gt; section and enable &lt;strong&gt;Generate Wikis and Labels&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;Catalog&lt;/strong&gt; and navigate to your Gold views under &lt;code&gt;aigenerateexp.gold&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Edit&lt;/strong&gt; button (pencil icon) next to each view.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Details&lt;/strong&gt; tab, click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Repeat for all Gold views.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Enhance the generated wikis with context like &amp;quot;sender_name and company are LLM-extracted from raw email text. Urgency is classified by the LLM based on language cues like &apos;urgent&apos;, &apos;critical&apos;, and &apos;immediate&apos;.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Step 7: Ask Questions with the AI Agent&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Which companies sent urgent emails?&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_email_extracted&lt;/code&gt;, filters by &lt;code&gt;urgency&lt;/code&gt; containing &apos;urgent&apos; or &apos;critical&apos;, and returns the company names and topics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Show me a chart of email topics by urgency level&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent groups by &lt;code&gt;topic&lt;/code&gt; and &lt;code&gt;urgency&lt;/code&gt; in &lt;code&gt;v_email_extracted&lt;/code&gt; and creates a visualization showing which topics generate the most urgent communications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;List all won deals over $100,000 with their key contacts&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent filters &lt;code&gt;v_contract_details&lt;/code&gt; for &lt;code&gt;deal_status = &apos;Won&apos;&lt;/code&gt; and &lt;code&gt;total_contract_value &amp;gt; 100000&lt;/code&gt;, returning client names, products, values, and key contacts.&lt;/p&gt;
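&lt;p&gt;Under the hood, the SQL the Agent generates for a question like this would look roughly as follows (a sketch, not the Agent&apos;s literal output):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT client_name, product, total_contract_value, key_contact
FROM aigenerateexp.gold.v_contract_details
WHERE deal_status = &apos;Won&apos;
  AND total_contract_value &amp;gt; 100000
ORDER BY total_contract_value DESC;
&lt;/code&gt;&lt;/pre&gt;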
&lt;p&gt;&lt;strong&gt;&amp;quot;Create a chart showing total contract value by account manager and deal status&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent creates a stacked bar chart from &lt;code&gt;v_contract_details&lt;/code&gt; comparing each account manager&apos;s total pipeline across Won, Lost, Pending, and Expansion statuses.&lt;/p&gt;
&lt;h2&gt;Processing Unstructured Files with AI_GENERATE and LIST_FILES&lt;/h2&gt;
&lt;p&gt;The examples above process text that&apos;s already stored in table columns. But many organizations also have unstructured files, such as PDFs, text documents, images, and scanned invoices, sitting in object storage (S3, Azure Blob, GCS), none of which has ever been queryable through SQL.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s &lt;code&gt;LIST_FILES&lt;/code&gt; table function bridges this gap. It recursively lists files from a connected source and returns metadata about each file. Combined with &lt;code&gt;AI_GENERATE&lt;/code&gt;, you can read file content and extract structured data from documents that were previously invisible to your analytics platform.&lt;/p&gt;
&lt;h3&gt;How LIST_FILES Works&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;LIST_FILES&lt;/code&gt; is a table function that returns metadata for files in a connected storage source:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT *
FROM TABLE(
  LIST_FILES(
    path =&amp;gt; &apos;your_s3_source.folder_name&apos;,
    recursive =&amp;gt; true
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function returns columns including the file source, path, size, and last modification time. This metadata feeds into &lt;code&gt;AI_GENERATE&lt;/code&gt; as file references.&lt;/p&gt;
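&lt;p&gt;Since every downstream file becomes an LLM call, it usually pays to filter on this metadata first. A hedged sketch follows; the exact column names returned by &lt;code&gt;LIST_FILES&lt;/code&gt; may differ in your Dremio version, so check the function&apos;s actual output before relying on them:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Keep only PDFs under ~5 MB before any AI processing
-- (the file_path and file_size column names are assumptions)
SELECT *
FROM TABLE(
  LIST_FILES(
    path =&amp;gt; &apos;your_s3_source.folder_name&apos;,
    recursive =&amp;gt; true
  )
)
WHERE LOWER(file_path) LIKE &apos;%.pdf&apos;
  AND file_size &amp;lt; 5000000;
&lt;/code&gt;&lt;/pre&gt;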
&lt;h3&gt;Hypothetical Example: Invoice Processing&lt;/h3&gt;
&lt;p&gt;Suppose you have an S3 bucket connected to Dremio as a source called &lt;code&gt;company_s3&lt;/code&gt;, with a folder &lt;code&gt;/invoices/2025/&lt;/code&gt; containing PDF invoices from vendors. Here&apos;s how you&apos;d extract structured data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Step 1: List all invoice files
SELECT *
FROM TABLE(
  LIST_FILES(
    path =&amp;gt; &apos;company_s3.invoices.2025&apos;,
    recursive =&amp;gt; true
  )
);

-- Step 2: Extract structured data from each invoice
SELECT
  invoice_data.vendor_name,
  invoice_data.invoice_number,
  invoice_data.invoice_date,
  invoice_data.total_amount,
  invoice_data.currency,
  invoice_data.line_items
FROM (
  SELECT AI_GENERATE(
    &apos;Extract the vendor name, invoice number, date, total amount, currency, and a summary of line items from this invoice.&apos;,
    file_content
    WITH SCHEMA (
      vendor_name VARCHAR,
      invoice_number VARCHAR,
      invoice_date VARCHAR,
      total_amount DECIMAL(12,2),
      currency VARCHAR,
      line_items VARCHAR
    )
  ) AS invoice_data
  FROM TABLE(
    LIST_FILES(
      path =&amp;gt; &apos;company_s3.invoices.2025&apos;,
      recursive =&amp;gt; true
    )
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hypothetical Example: Resume Screening&lt;/h3&gt;
&lt;p&gt;An HR team stores candidate resumes as PDFs in an S3 bucket. &lt;code&gt;AI_GENERATE&lt;/code&gt; extracts candidate information for structured analysis:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  candidate.full_name,
  candidate.email,
  candidate.years_experience,
  candidate.primary_skill,
  candidate.education_level,
  candidate.current_company
FROM (
  SELECT AI_GENERATE(
    &apos;Extract candidate information from this resume. For years_experience provide a numeric estimate.&apos;,
    file_content
    WITH SCHEMA (
      full_name VARCHAR,
      email VARCHAR,
      years_experience INT,
      primary_skill VARCHAR,
      education_level VARCHAR,
      current_company VARCHAR
    )
  ) AS candidate
  FROM TABLE(
    LIST_FILES(
      path =&amp;gt; &apos;hr_s3.resumes.2025_q4&apos;,
      recursive =&amp;gt; true
    )
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Hypothetical Example: Quarterly Report Analysis&lt;/h3&gt;
&lt;p&gt;Finance stores quarterly PDF reports from subsidiaries. Extract key financial metrics without manual reading:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  metrics.subsidiary_name,
  metrics.quarter,
  metrics.total_revenue,
  metrics.net_income,
  metrics.headcount,
  metrics.key_risks
FROM (
  SELECT AI_GENERATE(
    &apos;Extract financial summary data from this quarterly report.&apos;,
    file_content
    WITH SCHEMA (
      subsidiary_name VARCHAR,
      quarter VARCHAR,
      total_revenue DECIMAL(15,2),
      net_income DECIMAL(15,2),
      headcount INT,
      key_risks VARCHAR
    )
  ) AS metrics
  FROM TABLE(
    LIST_FILES(
      path =&amp;gt; &apos;finance_s3.quarterly_reports.2025&apos;,
      recursive =&amp;gt; true
    )
  )
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Materializing File Extraction Results&lt;/h3&gt;
&lt;p&gt;Once you&apos;ve extracted structured data from files, persist it as an Iceberg table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aigenerateexp.gold.invoices_extracted AS
SELECT
  invoice_data.vendor_name,
  invoice_data.invoice_number,
  invoice_data.total_amount,
  invoice_data.currency
FROM (
  SELECT AI_GENERATE(
    &apos;Extract invoice details&apos;,
    file_content
    WITH SCHEMA (vendor_name VARCHAR, invoice_number VARCHAR, total_amount DECIMAL(12,2), currency VARCHAR)
  ) AS invoice_data
  FROM TABLE(LIST_FILES(path =&amp;gt; &apos;company_s3.invoices.2025&apos;, recursive =&amp;gt; true))
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a governed, queryable Iceberg table from raw PDF invoices. The table supports time travel, schema evolution, and ACID transactions. Build Reflections on it for dashboard acceleration.&lt;/p&gt;
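&lt;p&gt;For example, because the extraction results live in an Iceberg table, you can query them as of an earlier point in time (syntax per Dremio&apos;s Iceberg time-travel support; the timestamp here is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Read the extractions as they existed at an earlier snapshot
SELECT vendor_name, total_amount
FROM aigenerateexp.gold.invoices_extracted
AT TIMESTAMP &apos;2025-10-01 00:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;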
&lt;h3&gt;Key Considerations for LIST_FILES + AI_GENERATE&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Source connectivity:&lt;/strong&gt; &lt;code&gt;LIST_FILES&lt;/code&gt; requires a connected storage source (S3, Azure Storage, GCS) in your Dremio project. The source must be configured with appropriate read permissions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File format support:&lt;/strong&gt; Dremio&apos;s AI functions can process text-based content including PDFs, text files, and document formats. The LLM interprets the file content and extracts fields per your schema definition.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Token costs:&lt;/strong&gt; Processing files through the LLM consumes tokens proportional to file size. Filter your &lt;code&gt;LIST_FILES&lt;/code&gt; results before passing them to &lt;code&gt;AI_GENERATE&lt;/code&gt; to avoid processing unnecessary files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Filter to only recent files before AI processing
SELECT AI_GENERATE(...)
FROM TABLE(LIST_FILES(path =&amp;gt; &apos;company_s3.invoices.2025&apos;, recursive =&amp;gt; true))
WHERE modification_time &amp;gt; TIMESTAMP &apos;2025-09-01 00:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Engine routing:&lt;/strong&gt; Use &lt;code&gt;query_calls_ai_functions()&lt;/code&gt; to route file processing queries to a dedicated engine, isolating heavy batch extraction from your regular analytical workloads.&lt;/p&gt;
&lt;h2&gt;Why Apache Iceberg Matters&lt;/h2&gt;
&lt;p&gt;Extracted data stored as Iceberg tables benefits from automated performance management. As your extraction pipeline grows from hundreds to thousands of documents, Iceberg&apos;s compaction, manifest optimization, and clustering keep query performance consistent without manual tuning.&lt;/p&gt;
&lt;h3&gt;Iceberg vs. Federated for AI_GENERATE Workloads&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use CTAS materialization when:&lt;/strong&gt; You&apos;re extracting from historical documents (past invoices, old contracts, archived emails). Run the extraction once, query the results forever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use live views when:&lt;/strong&gt; You need real-time extraction from a continuously updating text column in a federated database. Pair with manual Reflections to cache results at a controlled refresh interval, balancing extraction cost against data freshness.&lt;/p&gt;
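&lt;p&gt;A raw Reflection on the live extraction view might be declared like this (a sketch using Dremio&apos;s Reflection DDL; the Reflection name is arbitrary, and the refresh schedule is configured separately in the dataset settings):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER DATASET aigenerateexp.gold.v_email_extracted
CREATE RAW REFLECTION email_extracted_raw
USING DISPLAY (email_id, received_timestamp, sender_name, company, topic, urgency, action_required);
&lt;/code&gt;&lt;/pre&gt;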
&lt;h2&gt;Next Steps&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Connect your real data sources&lt;/strong&gt; — replace simulated tables with federated connections to your email system, CRM, and document storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connect an S3 or Azure source&lt;/strong&gt; — enable &lt;code&gt;LIST_FILES&lt;/code&gt; processing on your actual unstructured file repositories&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add FGAC&lt;/strong&gt; — mask extracted PII fields (emails, phone numbers, names) for downstream consumers who shouldn&apos;t see personal data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build Reflections&lt;/strong&gt; — create Reflections on CTAS-materialized extraction tables for fast dashboard queries at zero LLM cost&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your organization has unstructured text trapped in database columns or files sitting unanalyzed in object storage, &lt;code&gt;AI_GENERATE&lt;/code&gt; turns that text into structured, queryable, governed data. Define a schema, write a prompt, and run a query. The extraction happens inside your lakehouse with the same access controls and governance that apply to all your other data.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-generate-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start extracting structured data from your unstructured text.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect Oracle Database to Dremio Cloud: Enterprise Analytics Without Data Movement</title><link>https://iceberglakehouse.com/posts/2026-03-connector-oracle/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-oracle/</guid><description>
Oracle Database runs the most critical enterprise applications in the world — ERP systems, financial ledgers, supply chain management, and HR platfor...</description><pubDate>Sun, 01 Mar 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Oracle Database runs the most critical enterprise applications in the world — ERP systems, financial ledgers, supply chain management, and HR platforms. These systems generate massive volumes of data that business teams want to analyze, but running analytical queries directly against Oracle is expensive (license costs scale with CPU usage), complex (Oracle-specific SQL dialects and tooling), and risky (heavy queries can impact transactional performance).&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects to Oracle Database and queries it in place using standard SQL. You don&apos;t need to license additional Oracle tools, build ETL pipelines, or export data to a separate warehouse. Dremio pushes filters and aggregations to Oracle, fetches only the results, and lets you join Oracle data with every other source in your organization in a single query.&lt;/p&gt;
&lt;p&gt;This guide walks through the complete setup, including Oracle-specific features like native encryption, user impersonation, service name configuration, and the extensive predicate pushdown support.&lt;/p&gt;
&lt;h2&gt;Why Oracle Users Need Dremio&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Oracle licensing costs make analytics expensive.&lt;/strong&gt; Oracle licenses are typically tied to CPU cores. Running analytical workloads on your production Oracle instance consumes CPU, which means higher licensing costs. Dremio&apos;s Reflections create pre-computed copies of frequently queried Oracle data. After the initial query, subsequent analytics hit the Reflection — not Oracle — reducing CPU consumption and license exposure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-system analytics require ETL.&lt;/strong&gt; Your financial data is in Oracle, your CRM data is in PostgreSQL, and your marketing data is in S3. Without a federation layer, joining these requires building ETL pipelines that extract data from each source, transform it, and load it into a central warehouse. That&apos;s months of engineering work. Dremio federates across all three sources with a single SQL query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Oracle&apos;s analytical tooling is Oracle-specific.&lt;/strong&gt; Oracle Analytics Cloud, Oracle BI, and Oracle Data Integrator work well within the Oracle ecosystem but don&apos;t extend to non-Oracle data. Dremio provides a vendor-neutral SQL layer that works with any BI tool (Tableau, Power BI, Looker) via Arrow Flight or ODBC, covering Oracle and every other connected source.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No semantic layer for AI.&lt;/strong&gt; Oracle tables use technical names and lack the business context that AI agents need to generate accurate SQL. Dremio&apos;s semantic layer lets you create views with business logic, attach wiki descriptions, and enable the AI Agent to answer questions like &amp;quot;What&apos;s our quarterly revenue by product line?&amp;quot; by understanding what &amp;quot;quarterly revenue&amp;quot; means from your metadata.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before connecting Oracle to Dremio Cloud, confirm you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Oracle hostname or IP address&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; — Oracle defaults to &lt;code&gt;1521&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service name&lt;/strong&gt; — the Oracle service name (not the SID) for your database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; — an Oracle user with &lt;code&gt;SELECT&lt;/code&gt; privileges on the relevant schemas&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — port 1521 must be reachable from Dremio Cloud&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-oracle-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect Oracle to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the Oracle Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; in the left sidebar and select &lt;strong&gt;Oracle&lt;/strong&gt; from the database source types.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;erp-oracle&lt;/code&gt; or &lt;code&gt;finance-oracle&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; The Oracle server hostname.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;1521&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service Name:&lt;/strong&gt; The Oracle service name for your database.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable TLS encryption:&lt;/strong&gt; Toggle this on for encrypted connections over TLS.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Oracle Native Encryption:&lt;/strong&gt; If you don&apos;t use TLS, Oracle supports its own encryption protocol. Options are:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Accepted (default):&lt;/strong&gt; Allows both encrypted and unencrypted connections.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Requested:&lt;/strong&gt; Prefers encryption but accepts unencrypted if not available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Required:&lt;/strong&gt; Only encrypted connections allowed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rejected:&lt;/strong&gt; Refuses encryption; the connection is unencrypted.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can use either TLS or Oracle Native Encryption, but not both on the same source.&lt;/p&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Dremio supports three authentication methods:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Master Authentication:&lt;/strong&gt; Username and password entered directly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Resource URL:&lt;/strong&gt; Password stored in AWS Secrets Manager, referenced by ARN.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kerberos:&lt;/strong&gt; For environments where Oracle is configured with Kerberos authentication.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;p&gt;Oracle has several unique advanced settings:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use timezone as connection region&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Derives the connection region from the configured timezone instead of the client default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Include synonyms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Makes Oracle synonyms visible as datasets in Dremio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Map Oracle DATE to TIMESTAMP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Oracle&apos;s &lt;code&gt;DATE&lt;/code&gt; type includes time components. Enable this to expose them as &lt;code&gt;TIMESTAMP&lt;/code&gt; in Dremio instead of truncating to &lt;code&gt;DATE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch (default 200, set 0 for automatic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use LDAP Naming Services&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Authenticate via LDAP rather than Oracle&apos;s local user database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User Impersonation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Run queries under each Dremio user&apos;s own Oracle credentials (see below)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
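&lt;p&gt;The &lt;strong&gt;Connection Properties&lt;/strong&gt; field passes key-value pairs straight to the Oracle JDBC driver. As an illustration (these are standard Oracle JDBC timeout settings; confirm the names against your driver version):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;oracle.jdbc.ReadTimeout=60000
oracle.net.CONNECT_TIMEOUT=10000
&lt;/code&gt;&lt;/pre&gt;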
&lt;h3&gt;5. User Impersonation (Optional but Valuable)&lt;/h3&gt;
&lt;p&gt;Oracle supports user impersonation through proxy authentication. This means each Dremio user runs queries under their own Oracle username, with their own Oracle permissions, rather than sharing a single service account.&lt;/p&gt;
&lt;p&gt;To set this up:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Ensure each Dremio user has a matching username in Oracle.&lt;/li&gt;
&lt;li&gt;In Oracle, grant proxy authentication: &lt;code&gt;ALTER USER analyst_user GRANT CONNECT THROUGH dremio_service_user;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;In Dremio&apos;s source settings, enable &lt;strong&gt;User Impersonation&lt;/strong&gt; under Advanced Options.&lt;/li&gt;
&lt;/ol&gt;
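&lt;p&gt;You can confirm the grant from Oracle&apos;s data dictionary. A quick check (run in Oracle; the usernames match the example in step 2):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT proxy, client
FROM dba_proxies
WHERE proxy = &apos;DREMIO_SERVICE_USER&apos;;
&lt;/code&gt;&lt;/pre&gt;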
&lt;p&gt;This is particularly valuable in regulated industries where audit trails need to track which individual accessed which data.&lt;/p&gt;
&lt;h3&gt;6. Save the Connection&lt;/h3&gt;
&lt;p&gt;Configure Reflection Refresh, Metadata Refresh, and Privileges as needed, then click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query Oracle Data from Dremio&lt;/h2&gt;
&lt;p&gt;Browse your Oracle schemas and tables, then run standard SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT department_id, department_name, manager_id, location_id
FROM &amp;quot;erp-oracle&amp;quot;.HR.DEPARTMENTS
WHERE location_id = 1700;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the &lt;code&gt;WHERE&lt;/code&gt; clause to Oracle and transfers only the matching rows.&lt;/p&gt;
&lt;h2&gt;Federate Oracle with Other Sources&lt;/h2&gt;
&lt;p&gt;Combine Oracle ERP data with S3 data and PostgreSQL data in one query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  d.department_name,
  COUNT(e.employee_id) AS headcount,
  AVG(e.salary) AS avg_salary,
  SUM(b.budget_amount) AS total_budget
FROM &amp;quot;erp-oracle&amp;quot;.HR.DEPARTMENTS d
JOIN &amp;quot;erp-oracle&amp;quot;.HR.EMPLOYEES e ON d.department_id = e.department_id
LEFT JOIN &amp;quot;finance-postgres&amp;quot;.budgets.dept_budgets b ON d.department_id = b.dept_id
GROUP BY d.department_name
ORDER BY total_budget DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the department-employee join down to Oracle, then performs the cross-source join against the PostgreSQL budget data itself.&lt;/p&gt;
&lt;h2&gt;Predicate Pushdown Support&lt;/h2&gt;
&lt;p&gt;Oracle has one of the most comprehensive pushdown profiles in Dremio. The engine offloads:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All standard comparisons and logical operators&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregations:&lt;/strong&gt; SUM, AVG, COUNT, MIN, MAX, STDDEV, MEDIAN, VAR_POP, VAR_SAMP, COVAR_POP, COVAR_SAMP, PERCENTILE_CONT, PERCENTILE_DISC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Math functions:&lt;/strong&gt; ABS, CEIL, FLOOR, ROUND, MOD, SQRT, POWER, LOG, EXP, SIGN, trigonometric functions (SIN, COS, TAN, ASIN, ACOS, ATAN, SINH, COSH, TANH)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String functions:&lt;/strong&gt; CONCAT, SUBSTR, LENGTH, LOWER, UPPER, TRIM, REPLACE, REVERSE, LPAD, RPAD&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Date functions:&lt;/strong&gt; DATE_ADD, DATE_SUB, DATE_TRUNC, EXTRACT, ADD_MONTHS, LAST_DAY, TO_CHAR, TO_DATE, TRUNC&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This extensive pushdown support means Oracle does most of the heavy lifting for filtering and aggregation, and Dremio only transfers the summarized results across the network.&lt;/p&gt;
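&lt;p&gt;For example, an aggregation like the following executes entirely inside Oracle, so Dremio receives only one row per department (table names follow the HR sample schema used above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  department_id,
  COUNT(*) AS headcount,
  ROUND(AVG(salary), 2) AS avg_salary
FROM &amp;quot;erp-oracle&amp;quot;.HR.EMPLOYEES
GROUP BY department_id;
&lt;/code&gt;&lt;/pre&gt;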
&lt;h2&gt;Data Type Mapping&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Oracle&lt;/th&gt;
&lt;th&gt;Dremio&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NUMBER&lt;/td&gt;
&lt;td&gt;DECIMAL&lt;/td&gt;
&lt;td&gt;Preserves precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VARCHAR2 / NVARCHAR2 / CHAR / NCHAR&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;DATE or TIMESTAMP&lt;/td&gt;
&lt;td&gt;Use advanced option to map to TIMESTAMP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BINARY_FLOAT&lt;/td&gt;
&lt;td&gt;FLOAT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BINARY_DOUBLE&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLOAT&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BLOB / RAW / LONG RAW&lt;/td&gt;
&lt;td&gt;VARBINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LONG&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INTERVALDS&lt;/td&gt;
&lt;td&gt;INTERVAL DAY TO SECOND&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INTERVALYM&lt;/td&gt;
&lt;td&gt;INTERVAL YEAR TO MONTH&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;code&gt;CLOB&lt;/code&gt;, &lt;code&gt;XMLTYPE&lt;/code&gt;, and Oracle spatial types are not supported through the connector.&lt;/p&gt;
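&lt;p&gt;If you need text stored in a &lt;code&gt;CLOB&lt;/code&gt;, one common workaround is to expose a truncated &lt;code&gt;VARCHAR2&lt;/code&gt; through an Oracle-side view, which the connector then reads as &lt;code&gt;VARCHAR&lt;/code&gt;. A sketch (run in Oracle; the table and column names are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- DBMS_LOB.SUBSTR returns the first 4000 characters as VARCHAR2
CREATE VIEW hr.notes_for_dremio AS
SELECT
  note_id,
  DBMS_LOB.SUBSTR(note_body, 4000, 1) AS note_excerpt
FROM hr.employee_notes;
&lt;/code&gt;&lt;/pre&gt;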
&lt;h2&gt;Build a Semantic Layer Over Oracle&lt;/h2&gt;
&lt;p&gt;Create views that translate Oracle&apos;s technical schema into business-friendly analytics:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.department_performance AS
SELECT
  d.department_name,
  COUNT(e.employee_id) AS employee_count,
  ROUND(AVG(e.salary), 2) AS avg_salary,
  MAX(e.hire_date) AS most_recent_hire,
  CASE
    WHEN COUNT(e.employee_id) &amp;gt; 50 THEN &apos;Large&apos;
    WHEN COUNT(e.employee_id) &amp;gt; 20 THEN &apos;Medium&apos;
    ELSE &apos;Small&apos;
  END AS department_size
FROM &amp;quot;erp-oracle&amp;quot;.HR.DEPARTMENTS d
LEFT JOIN &amp;quot;erp-oracle&amp;quot;.HR.EMPLOYEES e ON d.department_id = e.department_id
GROUP BY d.department_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Attach wiki context via the Catalog (edit pencil icon → Details → Generate Wiki/Tags) so the AI Agent can answer questions like &amp;quot;Which large departments have the highest average salary?&amp;quot;&lt;/p&gt;
&lt;h2&gt;When to Keep Data in Oracle vs. Migrate to Iceberg&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in Oracle:&lt;/strong&gt; Actively transactional data (current orders, inventory, ledger entries), data that applications read and write frequently, data with PL/SQL dependencies (stored procedures, triggers, packages), and data managed by Oracle RAC clustering or GoldenGate replication.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical archives (closed fiscal quarters, past-year orders), aggregated reporting tables, datasets consumed by non-Oracle tools, and data queried heavily for analytics but rarely written, especially where Oracle per-core licensing outweighs the analytical value.&lt;/p&gt;
&lt;p&gt;For data that stays in Oracle, create manual Reflections with a refresh schedule that balances data freshness against Oracle CPU usage. For migrated data, Dremio&apos;s Open Catalog provides automated compaction, time travel, and Autonomous Reflections.&lt;/p&gt;
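&lt;p&gt;Migrating a closed period into an Iceberg table can be a single CTAS statement in Dremio (a sketch; the &lt;code&gt;archive&lt;/code&gt; folder and the &lt;code&gt;SALES.ORDERS&lt;/code&gt; table are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Copy last year&apos;s closed orders into an Iceberg table in the Open Catalog
CREATE TABLE archive.orders_2024 AS
SELECT *
FROM &amp;quot;erp-oracle&amp;quot;.SALES.ORDERS
WHERE order_date &amp;lt; DATE &apos;2025-01-01&apos;;
&lt;/code&gt;&lt;/pre&gt;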
&lt;h2&gt;AI-Powered Analytics on Oracle Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets business users ask questions about Oracle data in plain English. An HR director asks &amp;quot;Which large departments have the highest average salary?&amp;quot; and the Agent generates accurate SQL by reading the wiki descriptions on your &lt;code&gt;department_performance&lt;/code&gt; view. The Agent understands what &amp;quot;large&amp;quot; means (employee_count &amp;gt; 50) because you&apos;ve defined it in the semantic layer.&lt;/p&gt;
&lt;p&gt;This is particularly valuable for Oracle environments where decades of institutional knowledge about schema structures, table naming conventions (like &lt;code&gt;HR.DEPARTMENTS&lt;/code&gt;), and column semantics lives in senior DBAs&apos; heads.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects Claude, ChatGPT, and other AI clients to your Oracle data through Dremio:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt; for Claude, &lt;code&gt;https://chatgpt.com/connector_platform_oauth_redirect&lt;/code&gt; for ChatGPT)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt; (US) or &lt;code&gt;mcp.eu.dremio.cloud/mcp/{project_id}&lt;/code&gt; (EU)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A CFO asks Claude &amp;quot;Compare department headcount and budget utilization across our Oracle ERP&amp;quot; and gets governed, accurate results from Oracle data — without knowing SQL or Oracle table structures.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against Oracle data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify departments by operational health
SELECT
  department_name,
  employee_count,
  avg_salary,
  department_size,
  AI_CLASSIFY(
    &apos;Based on these HR metrics, classify the department health&apos;,
    &apos;Department: &apos; || department_name || &apos;, Employees: &apos; || CAST(employee_count AS VARCHAR) || &apos;, Avg Salary: $&apos; || CAST(avg_salary AS VARCHAR) || &apos;, Size: &apos; || department_size,
    ARRAY[&apos;Thriving&apos;, &apos;Stable&apos;, &apos;Understaffed&apos;, &apos;Needs Attention&apos;]
  ) AS department_health
FROM analytics.gold.department_performance;

-- Generate executive briefings from Oracle data
SELECT
  department_name,
  AI_GENERATE(
    &apos;Write a one-sentence executive summary for this department&apos;,
    &apos;Department: &apos; || department_name || &apos;, Headcount: &apos; || CAST(employee_count AS VARCHAR) || &apos;, Avg Salary: $&apos; || CAST(avg_salary AS VARCHAR) || &apos;, Most Recent Hire: &apos; || CAST(most_recent_hire AS VARCHAR)
  ) AS executive_summary
FROM analytics.gold.department_performance
WHERE department_size = &apos;Large&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; runs LLM inference to categorize departments. &lt;code&gt;AI_GENERATE&lt;/code&gt; creates narrative summaries. Both run inline in your SQL queries, enriching Oracle data with AI.&lt;/p&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;Oracle Database licensing is expensive — especially Enterprise Edition with Analytics and Diagnostics Packs. Reflections offload analytical queries:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt; — for HR data, daily; for financial data, match to reporting cycles&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools get sub-second responses from Reflections. Oracle focuses on transactional workloads. A department performance dashboard with hourly refreshes generates zero Oracle CPU consumption after the Reflection is built.&lt;/p&gt;
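&lt;p&gt;Reflections can also be defined in SQL instead of the UI. A sketch using Dremio&apos;s reflection DDL against the view defined earlier (the reflection name is arbitrary):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER DATASET analytics.gold.department_performance
CREATE RAW REFLECTION dept_perf_raw
USING DISPLAY (department_name, employee_count, avg_salary, department_size);
&lt;/code&gt;&lt;/pre&gt;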
&lt;h2&gt;Governance on Oracle Data&lt;/h2&gt;
&lt;p&gt;Oracle has its own security model (Oracle Database Vault, VPD), but it doesn&apos;t extend to non-Oracle sources. Dremio&apos;s Fine-Grained Access Control (FGAC) provides unified governance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask salary data, employee SSNs, and performance ratings from specific roles. An HR generalist sees headcount but not compensation details.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Department-level access — a department manager sees only their department. Regional HR sees only their region.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across Oracle, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
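&lt;p&gt;For instance, a masking policy that hides average salary from everyone outside an HR admin role might look like this (a sketch; the function and role names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- UDF returns the real value only for members of the hr_admins role
CREATE FUNCTION protect_salary (val DOUBLE)
RETURNS DOUBLE
RETURN SELECT CASE WHEN is_member(&apos;hr_admins&apos;) THEN val ELSE NULL END;

ALTER VIEW analytics.gold.department_performance
MODIFY COLUMN avg_salary SET MASKING POLICY protect_salary (avg_salary);
&lt;/code&gt;&lt;/pre&gt;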
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector — no Oracle client needed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query Oracle data from their IDE. Ask Copilot &amp;quot;Show me understaffed departments from Oracle HR&amp;quot; and get SQL generated from your semantic layer — without knowing Oracle schema conventions.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;Oracle Database users pay a premium for their database&apos;s reliability and enterprise features. Dremio Cloud lets you extract analytical value from that data without additional Oracle licensing, ETL pipelines, or vendor-specific BI tools.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-oracle-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your Oracle databases alongside your other data sources.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Generate Summaries and Insights with Dremio&apos;s AI_COMPLETE Function</title><link>https://iceberglakehouse.com/posts/2026-03-ai-ai-complete/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-ai-ai-complete/</guid><description>
Every data team has a version of this problem: a table full of raw data that needs human-readable summaries, translations, or narrative descriptions....</description><pubDate>Sun, 01 Mar 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every data team has a version of this problem: a table full of raw data that needs human-readable summaries, translations, or narrative descriptions. Product descriptions that need rewriting for a new market. Customer records that need one-sentence executive summaries. Support interactions that need post-call notes.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;AI_COMPLETE&lt;/code&gt; brings an LLM directly into your SQL query to produce that text. You write a prompt, pass in your data columns, and get generated text back as a &lt;code&gt;VARCHAR&lt;/code&gt;. No Python notebooks, no external APIs, no data exports.&lt;/p&gt;
&lt;p&gt;This tutorial builds a complete product analytics pipeline in a fresh Dremio Cloud account. You&apos;ll create sample product and sales data, build a medallion architecture, and use &lt;code&gt;AI_COMPLETE&lt;/code&gt; to generate product summaries, executive briefings, marketing copy, and translations, all inside SQL.&lt;/p&gt;
&lt;h2&gt;What You&apos;ll Build&lt;/h2&gt;
&lt;p&gt;By the end of this tutorial, you&apos;ll have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A product catalog with 50+ products and 50+ sales records&lt;/li&gt;
&lt;li&gt;Bronze views that standardize raw data&lt;/li&gt;
&lt;li&gt;Silver views that compute sales metrics per product&lt;/li&gt;
&lt;li&gt;Gold views that use &lt;code&gt;AI_COMPLETE&lt;/code&gt; to generate summaries, marketing descriptions, and translated content&lt;/li&gt;
&lt;li&gt;Materialized Iceberg tables that persist generated text for dashboards&lt;/li&gt;
&lt;li&gt;Wiki metadata that enables the AI Agent to answer natural language questions about your enriched catalog&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-complete-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI enabled&lt;/strong&gt; — go to Admin → Project Settings → Preferences → AI section and enable AI features&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Provider configured&lt;/strong&gt; — Dremio provides a hosted LLM by default, or connect your own (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Tables in the built-in Open Catalog use &lt;code&gt;folder.subfolder.table_name&lt;/code&gt; without a catalog prefix. External sources use &lt;code&gt;source_name.schema.table_name&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Understanding AI_COMPLETE&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;AI_COMPLETE&lt;/code&gt; sends a prompt to your configured LLM and returns the generated text as a &lt;code&gt;VARCHAR&lt;/code&gt;. The function signature:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;AI_COMPLETE(
  [model_name VARCHAR,]
  prompt VARCHAR
) → VARCHAR
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;model_name&lt;/strong&gt; (optional) — specify a model like &lt;code&gt;&apos;openai.gpt-4o&apos;&lt;/code&gt;. Format is &lt;code&gt;modelProvider.modelName&lt;/code&gt;. If omitted, Dremio uses your default model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prompt&lt;/strong&gt; — the text instruction for the LLM. Typically you concatenate a task description with column values to give the model both the instruction and the data context.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key difference from &lt;code&gt;AI_CLASSIFY&lt;/code&gt; is that &lt;code&gt;AI_COMPLETE&lt;/code&gt; returns free-text output. There&apos;s no array of allowed values. The LLM generates whatever text the prompt asks for: a summary, a translation, a paragraph, a sentence, or a structured response.&lt;/p&gt;
&lt;p&gt;This flexibility is both the strength and the risk. A well-crafted prompt produces consistent, useful output. A vague prompt produces inconsistent results. Prompt engineering matters here more than with classification.&lt;/p&gt;
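&lt;p&gt;A minimal call concatenates an instruction with a column value (the table path matches the products table created in Step 2 below):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  product_name,
  AI_COMPLETE(&apos;Summarize this product in one sentence: &apos; || description) AS summary
FROM aicompleteexp.catalog_data.products;
&lt;/code&gt;&lt;/pre&gt;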
&lt;h2&gt;Step 1: Create Your Folder Structure&lt;/h2&gt;
&lt;p&gt;Open the &lt;strong&gt;SQL Runner&lt;/strong&gt; from the left sidebar in Dremio Cloud:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE FOLDER IF NOT EXISTS aicompleteexp;
CREATE FOLDER IF NOT EXISTS aicompleteexp.catalog_data;
CREATE FOLDER IF NOT EXISTS aicompleteexp.bronze;
CREATE FOLDER IF NOT EXISTS aicompleteexp.silver;
CREATE FOLDER IF NOT EXISTS aicompleteexp.gold;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 2: Seed Your Sample Data&lt;/h2&gt;
&lt;h3&gt;Products Table&lt;/h3&gt;
&lt;p&gt;This table simulates a SaaS product catalog with technical descriptions, pricing tiers, and categories. These descriptions are the raw material that &lt;code&gt;AI_COMPLETE&lt;/code&gt; will use to generate marketing copy and summaries.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aicompleteexp.catalog_data.products (
  product_id INT,
  product_name VARCHAR,
  category VARCHAR,
  description VARCHAR,
  price_monthly DECIMAL(10,2),
  launch_date DATE,
  target_audience VARCHAR
);

INSERT INTO aicompleteexp.catalog_data.products VALUES
(1, &apos;CloudSync Pro&apos;, &apos;Storage&apos;, &apos;Enterprise file synchronization platform supporting real-time sync across Windows Mac and Linux with conflict resolution selective sync and 256-bit AES encryption at rest and in transit&apos;, 29.99, &apos;2023-03-15&apos;, &apos;IT Teams&apos;),
(2, &apos;DataVault Enterprise&apos;, &apos;Security&apos;, &apos;Zero-knowledge encrypted cloud storage with SOC 2 Type II certification automated backup deduplication granular access controls and 99.99% uptime SLA for regulated industries&apos;, 89.99, &apos;2022-11-01&apos;, &apos;Compliance Officers&apos;),
(3, &apos;QuickReport&apos;, &apos;Analytics&apos;, &apos;Self-service business intelligence tool with drag-and-drop report builder 50+ chart types scheduled email delivery PDF export and REST API for automated report generation&apos;, 49.99, &apos;2024-01-20&apos;, &apos;Business Analysts&apos;),
(4, &apos;DevPipeline&apos;, &apos;DevOps&apos;, &apos;CI/CD platform with parallel build execution Docker and Kubernetes native deployment auto-scaling runners built-in secret management and integration with GitHub GitLab and Bitbucket&apos;, 79.99, &apos;2023-07-10&apos;, &apos;Engineering Teams&apos;),
(5, &apos;MailForge&apos;, &apos;Marketing&apos;, &apos;Email marketing automation platform with AI-powered subject line optimization A/B testing dynamic content personalization and real-time deliverability monitoring across 50+ ISPs&apos;, 39.99, &apos;2024-05-01&apos;, &apos;Marketing Teams&apos;),
(6, &apos;HelpDesk360&apos;, &apos;Support&apos;, &apos;Omnichannel customer support platform supporting email chat phone and social media with SLA tracking auto-routing knowledge base integration and customer satisfaction scoring&apos;, 59.99, &apos;2023-09-15&apos;, &apos;Support Managers&apos;),
(7, &apos;FormBuilder&apos;, &apos;Productivity&apos;, &apos;No-code form and survey creation tool with conditional logic payment collection 200+ templates analytics dashboard and WCAG 2.1 AA accessibility compliance&apos;, 19.99, &apos;2024-02-28&apos;, &apos;Operations Teams&apos;),
(8, &apos;APIGateway Pro&apos;, &apos;Infrastructure&apos;, &apos;API management platform with rate limiting OAuth 2.0 authentication request transformation caching analytics dashboard and support for REST GraphQL and gRPC protocols&apos;, 99.99, &apos;2023-01-05&apos;, &apos;Platform Engineers&apos;),
(9, &apos;InventoryTrack&apos;, &apos;Commerce&apos;, &apos;Multi-warehouse inventory management system with barcode scanning lot tracking reorder alerts multi-currency support and integration with Shopify WooCommerce and Amazon&apos;, 44.99, &apos;2024-04-10&apos;, &apos;E-commerce Managers&apos;),
(10, &apos;TeamBoard&apos;, &apos;Collaboration&apos;, &apos;Visual project management platform with Kanban Gantt and timeline views time tracking resource allocation dependencies and Slack Microsoft Teams integration&apos;, 24.99, &apos;2023-06-20&apos;, &apos;Project Managers&apos;),
(11, &apos;SecureSign&apos;, &apos;Legal&apos;, &apos;Electronic signature platform with legally binding signatures audit trails multi-party signing workflows template library and compliance with eIDAS UETA and ESIGN regulations&apos;, 34.99, &apos;2024-03-01&apos;, &apos;Legal Teams&apos;),
(12, &apos;DataStream&apos;, &apos;Data&apos;, &apos;Real-time data pipeline platform supporting Kafka Pulsar and Kinesis with schema registry exactly-once processing dead letter queues and built-in data quality checks&apos;, 149.99, &apos;2023-04-18&apos;, &apos;Data Engineers&apos;),
(13, &apos;AdOptimizer&apos;, &apos;Marketing&apos;, &apos;Cross-channel advertising platform with automated bid management audience segmentation attribution modeling creative testing and budget pacing across Google Facebook and LinkedIn&apos;, 199.99, &apos;2024-06-15&apos;, &apos;Performance Marketers&apos;),
(14, &apos;ContractManager&apos;, &apos;Legal&apos;, &apos;Contract lifecycle management platform with AI-assisted clause extraction version tracking approval workflows obligation monitoring and integration with Salesforce and HubSpot&apos;, 69.99, &apos;2023-08-22&apos;, &apos;Legal Operations&apos;),
(15, &apos;LogInsight&apos;, &apos;Infrastructure&apos;, &apos;Log aggregation and analysis platform with full-text search pattern detection anomaly alerts custom dashboards and retention policies supporting up to 10TB daily ingestion&apos;, 119.99, &apos;2023-02-14&apos;, &apos;SRE Teams&apos;),
(16, &apos;PayFlow&apos;, &apos;Finance&apos;, &apos;Payment processing platform with support for 135 currencies PCI DSS Level 1 compliance recurring billing invoice generation and fraud detection using ML models&apos;, 0.00, &apos;2024-01-10&apos;, &apos;Finance Teams&apos;),
(17, &apos;ChatAssist&apos;, &apos;Support&apos;, &apos;AI-powered chatbot platform with natural language understanding intent classification handoff to human agents conversation analytics and multi-language support for 40+ languages&apos;, 74.99, &apos;2024-07-01&apos;, &apos;Customer Experience&apos;),
(18, &apos;SchedulePro&apos;, &apos;HR&apos;, &apos;Employee scheduling platform with shift management availability tracking overtime calculation labor cost forecasting and integration with ADP Workday and BambooHR payroll systems&apos;, 14.99, &apos;2023-11-05&apos;, &apos;HR Managers&apos;),
(19, &apos;CloudBackup&apos;, &apos;Storage&apos;, &apos;Automated cloud backup solution with incremental backups point-in-time recovery cross-region replication ransomware protection and support for AWS Azure and GCP workloads&apos;, 54.99, &apos;2023-05-30&apos;, &apos;IT Administrators&apos;),
(20, &apos;DesignHub&apos;, &apos;Productivity&apos;, &apos;Collaborative design platform with real-time co-editing version history component libraries handoff-to-dev specs and integration with Figma Sketch and Adobe XD import&apos;, 29.99, &apos;2024-08-10&apos;, &apos;Design Teams&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Sales Data Table&lt;/h3&gt;
&lt;p&gt;This table tracks monthly sales performance for each product, giving us the raw numbers that &lt;code&gt;AI_COMPLETE&lt;/code&gt; will summarize into narrative insights.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aicompleteexp.catalog_data.sales_data (
  sale_id INT,
  product_id INT,
  month_year VARCHAR,
  units_sold INT,
  revenue DECIMAL(12,2),
  new_customers INT,
  churned_customers INT,
  region VARCHAR
);

INSERT INTO aicompleteexp.catalog_data.sales_data VALUES
(1, 1, &apos;2025-07&apos;, 342, 10256.58, 89, 12, &apos;North America&apos;),
(2, 1, &apos;2025-08&apos;, 389, 11666.11, 102, 8, &apos;North America&apos;),
(3, 1, &apos;2025-09&apos;, 415, 12446.85, 95, 15, &apos;Europe&apos;),
(4, 2, &apos;2025-07&apos;, 156, 14037.44, 34, 5, &apos;North America&apos;),
(5, 2, &apos;2025-08&apos;, 178, 16017.22, 41, 3, &apos;Europe&apos;),
(6, 2, &apos;2025-09&apos;, 201, 18087.99, 52, 7, &apos;North America&apos;),
(7, 3, &apos;2025-07&apos;, 267, 13346.33, 73, 18, &apos;North America&apos;),
(8, 3, &apos;2025-08&apos;, 234, 11697.66, 58, 22, &apos;Europe&apos;),
(9, 3, &apos;2025-09&apos;, 298, 14894.02, 81, 14, &apos;Asia Pacific&apos;),
(10, 4, &apos;2025-07&apos;, 123, 9837.77, 28, 4, &apos;North America&apos;),
(11, 4, &apos;2025-08&apos;, 145, 11598.55, 35, 6, &apos;Europe&apos;),
(12, 4, &apos;2025-09&apos;, 167, 13358.33, 42, 3, &apos;North America&apos;),
(13, 5, &apos;2025-07&apos;, 445, 17795.55, 112, 25, &apos;North America&apos;),
(14, 5, &apos;2025-08&apos;, 478, 19115.22, 98, 30, &apos;Europe&apos;),
(15, 5, &apos;2025-09&apos;, 512, 20475.88, 134, 19, &apos;Asia Pacific&apos;),
(16, 6, &apos;2025-07&apos;, 198, 11877.02, 45, 11, &apos;North America&apos;),
(17, 6, &apos;2025-08&apos;, 212, 12717.88, 52, 8, &apos;Europe&apos;),
(18, 6, &apos;2025-09&apos;, 189, 11331.11, 38, 16, &apos;North America&apos;),
(19, 7, &apos;2025-07&apos;, 567, 11334.33, 145, 32, &apos;North America&apos;),
(20, 7, &apos;2025-08&apos;, 612, 12234.88, 160, 28, &apos;Europe&apos;),
(21, 7, &apos;2025-09&apos;, 589, 11774.11, 138, 35, &apos;Asia Pacific&apos;),
(22, 8, &apos;2025-07&apos;, 89, 8899.11, 15, 2, &apos;North America&apos;),
(23, 8, &apos;2025-08&apos;, 95, 9499.05, 18, 1, &apos;Europe&apos;),
(24, 8, &apos;2025-09&apos;, 102, 10198.98, 22, 3, &apos;North America&apos;),
(25, 9, &apos;2025-07&apos;, 312, 14035.88, 78, 14, &apos;North America&apos;),
(26, 9, &apos;2025-08&apos;, 287, 12911.13, 65, 19, &apos;Europe&apos;),
(27, 9, &apos;2025-09&apos;, 345, 15520.55, 92, 11, &apos;Asia Pacific&apos;),
(28, 10, &apos;2025-07&apos;, 234, 5847.66, 67, 20, &apos;North America&apos;),
(29, 10, &apos;2025-08&apos;, 256, 6397.44, 72, 15, &apos;Europe&apos;),
(30, 10, &apos;2025-09&apos;, 278, 6946.22, 80, 18, &apos;Asia Pacific&apos;),
(31, 11, &apos;2025-07&apos;, 189, 6613.11, 48, 9, &apos;North America&apos;),
(32, 11, &apos;2025-08&apos;, 201, 7032.99, 55, 7, &apos;Europe&apos;),
(33, 11, &apos;2025-09&apos;, 223, 7802.77, 62, 5, &apos;North America&apos;),
(34, 12, &apos;2025-07&apos;, 67, 10049.33, 12, 1, &apos;North America&apos;),
(35, 12, &apos;2025-08&apos;, 78, 11699.22, 16, 2, &apos;Europe&apos;),
(36, 12, &apos;2025-09&apos;, 82, 12299.18, 19, 1, &apos;North America&apos;),
(37, 13, &apos;2025-07&apos;, 134, 26793.66, 28, 6, &apos;North America&apos;),
(38, 13, &apos;2025-08&apos;, 145, 28993.55, 32, 4, &apos;Europe&apos;),
(39, 13, &apos;2025-09&apos;, 167, 33393.33, 41, 8, &apos;North America&apos;),
(40, 14, &apos;2025-07&apos;, 112, 7838.88, 25, 5, &apos;North America&apos;),
(41, 14, &apos;2025-08&apos;, 128, 8959.72, 30, 3, &apos;Europe&apos;),
(42, 14, &apos;2025-09&apos;, 145, 10149.55, 38, 4, &apos;North America&apos;),
(43, 15, &apos;2025-07&apos;, 56, 6719.44, 10, 2, &apos;North America&apos;),
(44, 15, &apos;2025-08&apos;, 62, 7439.38, 13, 1, &apos;Europe&apos;),
(45, 15, &apos;2025-09&apos;, 71, 8519.29, 17, 2, &apos;North America&apos;),
(46, 16, &apos;2025-07&apos;, 890, 0.00, 234, 45, &apos;North America&apos;),
(47, 16, &apos;2025-08&apos;, 1023, 0.00, 267, 38, &apos;Europe&apos;),
(48, 16, &apos;2025-09&apos;, 1156, 0.00, 301, 52, &apos;Asia Pacific&apos;),
(49, 17, &apos;2025-07&apos;, 145, 10873.55, 38, 8, &apos;North America&apos;),
(50, 17, &apos;2025-08&apos;, 167, 12522.33, 45, 6, &apos;Europe&apos;),
(51, 17, &apos;2025-09&apos;, 189, 14173.11, 52, 10, &apos;Asia Pacific&apos;),
(52, 18, &apos;2025-07&apos;, 423, 6341.77, 110, 28, &apos;North America&apos;),
(53, 18, &apos;2025-08&apos;, 456, 6836.44, 118, 22, &apos;Europe&apos;),
(54, 18, &apos;2025-09&apos;, 489, 7331.11, 125, 30, &apos;Asia Pacific&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 3: Build Bronze Views&lt;/h2&gt;
&lt;p&gt;Bronze views standardize column names and cast dates to timestamps. No business logic at this layer.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.bronze.v_products AS
SELECT
  product_id,
  product_name,
  category,
  description,
  price_monthly,
  CAST(launch_date AS TIMESTAMP) AS launch_timestamp,
  target_audience
FROM aicompleteexp.catalog_data.products;

CREATE OR REPLACE VIEW aicompleteexp.bronze.v_sales AS
SELECT
  sale_id,
  product_id,
  month_year,
  units_sold,
  revenue,
  new_customers,
  churned_customers,
  region
FROM aicompleteexp.catalog_data.sales_data;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 4: Build Silver Views&lt;/h2&gt;
&lt;p&gt;This Silver view aggregates sales performance per product across all months, giving us total revenue, total units, average deal size, net customer growth, and growth rate. The &lt;code&gt;AI_COMPLETE&lt;/code&gt; function will use these metrics to generate narrative summaries.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.silver.v_product_performance AS
SELECT
  p.product_id,
  p.product_name,
  p.category,
  p.description,
  p.price_monthly,
  p.target_audience,
  COALESCE(SUM(s.units_sold), 0) AS total_units,
  COALESCE(SUM(s.revenue), 0) AS total_revenue,
  COALESCE(SUM(s.new_customers), 0) AS total_new_customers,
  COALESCE(SUM(s.churned_customers), 0) AS total_churned,
  COALESCE(SUM(s.new_customers), 0) - COALESCE(SUM(s.churned_customers), 0) AS net_customer_growth,
  CASE
    WHEN COALESCE(SUM(s.units_sold), 0) &amp;gt; 0
    THEN ROUND(COALESCE(SUM(s.revenue), 0) / SUM(s.units_sold), 2)
    ELSE 0
  END AS avg_revenue_per_unit,
  COUNT(DISTINCT s.region) AS regions_active
FROM aicompleteexp.bronze.v_products p
LEFT JOIN aicompleteexp.bronze.v_sales s ON p.product_id = s.product_id
GROUP BY p.product_id, p.product_name, p.category, p.description, p.price_monthly, p.target_audience;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 5: Build Gold Views with AI_COMPLETE&lt;/h2&gt;
&lt;h3&gt;Gold View 1: Executive Product Summaries&lt;/h3&gt;
&lt;p&gt;This view generates a one-sentence executive summary for each product based on its sales performance. Product managers use these summaries in weekly reports without manually writing them.&lt;/p&gt;
&lt;p&gt;The prompt includes specific data points (revenue, units, customer growth) so the LLM produces factual summaries rather than generic descriptions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.gold.v_product_summaries AS
SELECT
  product_id,
  product_name,
  category,
  total_revenue,
  total_units,
  net_customer_growth,
  AI_COMPLETE(
    &apos;Write a single-sentence executive summary for this product. Be specific with numbers. Product: &apos;
    || product_name
    || &apos;. Category: &apos; || category
    || &apos;. Total revenue: $&apos; || CAST(total_revenue AS VARCHAR)
    || &apos;. Units sold: &apos; || CAST(total_units AS VARCHAR)
    || &apos;. Net customer growth: &apos; || CAST(net_customer_growth AS VARCHAR)
    || &apos;. Target audience: &apos; || target_audience
  ) AS executive_summary
FROM aicompleteexp.silver.v_product_performance;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Gold View 2: Marketing Description Generator&lt;/h3&gt;
&lt;p&gt;This view transforms technical product descriptions into customer-facing marketing copy. The LLM rewrites the description in a style that emphasizes benefits rather than features, suitable for a product landing page.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.gold.v_marketing_copy AS
SELECT
  product_id,
  product_name,
  category,
  description AS technical_description,
  price_monthly,
  target_audience,
  AI_COMPLETE(
    &apos;Rewrite this technical product description as a compelling 2-3 sentence marketing paragraph for a product landing page. Focus on benefits not features. Avoid buzzwords like transformative or revolutionary. Product: &apos;
    || product_name
    || &apos;. Technical description: &apos; || description
    || &apos;. Price: $&apos; || CAST(price_monthly AS VARCHAR) || &apos;/month&apos;
    || &apos;. Target audience: &apos; || target_audience
  ) AS marketing_description
FROM aicompleteexp.bronze.v_products;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Gold View 3: Translated Descriptions&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;AI_COMPLETE&lt;/code&gt; handles translation by including the target language in the prompt. This view generates Spanish translations of product descriptions for a localization initiative.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aicompleteexp.gold.v_spanish_catalog AS
SELECT
  product_id,
  product_name,
  description AS english_description,
  AI_COMPLETE(
    &apos;Translate this product description to Spanish. Return only the Spanish text, no explanations: &apos; || description
  ) AS spanish_description
FROM aicompleteexp.bronze.v_products;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Choosing the Right Model&lt;/h3&gt;
&lt;p&gt;For summarization tasks, you can specify a model optimized for speed or quality:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Use a specific high-quality model for executive summaries
SELECT
  product_name,
  AI_COMPLETE(
    &apos;openai.gpt-4o&apos;,
    &apos;Write a brief executive summary: Product &apos; || product_name
    || &apos; generated $&apos; || CAST(total_revenue AS VARCHAR) || &apos; in revenue&apos;
  ) AS summary
FROM aicompleteexp.silver.v_product_performance;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Prompt Engineering Patterns&lt;/h3&gt;
&lt;p&gt;The quality of &lt;code&gt;AI_COMPLETE&lt;/code&gt; output depends heavily on prompt structure. Here are patterns that produce consistent results:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Be specific about format:&lt;/strong&gt; &amp;quot;Write a single-sentence summary&amp;quot; produces better output than &amp;quot;Summarize this.&amp;quot; Specify the expected length and format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Include constraints:&lt;/strong&gt; &amp;quot;Avoid buzzwords like transformative or revolutionary&amp;quot; steers the LLM away from generic marketing language. &amp;quot;Return only the Spanish text, no explanations&amp;quot; prevents the model from adding unwanted context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Provide data context:&lt;/strong&gt; Concatenate actual numbers into the prompt. &amp;quot;Total revenue: $34,000&amp;quot; gives the LLM facts to work with, reducing hallucination. Never ask the LLM to calculate; provide pre-computed metrics and ask it to narrate them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test with LIMIT:&lt;/strong&gt; Before running &lt;code&gt;AI_COMPLETE&lt;/code&gt; on your full dataset, test with &lt;code&gt;LIMIT 5&lt;/code&gt; to check output quality and token costs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT product_name, AI_COMPLETE(&apos;Summarize in one sentence: &apos; || description)
FROM aicompleteexp.bronze.v_products
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Persisting Results with CTAS&lt;/h2&gt;
&lt;p&gt;LLM calls cost tokens on every execution. For dashboards or reports that display generated summaries, materialize the results:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aicompleteexp.gold.product_summaries_materialized AS
SELECT * FROM aicompleteexp.gold.v_product_summaries;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Refresh this table on a schedule (weekly, after each sales data update) to keep summaries current without running LLM calls on every dashboard load.&lt;/p&gt;
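&lt;p&gt;One way to script that refresh (a sketch, assuming your orchestrator can submit Dremio SQL on a schedule) is to drop and rebuild the table with CTAS:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Scheduled refresh job: rebuild the materialized summaries
DROP TABLE IF EXISTS aicompleteexp.gold.product_summaries_materialized;

CREATE TABLE aicompleteexp.gold.product_summaries_materialized AS
SELECT * FROM aicompleteexp.gold.v_product_summaries;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each rebuild re-runs the &lt;code&gt;AI_COMPLETE&lt;/code&gt; calls once, so dashboards reading the materialized table between refreshes incur zero token cost.&lt;/p&gt;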
&lt;h2&gt;Step 6: Enable AI-Generated Wikis and Tags&lt;/h2&gt;
&lt;p&gt;Add metadata context to your Gold views so the AI Agent can answer questions about your enriched catalog:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Admin&lt;/strong&gt; in the left sidebar, then go to &lt;strong&gt;Project Settings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Preferences&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Scroll to the &lt;strong&gt;AI&lt;/strong&gt; section and enable &lt;strong&gt;Generate Wikis and Labels&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;Catalog&lt;/strong&gt; and navigate to your Gold views under &lt;code&gt;aicompleteexp.gold&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Edit&lt;/strong&gt; button (pencil icon) next to each view.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Details&lt;/strong&gt; tab, click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Repeat for all Gold views.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To enhance the generated wiki, copy it into the AI Agent and ask for improvements. For example: &amp;quot;Add context explaining that the executive_summary column is generated by an LLM using actual revenue and customer data, and that summaries are refreshed weekly after the sales data pipeline runs.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Step 7: Ask Questions with the AI Agent&lt;/h2&gt;
&lt;p&gt;Navigate to the AI Agent and try these prompts:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Which products have the highest net customer growth?&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_product_summaries&lt;/code&gt;, sorts by &lt;code&gt;net_customer_growth&lt;/code&gt;, and returns the top products with their AI-generated summaries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Show me a chart of total revenue by product category&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent groups products by &lt;code&gt;category&lt;/code&gt; in &lt;code&gt;v_product_performance&lt;/code&gt;, sums revenue, and generates a bar chart showing which categories drive the most revenue.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;List all products in the Security category with their marketing descriptions&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent filters &lt;code&gt;v_marketing_copy&lt;/code&gt; for &lt;code&gt;category = &apos;Security&apos;&lt;/code&gt; and returns product names alongside the LLM-generated marketing paragraphs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Create a chart comparing new customers vs churned customers by product&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_product_performance&lt;/code&gt; and creates a grouped bar chart showing customer acquisition and churn side by side, making it easy to spot products with healthy vs. concerning net growth.&lt;/p&gt;
&lt;h2&gt;Why Apache Iceberg Matters&lt;/h2&gt;
&lt;p&gt;Your materialized summary tables are Apache Iceberg tables in the built-in Open Catalog. This means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time travel:&lt;/strong&gt; Compare this week&apos;s AI-generated summaries with last week&apos;s to see how the narrative changed as sales data evolved&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema evolution:&lt;/strong&gt; Add new generated columns (like translations to additional languages) without rewriting existing data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ACID transactions:&lt;/strong&gt; CTAS jobs write atomically; dashboards never see partial results&lt;/li&gt;
&lt;/ul&gt;
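&lt;p&gt;For example, time travel uses &lt;code&gt;AT&lt;/code&gt; syntax on Iceberg tables (a sketch; the timestamp shown is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Read last week&apos;s snapshot of the AI-generated summaries
SELECT product_id, executive_summary
FROM aicompleteexp.gold.product_summaries_materialized
AT TIMESTAMP &apos;2026-02-28 00:00:00.000&apos;;
&lt;/code&gt;&lt;/pre&gt;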
&lt;h3&gt;Iceberg vs. Federated for AI_COMPLETE Workloads&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Keep data federated when:&lt;/strong&gt; Your source data updates frequently and you want the latest products or sales figures in real-time queries. Use manual Reflections to cache results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg when:&lt;/strong&gt; You&apos;re generating summaries on historical data or building a curated catalog of marketing copy. CTAS materializes the generated text once, and Iceberg&apos;s automated performance management (compaction, manifest optimization) keeps the table fast as it grows.&lt;/p&gt;
&lt;h2&gt;Next Steps&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Connect your real product database&lt;/strong&gt; — replace simulated tables with federated connections to your actual catalog and CRM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate descriptions in multiple languages&lt;/strong&gt; — create additional Gold views with French, German, or Japanese translations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add FGAC&lt;/strong&gt; — mask revenue numbers in generated summaries for roles that shouldn&apos;t see financial data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build Reflections&lt;/strong&gt; — create Reflections on materialized tables to accelerate dashboard queries at zero LLM cost&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your team spends time manually writing product summaries, translating content, or creating executive briefings from raw data, &lt;code&gt;AI_COMPLETE&lt;/code&gt; automates that work inside the same SQL engine where your data already lives. Write a prompt, run a query, and get your generated text in the same governed platform where everything else runs.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-complete-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start generating insights with SQL.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect MySQL to Dremio Cloud: Federated Analytics Without ETL</title><link>https://iceberglakehouse.com/posts/2026-03-connector-mysql/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-mysql/</guid><description>
MySQL runs more web applications, SaaS platforms, and e-commerce backends than any other database. It&apos;s fast for transactional reads and writes, but ...</description><pubDate>Sun, 01 Mar 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;MySQL runs more web applications, SaaS platforms, and e-commerce backends than any other database. It&apos;s fast for transactional reads and writes, but it becomes a bottleneck when your data team needs to run analytical queries, join MySQL data with other sources, or build dashboards that don&apos;t compete with application traffic.&lt;/p&gt;
&lt;p&gt;Dremio Cloud connects directly to MySQL and queries it in place. Your data stays where it is. Dremio pushes filters down to MySQL (a technique called predicate pushdown) when possible, joins MySQL data with any other connected source, and accelerates repeated queries with pre-computed Reflections so your production database isn&apos;t hit by every dashboard refresh.&lt;/p&gt;
&lt;p&gt;This guide covers everything from prerequisites to federated queries across MySQL and your other data sources.&lt;/p&gt;
&lt;h2&gt;Why MySQL Users Need Dremio&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Analytics compete with application traffic.&lt;/strong&gt; MySQL was built for OLTP (Online Transaction Processing) — fast inserts, updates, and single-row lookups. Analytical queries that scan millions of rows, compute aggregations, or join large tables create lock contention and slow down application responses. Dremio&apos;s Reflections solve this: after the first query, analytical workloads hit Dremio&apos;s pre-computed cache instead of MySQL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data is siloed.&lt;/strong&gt; Your order data is in MySQL, customer engagement data is in MongoDB, and marketing attribution data is in S3. Joining these requires building ETL pipelines that extract, transform, and load data into a central warehouse. Dremio eliminates this by querying each source in place and joining the results in its query engine. One SQL query, multiple sources.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Read replicas are expensive and complex.&lt;/strong&gt; The common workaround for MySQL analytics is creating a read replica. This adds infrastructure cost, replication lag, and operational complexity. Dremio&apos;s Reflections provide the same benefit (offloading analytical reads) without a separate database instance. The query optimizer transparently serves results from Reflections when they match.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No built-in semantic layer.&lt;/strong&gt; MySQL tables have raw column names and no business context. Dremio lets you create views with business logic (like defining what &amp;quot;active customer&amp;quot; means), attach wiki descriptions and labels to those views, and then let the AI Agent answer questions in plain English based on that context.&lt;/p&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;Before connecting MySQL to Dremio Cloud, confirm you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MySQL hostname or IP address&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; — MySQL defaults to &lt;code&gt;3306&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; — a MySQL user with &lt;code&gt;SELECT&lt;/code&gt; privileges on the tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network access&lt;/strong&gt; — port &lt;code&gt;3306&lt;/code&gt; must be reachable from Dremio Cloud. Open the port in your AWS Security Group, Azure NSG, or firewall configuration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-mysql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Step-by-Step: Connect MySQL to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add the MySQL Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar, then select &lt;strong&gt;MySQL&lt;/strong&gt; under database sources. Alternatively, go to &lt;strong&gt;Databases&lt;/strong&gt; and click &lt;strong&gt;Add database&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; A descriptive identifier (e.g., &lt;code&gt;ecommerce-mysql&lt;/code&gt;). This name appears in SQL queries as the source prefix. Cannot include &lt;code&gt;/&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, or &lt;code&gt;]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Your MySQL server&apos;s hostname (e.g., &lt;code&gt;my-rds-instance.abc123.us-east-1.rds.amazonaws.com&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Default &lt;code&gt;3306&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database (optional):&lt;/strong&gt; Specify a single database to connect to, or leave blank to access all databases the user can see.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No Authentication:&lt;/strong&gt; For development instances with no password requirement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Master Credentials:&lt;/strong&gt; Enter the MySQL username and password with &lt;code&gt;SELECT&lt;/code&gt; permissions on your tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Configure Advanced Options&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Net write timeout (in seconds)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How long to wait for data from MySQL before dropping the connection.&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rows per batch. Set to 0 for automatic.&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Idle connection pool size.&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection idle time (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before idle connections close.&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query timeout (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum query execution time before cancellation.&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom JDBC connection key-value pairs.&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;5. Set Reflection and Metadata Refresh&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflection Refresh:&lt;/strong&gt; Controls how often Dremio re-queries MySQL to update pre-computed Reflections. For frequently changing data, set to 1-4 hours. For stable data, daily or weekly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata Refresh:&lt;/strong&gt; Controls how often Dremio checks for new tables or schema changes. Default is 1 hour for both discovery and details.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally restrict which Dremio users can access this source. Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Query MySQL Data in Dremio&lt;/h2&gt;
&lt;p&gt;Once connected, browse your MySQL schemas and tables in the SQL Runner:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT order_id, customer_id, total_amount, order_date, status
FROM &amp;quot;ecommerce-mysql&amp;quot;.shop.orders
WHERE order_date &amp;gt;= &apos;2024-06-01&apos;
  AND status = &apos;completed&apos;
ORDER BY total_amount DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the date filter, status filter, and sort to MySQL — only the matching rows are transferred.&lt;/p&gt;
&lt;h2&gt;Federate MySQL with Other Sources&lt;/h2&gt;
&lt;p&gt;Join MySQL order data with S3 clickstream data and PostgreSQL customer profiles in a single query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  c.customer_name,
  c.region,
  COUNT(o.order_id) AS total_orders,
  SUM(o.total_amount) AS total_revenue,
  COUNT(DISTINCT e.session_id) AS web_sessions
FROM &amp;quot;postgres-crm&amp;quot;.public.customers c
LEFT JOIN &amp;quot;ecommerce-mysql&amp;quot;.shop.orders o
  ON c.customer_id = o.customer_id
LEFT JOIN &amp;quot;s3-analytics&amp;quot;.clickstream.sessions e
  ON c.customer_id = e.user_id
GROUP BY c.customer_name, c.region
ORDER BY total_revenue DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No ETL. No data warehouse loading. Three sources, one query.&lt;/p&gt;
&lt;h2&gt;Build Views and Enable the AI Agent&lt;/h2&gt;
&lt;p&gt;Create business-friendly views over MySQL data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.order_summary AS
SELECT
  o.order_id,
  o.customer_id,
  o.total_amount,
  CAST(o.order_date AS TIMESTAMP) AS order_timestamp,
  o.status AS order_status,
  CASE
    WHEN o.total_amount &amp;gt; 500 THEN &apos;High Value&apos;
    WHEN o.total_amount &amp;gt; 100 THEN &apos;Medium Value&apos;
    ELSE &apos;Standard&apos;
  END AS order_tier
FROM &amp;quot;ecommerce-mysql&amp;quot;.shop.orders o
WHERE o.status IN (&apos;completed&apos;, &apos;shipped&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then navigate to the &lt;strong&gt;Catalog&lt;/strong&gt;, click &lt;strong&gt;Edit&lt;/strong&gt; (pencil icon) on the view, go to the &lt;strong&gt;Details&lt;/strong&gt; tab, and click &lt;strong&gt;Generate Wiki&lt;/strong&gt; and &lt;strong&gt;Generate Tags&lt;/strong&gt;. This gives Dremio&apos;s AI Agent the context it needs to answer questions like &amp;quot;How many high-value orders shipped last month?&amp;quot;&lt;/p&gt;
&lt;h2&gt;Predicate Pushdown: What Runs on MySQL&lt;/h2&gt;
&lt;p&gt;Dremio pushes a wide range of operations directly to MySQL, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Logical:&lt;/strong&gt; AND, OR, NOT&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Comparisons:&lt;/strong&gt; =, !=, &amp;lt;, &amp;gt;, &amp;lt;=, &amp;gt;=, LIKE, NOT LIKE, IS NULL, IS NOT NULL&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregations:&lt;/strong&gt; SUM, AVG, COUNT, MIN, MAX, STDDEV, VAR_POP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Math:&lt;/strong&gt; ABS, CEIL, FLOOR, ROUND, MOD, SQRT, POWER, LOG, EXP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String:&lt;/strong&gt; CONCAT, SUBSTR, LENGTH, LOWER, UPPER, TRIM, REPLACE, REVERSE&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Date/Time:&lt;/strong&gt; DATE_ADD, DATE_SUB, DATE_TRUNC, EXTRACT, TIMESTAMPADD, TIMESTAMPDIFF&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This minimizes data transfer between MySQL and Dremio: only the rows that survive the pushed-down filters and aggregations cross the network.&lt;/p&gt;
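&lt;p&gt;Functions outside this list are evaluated by Dremio instead. As an illustrative sketch, a regular-expression filter may not be pushed down, in which case MySQL returns every row matching the pushable date predicate and Dremio applies the regex itself:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- The date filter is pushed to MySQL; the regex likely runs in Dremio
SELECT order_id, status
FROM &amp;quot;ecommerce-mysql&amp;quot;.shop.orders
WHERE order_date &amp;gt;= &apos;2024-06-01&apos;
  AND REGEXP_LIKE(status, &apos;^(completed|shipped)$&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check the query profile to confirm what was pushed down before relying on a pattern like this against a large table.&lt;/p&gt;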
&lt;h2&gt;Data Type Mapping&lt;/h2&gt;
&lt;p&gt;Key MySQL-to-Dremio type conversions:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;MySQL&lt;/th&gt;
&lt;th&gt;Dremio&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;INT / INTEGER&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;UNSIGNED converts to BIGINT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLOAT&lt;/td&gt;
&lt;td&gt;FLOAT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DOUBLE / REAL&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DECIMAL&lt;/td&gt;
&lt;td&gt;DECIMAL&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VARCHAR / TEXT / CHAR&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;ENUM and SET also map to VARCHAR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DATETIME / TIMESTAMP&lt;/td&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;TIME&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BLOB / BINARY / VARBINARY&lt;/td&gt;
&lt;td&gt;VARBINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BIT&lt;/td&gt;
&lt;td&gt;BOOLEAN&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TINYINT / SMALLINT / MEDIUMINT&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YEAR&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;MySQL-specific types like &lt;code&gt;JSON&lt;/code&gt; or &lt;code&gt;GEOMETRY&lt;/code&gt; are not supported through the connector.&lt;/p&gt;
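&lt;p&gt;A common workaround (a sketch, assuming you can create views in MySQL; the &lt;code&gt;metadata&lt;/code&gt; column name is illustrative) is to expose a MySQL-side view that casts unsupported columns to a supported type, then point Dremio at the view:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Run this in MySQL, not Dremio: expose a JSON column as text
CREATE OR REPLACE VIEW shop.orders_flat AS
SELECT
  order_id,
  customer_id,
  CAST(metadata AS CHAR) AS metadata_text  -- arrives in Dremio as VARCHAR
FROM shop.orders;
&lt;/code&gt;&lt;/pre&gt;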
&lt;h2&gt;MySQL vs. Iceberg: When to Migrate&lt;/h2&gt;
&lt;p&gt;Keep data in MySQL when it&apos;s actively written and read by your application. Migrate historical or analytical datasets to Apache Iceberg tables in Dremio&apos;s Open Catalog when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The data doesn&apos;t change often (closed orders, historical logs)&lt;/li&gt;
&lt;li&gt;You need time travel (query the table as of any past timestamp)&lt;/li&gt;
&lt;li&gt;You want automated performance management (compaction, manifest optimization)&lt;/li&gt;
&lt;li&gt;You want Autonomous Reflections (Dremio auto-creates materializations based on query patterns)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For data that&apos;s still being written by your app, query it through the MySQL connector and create manual Reflections with a refresh schedule that matches your freshness needs.&lt;/p&gt;
&lt;h2&gt;AI-Powered Analytics on MySQL Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The AI Agent lets business users ask questions about MySQL data in plain English. A marketing manager asks &amp;quot;How many high-value orders shipped last month?&amp;quot; and the Agent generates the correct SQL by reading your view&apos;s wiki descriptions. It understands &amp;quot;high-value&amp;quot; means &lt;code&gt;total_amount &amp;gt; 500&lt;/code&gt; and &amp;quot;shipped&amp;quot; means &lt;code&gt;status = &apos;shipped&apos;&lt;/code&gt; because you defined those in the semantic layer.&lt;/p&gt;
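&lt;p&gt;Those definitions live in ordinary views and wiki text. As a sketch (the &lt;code&gt;high_value_orders&lt;/code&gt; view name is hypothetical), encoding the rule once in the semantic layer makes it available to the Agent and every other consumer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hypothetical view that encodes the &amp;quot;high-value&amp;quot; definition once
CREATE OR REPLACE VIEW analytics.gold.high_value_orders AS
SELECT *
FROM analytics.gold.order_summary
WHERE total_amount &amp;gt; 500;
&lt;/code&gt;&lt;/pre&gt;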
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects external AI chat clients (Claude, ChatGPT) to your MySQL data through Dremio with OAuth authentication:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt; for Claude)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;An e-commerce manager asks Claude &amp;quot;What&apos;s our average order value by region this quarter from MySQL?&amp;quot; and gets governed, accurate results — no SQL required.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against MySQL data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify orders by likely customer intent
SELECT
  order_id,
  total_amount,
  order_tier,
  AI_CLASSIFY(
    &apos;Based on this order, classify the likely purchase motivation&apos;,
    &apos;Amount: $&apos; || CAST(total_amount AS VARCHAR) || &apos;, Status: &apos; || order_status || &apos;, Tier: &apos; || order_tier,
    ARRAY[&apos;Impulse Buy&apos;, &apos;Planned Purchase&apos;, &apos;Bulk Order&apos;, &apos;Reorder&apos;]
  ) AS purchase_motivation
FROM analytics.gold.order_summary
WHERE order_status = &apos;completed&apos;;

-- Generate order analysis summaries
SELECT
  DATE_TRUNC(&apos;week&apos;, order_timestamp) AS week,
  COUNT(*) AS orders,
  SUM(total_amount) AS revenue,
  AI_GENERATE(
    &apos;Write a one-sentence weekly sales summary&apos;,
    &apos;Orders: &apos; || CAST(COUNT(*) AS VARCHAR) || &apos;, Revenue: $&apos; || CAST(SUM(total_amount) AS VARCHAR) || &apos;, High Value Orders: &apos; || CAST(SUM(CASE WHEN order_tier = &apos;High Value&apos; THEN 1 ELSE 0 END) AS VARCHAR)
  ) AS weekly_summary
FROM analytics.gold.order_summary
GROUP BY DATE_TRUNC(&apos;week&apos;, order_timestamp)
ORDER BY week DESC
LIMIT 12;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; runs LLM inference inline, categorizing each order. &lt;code&gt;AI_GENERATE&lt;/code&gt; produces narrative summaries. Both enrich MySQL data with AI in real time.&lt;/p&gt;
&lt;h2&gt;Reflections for Performance&lt;/h2&gt;
&lt;p&gt;MySQL is optimized for OLTP — row-level reads and writes. Analytical aggregation queries compete with application workloads. Dremio&apos;s Reflections offload these:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the view in the &lt;strong&gt;Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Reflections&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Raw Reflection&lt;/strong&gt; or &lt;strong&gt;Aggregation Reflection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Select columns and set the &lt;strong&gt;Refresh Interval&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;BI tools get sub-second responses from Reflections. MySQL focuses on serving your application.&lt;/p&gt;
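&lt;p&gt;Reflections can also be defined in SQL instead of the UI. A sketch, assuming a hypothetical view name (exact syntax can vary by Dremio version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Sketch: Aggregation Reflection via SQL; view and column names are hypothetical
ALTER TABLE mysql_source.shop.order_summary
CREATE AGGREGATE REFLECTION agg_orders
USING
  DIMENSIONS (order_status, region)
  MEASURES (total_amount (SUM, COUNT));
&lt;/code&gt;&lt;/pre&gt;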
&lt;h2&gt;Governance on MySQL Data&lt;/h2&gt;
&lt;p&gt;MySQL has database-level grants but no column masking or row-level filtering. Dremio&apos;s Fine-Grained Access Control (FGAC) adds these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column masking:&lt;/strong&gt; Mask customer email, payment details, or pricing from specific roles. A marketing analyst sees order counts but not individual customer data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-level filtering:&lt;/strong&gt; Restrict data by store, region, or department based on user role.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified policies:&lt;/strong&gt; Same governance applies across MySQL, PostgreSQL, S3, and all other sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These policies apply across SQL Runner, BI tools (Arrow Flight/ODBC), AI Agent, and MCP Server.&lt;/p&gt;
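&lt;p&gt;As a sketch of what column masking looks like in SQL (function, role, and table names here are hypothetical; see Dremio&apos;s FGAC documentation for the authoritative syntax):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Sketch: mask emails from everyone outside a support role
CREATE FUNCTION protect_email (email VARCHAR)
RETURNS VARCHAR
RETURN SELECT CASE
  WHEN is_member(&apos;support&apos;) THEN email
  ELSE &apos;REDACTED&apos;
END;

ALTER TABLE mysql_source.shop.customers
MODIFY COLUMN email
SET MASKING POLICY protect_email (email);
&lt;/code&gt;&lt;/pre&gt;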
&lt;h2&gt;Connect BI Tools via Arrow Flight&lt;/h2&gt;
&lt;p&gt;Arrow Flight provides 10-100x faster data transfer than JDBC/ODBC:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau:&lt;/strong&gt; Dremio connector for direct Arrow Flight access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI:&lt;/strong&gt; Dremio ODBC driver or native connector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python/Pandas:&lt;/strong&gt; &lt;code&gt;pyarrow.flight&lt;/code&gt; for programmatic data access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Looker:&lt;/strong&gt; Connect via JDBC&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt:&lt;/strong&gt; &lt;code&gt;dbt-dremio&lt;/code&gt; for SQL-based transformations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All queries benefit from Reflections, governance, and the semantic layer.&lt;/p&gt;
&lt;h2&gt;VS Code Copilot Integration&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s VS Code extension with Copilot integration lets developers query MySQL data from their IDE. Ask Copilot &amp;quot;Show me high-value orders from MySQL this week&amp;quot; and get SQL generated from your semantic layer.&lt;/p&gt;
&lt;h2&gt;When to Keep Data in MySQL vs. Migrate&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Keep in MySQL:&lt;/strong&gt; Transactional data for active applications, data with application-level foreign key constraints, operational data where real-time writes matter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg:&lt;/strong&gt; Historical order archives, reporting data, data consumed by non-application tools, datasets where MySQL replication lag creates analytics latency. Migrated Iceberg tables get automatic compaction, time travel, and Autonomous Reflections.&lt;/p&gt;
&lt;p&gt;For data staying in MySQL, create manual Reflections to offload analytical queries. For migrated Iceberg data, Dremio handles optimization automatically.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;MySQL users don&apos;t need to build ETL pipelines or provision a data warehouse to get analytical value from their data. Dremio Cloud connects to MySQL in minutes and gives you federation, acceleration, governance, and AI analytics on top.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-mysql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your MySQL databases.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Classify Your Data with SQL: A Hands-On Guide to Dremio&apos;s AI_CLASSIFY Function</title><link>https://iceberglakehouse.com/posts/2026-03-ai-ai-classify/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-ai-ai-classify/</guid><description>
Most classification workflows require exporting data to Python, running a model, and importing results back into your warehouse. Dremio&apos;s `AI_CLASSIF...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Most classification workflows require exporting data to Python, running a model, and importing results back into your warehouse. Dremio&apos;s &lt;code&gt;AI_CLASSIFY&lt;/code&gt; function eliminates that entire pipeline. You write a SELECT statement, pass in your text and your categories, and the LLM assigns a label. The classified data stays in your lakehouse, governed and queryable immediately.&lt;/p&gt;
&lt;p&gt;This tutorial walks you through a complete classification pipeline using a fresh Dremio Cloud account. You&apos;ll create sample customer feedback data, build a medallion architecture (Bronze → Silver → Gold), and use &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to categorize reviews by sentiment, support tickets by department, and product issues by urgency, all inside SQL.&lt;/p&gt;
&lt;h2&gt;What You&apos;ll Build&lt;/h2&gt;
&lt;p&gt;By the end of this tutorial, you&apos;ll have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A customer feedback dataset with 25 product reviews and 25 support tickets&lt;/li&gt;
&lt;li&gt;Bronze views that standardize raw data&lt;/li&gt;
&lt;li&gt;Silver views that join reviews with ticket information&lt;/li&gt;
&lt;li&gt;Gold views that use &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to add sentiment labels, department routing, and urgency tiers&lt;/li&gt;
&lt;li&gt;An Iceberg table that persists your classified data for dashboards&lt;/li&gt;
&lt;li&gt;Wiki metadata that enables the AI Agent to answer natural language questions about your classified data&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud account&lt;/strong&gt; — &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-classify-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;sign up free for 30 days&lt;/a&gt; with $400 in compute credits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI enabled&lt;/strong&gt; — go to Admin → Project Settings → Preferences → AI section and enable AI features&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model Provider configured&lt;/strong&gt; — Dremio provides a hosted LLM by default, or you can connect your own (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI) under the AI preferences&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Tables in the built-in Open Catalog use &lt;code&gt;folder.subfolder.table_name&lt;/code&gt; without a catalog prefix. External sources use &lt;code&gt;source_name.schema.table_name&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Understanding AI_CLASSIFY&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; sends text to a configured LLM and asks it to pick the best matching label from an array you provide. The function signature:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;AI_CLASSIFY(
  [model_name VARCHAR,]
  prompt VARCHAR,
  categories ARRAY&amp;lt;VARCHAR|INT|FLOAT|BOOLEAN&amp;gt;
) → VARCHAR|INT|FLOAT|BOOLEAN
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Parameters:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;model_name&lt;/strong&gt; (optional) — specify a particular model like &lt;code&gt;&apos;openai.gpt-4o&apos;&lt;/code&gt;. Format is &lt;code&gt;modelProvider.modelName&lt;/code&gt;. If omitted, Dremio uses your default configured model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;prompt&lt;/strong&gt; — the text you want classified. This is typically a column value or a concatenation of columns that gives the LLM enough context.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;categories&lt;/strong&gt; — an &lt;code&gt;ARRAY&lt;/code&gt; of possible labels. The LLM must return one of these values. Supports &lt;code&gt;VARCHAR&lt;/code&gt;, &lt;code&gt;INT&lt;/code&gt;, &lt;code&gt;FLOAT&lt;/code&gt;, and &lt;code&gt;BOOLEAN&lt;/code&gt; types.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The return type matches the array element type. If you pass &lt;code&gt;ARRAY[&apos;Positive&apos;, &apos;Negative&apos;, &apos;Neutral&apos;]&lt;/code&gt;, you get a &lt;code&gt;VARCHAR&lt;/code&gt; back. If you pass &lt;code&gt;ARRAY[1, 2, 3, 4, 5]&lt;/code&gt;, you get an &lt;code&gt;INT&lt;/code&gt;.&lt;/p&gt;
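&lt;p&gt;A minimal call makes the shape concrete (the label returned depends on your configured model, so this is illustrative rather than guaranteed output):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT AI_CLASSIFY(
  &apos;Classify the sentiment of this text: The checkout flow was painless and fast&apos;,
  ARRAY[&apos;Positive&apos;, &apos;Negative&apos;, &apos;Neutral&apos;]
) AS sentiment;
-- Returns a VARCHAR that is one of &apos;Positive&apos;, &apos;Negative&apos;, or &apos;Neutral&apos;
&lt;/code&gt;&lt;/pre&gt;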
&lt;h2&gt;Step 1: Create Your Folder Structure&lt;/h2&gt;
&lt;p&gt;Open the &lt;strong&gt;SQL Runner&lt;/strong&gt; from the left sidebar in Dremio Cloud and run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE FOLDER IF NOT EXISTS aiclassifyexp;
CREATE FOLDER IF NOT EXISTS aiclassifyexp.feedback_data;
CREATE FOLDER IF NOT EXISTS aiclassifyexp.bronze;
CREATE FOLDER IF NOT EXISTS aiclassifyexp.silver;
CREATE FOLDER IF NOT EXISTS aiclassifyexp.gold;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a namespace that simulates a customer feedback analytics pipeline with separate layers for raw data, standardized views, business logic, and final outputs.&lt;/p&gt;
&lt;h2&gt;Step 2: Seed Your Sample Data&lt;/h2&gt;
&lt;h3&gt;Customer Reviews Table&lt;/h3&gt;
&lt;p&gt;This table simulates product reviews collected from an e-commerce platform. Each review includes the customer name, product, a star rating, and the actual review text that we&apos;ll classify.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aiclassifyexp.feedback_data.customer_reviews (
  review_id INT,
  customer_name VARCHAR,
  product_name VARCHAR,
  star_rating INT,
  review_text VARCHAR,
  review_date DATE
);

INSERT INTO aiclassifyexp.feedback_data.customer_reviews VALUES
(1, &apos;Sarah Chen&apos;, &apos;CloudSync Pro&apos;, 5, &apos;Absolutely love this product. Setup took 5 minutes and sync speeds are incredible. Best purchase this year.&apos;, &apos;2025-08-15&apos;),
(2, &apos;James Rodriguez&apos;, &apos;CloudSync Pro&apos;, 1, &apos;Terrible experience. Lost three days of data after the last update. Support was unhelpful and dismissive.&apos;, &apos;2025-08-22&apos;),
(3, &apos;Emily Watson&apos;, &apos;DataVault Enterprise&apos;, 4, &apos;Solid encryption and good performance. The UI could use some polish but the core functionality is reliable.&apos;, &apos;2025-09-01&apos;),
(4, &apos;Michael Brown&apos;, &apos;CloudSync Pro&apos;, 3, &apos;It works fine most of the time but crashes occasionally when syncing large folders. Average product.&apos;, &apos;2025-09-05&apos;),
(5, &apos;Lisa Park&apos;, &apos;DataVault Enterprise&apos;, 5, &apos;Our security team approved this after a thorough review. Encryption standards exceed our compliance requirements.&apos;, &apos;2025-09-10&apos;),
(6, &apos;David Kim&apos;, &apos;QuickReport&apos;, 2, &apos;The reports look nice but generation takes forever. For the price point there are faster alternatives.&apos;, &apos;2025-09-12&apos;),
(7, &apos;Anna Kowalski&apos;, &apos;QuickReport&apos;, 4, &apos;Great templates and easy export options. Scheduling could be more flexible but overall a good tool.&apos;, &apos;2025-09-18&apos;),
(8, &apos;Robert Taylor&apos;, &apos;CloudSync Pro&apos;, 1, &apos;Second time this month it corrupted my files during sync. Considering switching to a competitor.&apos;, &apos;2025-09-20&apos;),
(9, &apos;Maria Garcia&apos;, &apos;DataVault Enterprise&apos;, 5, &apos;Migrated 50TB without a single issue. The deduplication feature alone saved us $2000/month in storage.&apos;, &apos;2025-09-25&apos;),
(10, &apos;Tom Williams&apos;, &apos;QuickReport&apos;, 3, &apos;Decent for basic reports. Falls short on complex multi-source dashboards. Not bad, not great.&apos;, &apos;2025-10-01&apos;),
(11, &apos;Jennifer Lee&apos;, &apos;CloudSync Pro&apos;, 4, &apos;Fast reliable syncing across all our devices. The mobile app needs improvement though.&apos;, &apos;2025-10-05&apos;),
(12, &apos;Chris Martinez&apos;, &apos;DataVault Enterprise&apos;, 2, &apos;Way too complicated for a small team. We spent two weeks just on initial configuration.&apos;, &apos;2025-10-08&apos;),
(13, &apos;Rachel Adams&apos;, &apos;QuickReport&apos;, 5, &apos;Finally a reporting tool that non-technical people can use. Our marketing team builds their own reports now.&apos;, &apos;2025-10-12&apos;),
(14, &apos;Kevin Thompson&apos;, &apos;CloudSync Pro&apos;, 1, &apos;Billing issue: charged twice and it took three weeks to get a refund. Product aside the billing system is broken.&apos;, &apos;2025-10-15&apos;),
(15, &apos;Sophia Nguyen&apos;, &apos;DataVault Enterprise&apos;, 4, &apos;Strong security features and audit logging. Integration with our SSO provider was straightforward.&apos;, &apos;2025-10-20&apos;),
(16, &apos;Daniel Wilson&apos;, &apos;QuickReport&apos;, 3, &apos;Good for monthly summaries but real-time dashboards lag noticeably. Suitable for batch reporting only.&apos;, &apos;2025-10-22&apos;),
(17, &apos;Amanda Clark&apos;, &apos;CloudSync Pro&apos;, 5, &apos;Our entire team switched from Dropbox. The conflict resolution on shared files is leagues better.&apos;, &apos;2025-10-25&apos;),
(18, &apos;Brian Harris&apos;, &apos;DataVault Enterprise&apos;, 1, &apos;Critical vulnerability found in version 3.2. Support acknowledged it but the patch took 6 weeks.&apos;, &apos;2025-10-28&apos;),
(19, &apos;Michelle Lopez&apos;, &apos;QuickReport&apos;, 4, &apos;Clean interface and the PDF export quality is excellent. API access for automation would be a welcome addition.&apos;, &apos;2025-11-01&apos;),
(20, &apos;Steven Moore&apos;, &apos;CloudSync Pro&apos;, 2, &apos;Sync works but the desktop app uses 800MB of RAM just sitting in the background. Needs optimization.&apos;, &apos;2025-11-05&apos;),
(21, &apos;Laura Jackson&apos;, &apos;DataVault Enterprise&apos;, 5, &apos;Passed our SOC 2 audit partly because of DataVault detailed access logs. Worth every penny.&apos;, &apos;2025-11-08&apos;),
(22, &apos;Andrew White&apos;, &apos;QuickReport&apos;, 2, &apos;Crashed twice during a client presentation. Embarrassing and unacceptable for a paid product.&apos;, &apos;2025-11-10&apos;),
(23, &apos;Catherine Hall&apos;, &apos;CloudSync Pro&apos;, 4, &apos;Selective sync feature is a lifesaver for laptops with small drives. Smart storage management.&apos;, &apos;2025-11-15&apos;),
(24, &apos;Mark Allen&apos;, &apos;DataVault Enterprise&apos;, 3, &apos;Good product hampered by poor documentation. We figured out most features through trial and error.&apos;, &apos;2025-11-18&apos;),
(25, &apos;Jessica Young&apos;, &apos;QuickReport&apos;, 5, &apos;The scheduled email reports feature saved our ops team 10 hours per week. Simple and effective.&apos;, &apos;2025-11-20&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Support Tickets Table&lt;/h3&gt;
&lt;p&gt;This table simulates a customer support system. Each ticket has a description written by the customer, a status, and a priority that was manually assigned. We&apos;ll use &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to automatically route these tickets by department.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aiclassifyexp.feedback_data.support_tickets (
  ticket_id INT,
  customer_name VARCHAR,
  product_name VARCHAR,
  ticket_description VARCHAR,
  manual_priority VARCHAR,
  ticket_status VARCHAR,
  created_date DATE,
  resolved_date DATE
);

INSERT INTO aiclassifyexp.feedback_data.support_tickets VALUES
(1001, &apos;James Rodriguez&apos;, &apos;CloudSync Pro&apos;, &apos;Lost all synced files after update 4.2.1. Need immediate recovery assistance.&apos;, &apos;Critical&apos;, &apos;Resolved&apos;, &apos;2025-08-20&apos;, &apos;2025-08-25&apos;),
(1002, &apos;Kevin Thompson&apos;, &apos;CloudSync Pro&apos;, &apos;Charged $49.99 twice on my credit card for October subscription. Need refund for duplicate charge.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-10-14&apos;, &apos;2025-11-04&apos;),
(1003, &apos;Robert Taylor&apos;, &apos;CloudSync Pro&apos;, &apos;Files corrupted during sync for the second time. Happening with files over 500MB.&apos;, &apos;High&apos;, &apos;Open&apos;, &apos;2025-09-19&apos;, NULL),
(1004, &apos;Chris Martinez&apos;, &apos;DataVault Enterprise&apos;, &apos;Cannot figure out how to configure SSO integration. Documentation references outdated menu options.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-10-07&apos;, &apos;2025-10-10&apos;),
(1005, &apos;Brian Harris&apos;, &apos;DataVault Enterprise&apos;, &apos;Security scan flagged CVE-2025-1234 in version 3.2 encryption module. When will this be patched?&apos;, &apos;Critical&apos;, &apos;Resolved&apos;, &apos;2025-10-27&apos;, &apos;2025-12-08&apos;),
(1006, &apos;Andrew White&apos;, &apos;QuickReport&apos;, &apos;App crashes when rendering charts with more than 10000 data points. Happens consistently in Chrome.&apos;, &apos;High&apos;, &apos;Open&apos;, &apos;2025-11-09&apos;, NULL),
(1007, &apos;Sarah Chen&apos;, &apos;CloudSync Pro&apos;, &apos;Would love to see a Linux desktop client. Currently only Windows and Mac are supported.&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-08-30&apos;, NULL),
(1008, &apos;David Kim&apos;, &apos;QuickReport&apos;, &apos;Report generation takes 45+ seconds for simple 3-page reports. Was faster in the previous version.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-09-13&apos;, NULL),
(1009, &apos;Emily Watson&apos;, &apos;DataVault Enterprise&apos;, &apos;Need to add 50 new users to our plan. What are the volume discount options?&apos;, &apos;Low&apos;, &apos;Resolved&apos;, &apos;2025-09-03&apos;, &apos;2025-09-05&apos;),
(1010, &apos;Steven Moore&apos;, &apos;CloudSync Pro&apos;, &apos;Desktop app consuming excessive memory (800MB+). Running Windows 11 with 16GB RAM.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-11-04&apos;, NULL),
(1011, &apos;Lisa Park&apos;, &apos;DataVault Enterprise&apos;, &apos;Can we get a custom retention policy for healthcare compliance? HIPAA requires 7-year retention.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-09-12&apos;, &apos;2025-09-20&apos;),
(1012, &apos;Tom Williams&apos;, &apos;QuickReport&apos;, &apos;How do I connect QuickReport to a PostgreSQL database? Only seeing MySQL option in connectors.&apos;, &apos;Low&apos;, &apos;Resolved&apos;, &apos;2025-10-02&apos;, &apos;2025-10-03&apos;),
(1013, &apos;Mark Allen&apos;, &apos;DataVault Enterprise&apos;, &apos;API documentation has broken links on the authentication section. Pages return 404.&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-11-17&apos;, NULL),
(1014, &apos;Michael Brown&apos;, &apos;CloudSync Pro&apos;, &apos;Selective sync keeps re-enabling folders I excluded. Happens after every app restart.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-09-06&apos;, NULL),
(1015, &apos;Daniel Wilson&apos;, &apos;QuickReport&apos;, &apos;Real-time dashboard shows data that is 15 minutes stale. Expected near real-time refresh.&apos;, &apos;High&apos;, &apos;Open&apos;, &apos;2025-10-23&apos;, NULL),
(1016, &apos;Anna Kowalski&apos;, &apos;QuickReport&apos;, &apos;Can you add a dark mode option? The white background is hard on the eyes during evening work.&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-09-19&apos;, NULL),
(1017, &apos;Sophia Nguyen&apos;, &apos;DataVault Enterprise&apos;, &apos;Our SSO integration broke after your last update. 200 users locked out for 4 hours.&apos;, &apos;Critical&apos;, &apos;Resolved&apos;, &apos;2025-10-21&apos;, &apos;2025-10-21&apos;),
(1018, &apos;Jennifer Lee&apos;, &apos;CloudSync Pro&apos;, &apos;Mobile app on iOS frequently logs me out. Have to re-authenticate 3-4 times per day.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-10-06&apos;, NULL),
(1019, &apos;Rachel Adams&apos;, &apos;QuickReport&apos;, &apos;Love the product! Any plans for a Slack integration to send report summaries to channels?&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-10-13&apos;, NULL),
(1020, &apos;Amanda Clark&apos;, &apos;CloudSync Pro&apos;, &apos;Conflict resolution dialog is confusing. Hard to tell which version is newer when filenames match.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-10-26&apos;, &apos;2025-10-30&apos;),
(1021, &apos;Catherine Hall&apos;, &apos;CloudSync Pro&apos;, &apos;Bandwidth throttling feature needed. Sync saturates our office internet during business hours.&apos;, &apos;Medium&apos;, &apos;Open&apos;, &apos;2025-11-16&apos;, NULL),
(1022, &apos;Maria Garcia&apos;, &apos;DataVault Enterprise&apos;, &apos;Deduplication incorrectly merged two different client folders. Data was mixed across accounts.&apos;, &apos;Critical&apos;, &apos;Resolved&apos;, &apos;2025-09-26&apos;, &apos;2025-09-28&apos;),
(1023, &apos;Laura Jackson&apos;, &apos;DataVault Enterprise&apos;, &apos;Need export of all access logs for the past 12 months for our annual SOC 2 audit.&apos;, &apos;Medium&apos;, &apos;Resolved&apos;, &apos;2025-11-09&apos;, &apos;2025-11-11&apos;),
(1024, &apos;Jessica Young&apos;, &apos;QuickReport&apos;, &apos;Scheduled reports occasionally skip a week. No error notification when this happens.&apos;, &apos;High&apos;, &apos;Open&apos;, &apos;2025-11-21&apos;, NULL),
(1025, &apos;Michelle Lopez&apos;, &apos;QuickReport&apos;, &apos;Please add an API endpoint for programmatic report generation. We want to automate monthly client reports.&apos;, &apos;Low&apos;, &apos;Open&apos;, &apos;2025-11-02&apos;, NULL);
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 3: Build Bronze Views&lt;/h2&gt;
&lt;p&gt;Bronze views standardize column names and data types without applying business logic. This creates a consistent foundation for downstream analysis.&lt;/p&gt;
&lt;p&gt;The reviews table gets its &lt;code&gt;DATE&lt;/code&gt; column cast to &lt;code&gt;TIMESTAMP&lt;/code&gt; so downstream views work with a consistent temporal type. The tickets table gets the same cast, and we rename &lt;code&gt;manual_priority&lt;/code&gt; to &lt;code&gt;assigned_priority&lt;/code&gt; to distinguish it from AI-generated classifications.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aiclassifyexp.bronze.v_reviews AS
SELECT
  review_id,
  customer_name,
  product_name,
  star_rating,
  review_text,
  CAST(review_date AS TIMESTAMP) AS review_timestamp
FROM aiclassifyexp.feedback_data.customer_reviews;

CREATE OR REPLACE VIEW aiclassifyexp.bronze.v_tickets AS
SELECT
  ticket_id,
  customer_name,
  product_name,
  ticket_description,
  manual_priority AS assigned_priority,
  ticket_status,
  CAST(created_date AS TIMESTAMP) AS created_timestamp,
  CAST(resolved_date AS TIMESTAMP) AS resolved_timestamp
FROM aiclassifyexp.feedback_data.support_tickets;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 4: Build Silver Views&lt;/h2&gt;
&lt;p&gt;This Silver view joins reviews with related support tickets for the same customer and product. This gives us a combined picture: what did the customer say in their review, and did they also file a support ticket? The &lt;code&gt;LEFT JOIN&lt;/code&gt; ensures we keep all reviews even if the customer never opened a ticket.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aiclassifyexp.silver.v_customer_feedback AS
SELECT
  r.review_id,
  r.customer_name,
  r.product_name,
  r.star_rating,
  r.review_text,
  r.review_timestamp,
  t.ticket_id,
  t.ticket_description,
  t.assigned_priority,
  t.ticket_status,
  t.created_timestamp AS ticket_created,
  t.resolved_timestamp AS ticket_resolved,
  CASE WHEN t.ticket_id IS NOT NULL THEN &apos;Yes&apos; ELSE &apos;No&apos; END AS has_support_ticket
FROM aiclassifyexp.bronze.v_reviews r
LEFT JOIN aiclassifyexp.bronze.v_tickets t
  ON r.customer_name = t.customer_name
  AND r.product_name = t.product_name;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Step 5: Build Gold Views with AI_CLASSIFY&lt;/h2&gt;
&lt;p&gt;This is where the AI functions do real work. Each Gold view applies &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to categorize text that would otherwise require manual review or an external ML pipeline.&lt;/p&gt;
&lt;h3&gt;Gold View 1: Sentiment Classification&lt;/h3&gt;
&lt;p&gt;This view classifies every review as Positive, Negative, or Neutral. Instead of relying solely on star ratings (which can be inconsistent with the actual text), the LLM reads the full review and assigns a sentiment label. We concatenate the product name with the review text to give the model full context.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aiclassifyexp.gold.v_review_sentiment AS
SELECT
  review_id,
  customer_name,
  product_name,
  star_rating,
  review_text,
  review_timestamp,
  AI_CLASSIFY(
    &apos;Classify the sentiment of this product review: &apos; || review_text,
    ARRAY[&apos;Positive&apos;, &apos;Negative&apos;, &apos;Neutral&apos;]
  ) AS ai_sentiment,
  has_support_ticket
FROM aiclassifyexp.silver.v_customer_feedback;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that we keep both &lt;code&gt;star_rating&lt;/code&gt; and &lt;code&gt;ai_sentiment&lt;/code&gt;. This lets you compare the two signals. A 3-star review with &amp;quot;Negative&amp;quot; AI sentiment suggests the customer is more frustrated than the rating alone indicates.&lt;/p&gt;
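&lt;p&gt;One way to surface those disagreements is a mismatch query against the Gold view:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Reviews where the AI sentiment disagrees with the star rating
SELECT review_id, star_rating, ai_sentiment, review_text
FROM aiclassifyexp.gold.v_review_sentiment
WHERE (star_rating &amp;gt;= 4 AND ai_sentiment = &apos;Negative&apos;)
   OR (star_rating &amp;lt;= 2 AND ai_sentiment = &apos;Positive&apos;);
&lt;/code&gt;&lt;/pre&gt;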
&lt;h3&gt;Gold View 2: Ticket Department Routing&lt;/h3&gt;
&lt;p&gt;This view uses &lt;code&gt;AI_CLASSIFY&lt;/code&gt; to automatically route support tickets to the right department based on the ticket description. Instead of a human reading every ticket and assigning it, the LLM reads the description and selects from four departments.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE OR REPLACE VIEW aiclassifyexp.gold.v_ticket_routing AS
SELECT
  ticket_id,
  customer_name,
  product_name,
  ticket_description,
  assigned_priority,
  ticket_status,
  created_timestamp,
  resolved_timestamp,
  AI_CLASSIFY(
    &apos;Based on this support ticket, which department should handle it: &apos; || ticket_description,
    ARRAY[&apos;Billing&apos;, &apos;Technical Support&apos;, &apos;Feature Request&apos;, &apos;Account Management&apos;]
  ) AS ai_department,
  AI_CLASSIFY(
    &apos;Rate the urgency of this support ticket: &apos; || ticket_description,
    ARRAY[&apos;Critical&apos;, &apos;High&apos;, &apos;Medium&apos;, &apos;Low&apos;]
  ) AS ai_urgency
FROM aiclassifyexp.bronze.v_tickets;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This view applies two separate &lt;code&gt;AI_CLASSIFY&lt;/code&gt; calls on each row: one for department routing and one for urgency. You can compare &lt;code&gt;ai_urgency&lt;/code&gt; against the manually assigned &lt;code&gt;assigned_priority&lt;/code&gt; to find tickets where human triage may have underestimated or overestimated severity.&lt;/p&gt;
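&lt;p&gt;A quick audit query makes that comparison concrete:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Tickets where human triage and the AI urgency rating disagree
SELECT ticket_id, assigned_priority, ai_urgency, ticket_description
FROM aiclassifyexp.gold.v_ticket_routing
WHERE assigned_priority &amp;lt;&amp;gt; ai_urgency;
&lt;/code&gt;&lt;/pre&gt;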
&lt;h3&gt;Using Numeric Categories&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; also supports numeric arrays. If you want a 1-5 satisfaction score instead of text labels:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  review_id,
  review_text,
  AI_CLASSIFY(
    &apos;Rate customer satisfaction from 1 (very dissatisfied) to 5 (very satisfied): &apos; || review_text,
    ARRAY[1, 2, 3, 4, 5]
  ) AS ai_satisfaction_score
FROM aiclassifyexp.bronze.v_reviews;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The LLM returns an &lt;code&gt;INT&lt;/code&gt; because the array contains integers. This is useful when you need numeric scores for aggregation, averages, or trend analysis.&lt;/p&gt;
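&lt;p&gt;For example, you can average the numeric score per product by wrapping the classification in a CTE so the aggregate operates on its output:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;WITH scored AS (
  SELECT
    product_name,
    AI_CLASSIFY(
      &apos;Rate customer satisfaction from 1 (very dissatisfied) to 5 (very satisfied): &apos; || review_text,
      ARRAY[1, 2, 3, 4, 5]
    ) AS ai_satisfaction_score
  FROM aiclassifyexp.bronze.v_reviews
)
SELECT
  product_name,
  AVG(ai_satisfaction_score) AS avg_score,
  COUNT(*) AS review_count
FROM scored
GROUP BY product_name
ORDER BY avg_score DESC;
&lt;/code&gt;&lt;/pre&gt;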
&lt;h2&gt;Persisting Results with CTAS&lt;/h2&gt;
&lt;p&gt;AI function calls consume LLM tokens on every query execution. For dashboards or reports that run the same classification repeatedly, materialize the results into an Iceberg table with &lt;code&gt;CREATE TABLE AS SELECT&lt;/code&gt; (CTAS):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE aiclassifyexp.gold.classified_reviews AS
SELECT * FROM aiclassifyexp.gold.v_review_sentiment;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a physical Iceberg table with the AI classifications baked in. Subsequent queries against &lt;code&gt;classified_reviews&lt;/code&gt; are standard SQL queries with no LLM cost. Refresh the table periodically (daily, weekly) as new reviews come in by running CTAS again with &lt;code&gt;CREATE OR REPLACE TABLE&lt;/code&gt;.&lt;/p&gt;
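&lt;p&gt;A scheduled refresh simply reruns the statement:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Re-materialize with the latest reviews; the refresh runs the classification
-- once for the full result set, after which queries carry no LLM cost
CREATE OR REPLACE TABLE aiclassifyexp.gold.classified_reviews AS
SELECT * FROM aiclassifyexp.gold.v_review_sentiment;
&lt;/code&gt;&lt;/pre&gt;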
&lt;h2&gt;Managing AI Workloads&lt;/h2&gt;
&lt;p&gt;AI function queries are more resource-intensive than standard SQL. Dremio provides engine routing to isolate these workloads:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Dremio provides these routing functions for workload management:
-- query_calls_ai_functions() — returns true if the query uses AI functions
-- query_has_attribute(&apos;AI_FUNCTIONS&apos;) — same check, different syntax
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your Dremio Cloud project settings, you can create engine routing rules that automatically direct queries containing AI functions to a dedicated engine. This prevents a large classification batch job from competing with your executive dashboards for compute resources. Set up a separate engine with appropriate scaling for AI workloads, and create a routing rule using &lt;code&gt;query_calls_ai_functions()&lt;/code&gt; to send AI queries there automatically.&lt;/p&gt;
&lt;h2&gt;Choosing Your Model Provider&lt;/h2&gt;
&lt;p&gt;The optional &lt;code&gt;model_name&lt;/code&gt; parameter lets you target specific models for different tasks:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Use a specific model for classification
SELECT AI_CLASSIFY(
  &apos;openai.gpt-4o&apos;,
  &apos;Classify this ticket: &apos; || ticket_description,
  ARRAY[&apos;Billing&apos;, &apos;Technical Support&apos;, &apos;Feature Request&apos;]
) AS department
FROM aiclassifyexp.bronze.v_tickets;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio supports multiple providers: OpenAI, Anthropic, Google Gemini, AWS Bedrock, and Azure OpenAI. You configure providers in Admin → Project Settings → Preferences → AI. The format is &lt;code&gt;providerName.modelName&lt;/code&gt;, where &lt;code&gt;providerName&lt;/code&gt; is the name you gave the provider during setup.&lt;/p&gt;
&lt;p&gt;If you skip &lt;code&gt;model_name&lt;/code&gt;, Dremio uses your default model. For most classification tasks, the default model works well. Specifying a model makes sense when you need a particular model&apos;s strengths (like a smaller, faster model for simple sentiment vs. a larger model for nuanced multi-class categorization).&lt;/p&gt;
&lt;h2&gt;Step 6: Enable AI-Generated Wikis and Tags&lt;/h2&gt;
&lt;p&gt;Good metadata makes the AI Agent more accurate when answering natural language questions. Here&apos;s how to add context to your Gold views:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Admin&lt;/strong&gt; in the left sidebar, then go to &lt;strong&gt;Project Settings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Preferences&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Scroll to the &lt;strong&gt;AI&lt;/strong&gt; section and enable &lt;strong&gt;Generate Wikis and Labels&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;Catalog&lt;/strong&gt; and navigate to your Gold views under &lt;code&gt;aiclassifyexp.gold&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Edit&lt;/strong&gt; button (pencil icon) next to the desired view.&lt;/li&gt;
&lt;li&gt;In the &lt;strong&gt;Details&lt;/strong&gt; tab, find the &lt;strong&gt;Wiki&lt;/strong&gt; section and click &lt;strong&gt;Generate Wiki&lt;/strong&gt;. Do the same for the &lt;strong&gt;Tags&lt;/strong&gt; section by clicking &lt;strong&gt;Generate Tags&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Repeat for each Gold view.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To enhance the generated wiki with additional business context, copy the output into the &lt;strong&gt;AI Agent&lt;/strong&gt; on the homepage and ask it to produce an improved version in a markdown code block. For example, ask the Agent to add details like &amp;quot;Positive sentiment reviews are candidates for testimonial collection. Negative sentiment reviews with support tickets should trigger a customer success outreach.&amp;quot; Copy the Agent&apos;s refined output and paste it back into the wiki editor.&lt;/p&gt;
&lt;p&gt;Wikis and labels are the context that Dremio&apos;s AI Agent reads before generating SQL. Better metadata produces more accurate natural language responses.&lt;/p&gt;
&lt;h2&gt;Step 7: Ask Questions with the AI Agent&lt;/h2&gt;
&lt;p&gt;With your classified data and enriched wikis in place, navigate to the AI Agent on the Dremio homepage and try these prompts:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Which products have the most negative reviews?&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_review_sentiment&lt;/code&gt;, filters by &lt;code&gt;ai_sentiment = &apos;Negative&apos;&lt;/code&gt;, groups by &lt;code&gt;product_name&lt;/code&gt;, and returns a count. You&apos;ll see which products need attention based on LLM-analyzed sentiment rather than just star ratings.&lt;/p&gt;
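&lt;p&gt;Under the hood, the SQL the Agent generates might look something like this sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Negative reviews counted per product
SELECT
  product_name,
  COUNT(*) AS negative_reviews
FROM aiclassifyexp.gold.v_review_sentiment
WHERE ai_sentiment = &apos;Negative&apos;
GROUP BY product_name
ORDER BY negative_reviews DESC;
&lt;/code&gt;&lt;/pre&gt;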
&lt;p&gt;&lt;strong&gt;&amp;quot;Show me a chart of ticket routing by department&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent queries &lt;code&gt;v_ticket_routing&lt;/code&gt;, groups by &lt;code&gt;ai_department&lt;/code&gt;, and generates a bar chart showing how tickets distribute across Billing, Technical Support, Feature Request, and Account Management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;List all critical urgency tickets that are still open, ordered by creation date&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent filters &lt;code&gt;v_ticket_routing&lt;/code&gt; for &lt;code&gt;ai_urgency = &apos;Critical&apos;&lt;/code&gt; and &lt;code&gt;ticket_status = &apos;Open&apos;&lt;/code&gt;, sorts by &lt;code&gt;created_timestamp&lt;/code&gt;, and returns the results. This surfaces tickets that AI flagged as critical but haven&apos;t been resolved.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;Create a chart showing sentiment distribution by product and whether the customer has a support ticket&amp;quot;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Agent creates a multi-dimensional visualization from &lt;code&gt;v_review_sentiment&lt;/code&gt;, cross-referencing &lt;code&gt;product_name&lt;/code&gt;, &lt;code&gt;ai_sentiment&lt;/code&gt;, and &lt;code&gt;has_support_ticket&lt;/code&gt;. This reveals patterns like &amp;quot;CloudSync Pro has the most negative reviews among customers who also filed support tickets.&amp;quot;&lt;/p&gt;
&lt;h2&gt;Why Apache Iceberg Matters&lt;/h2&gt;
&lt;p&gt;All the tables you created in this tutorial are Apache Iceberg tables stored in Dremio&apos;s built-in Open Catalog. Iceberg provides ACID transactions, schema evolution, and time travel, but the performance benefits are especially relevant for AI-classified data.&lt;/p&gt;
&lt;h3&gt;Iceberg vs. Federated for AI Workloads&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Keep data federated when:&lt;/strong&gt; Your classification needs real-time source data; for example, classifying support tickets as they arrive from a live PostgreSQL database. Use manual Reflections with a short refresh interval to accelerate federated queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Migrate to Iceberg when:&lt;/strong&gt; You&apos;re running batch classification jobs on historical data. The CTAS approach above creates Iceberg tables. Iceberg&apos;s automated performance management (compaction, manifest optimization, clustering) keeps these growing tables fast. Autonomous Reflections can auto-create pre-computed materializations based on how your dashboards query the classified data.&lt;/p&gt;
&lt;h3&gt;Cost Optimization Pattern&lt;/h3&gt;
&lt;p&gt;Run &lt;code&gt;AI_CLASSIFY&lt;/code&gt; once via CTAS to materialize results. Build Reflections on the materialized table for dashboard queries. This pattern means you pay for LLM tokens once during classification, and all subsequent analytical queries hit cached Reflections at zero LLM cost.&lt;/p&gt;
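&lt;p&gt;As a two-step sketch (the Reflection DDL below assumes Dremio&apos;s &lt;code&gt;ALTER TABLE ... CREATE AGGREGATE REFLECTION&lt;/code&gt; syntax; you can also create the Reflection from the Catalog UI):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Step 1: pay for LLM tokens once
CREATE OR REPLACE TABLE aiclassifyexp.gold.classified_reviews AS
SELECT * FROM aiclassifyexp.gold.v_review_sentiment;

-- Step 2: accelerate dashboard aggregations at zero LLM cost
ALTER TABLE aiclassifyexp.gold.classified_reviews
  CREATE AGGREGATE REFLECTION review_sentiment_agg
  USING DIMENSIONS (product_name, ai_sentiment)
  MEASURES (review_id (COUNT));
&lt;/code&gt;&lt;/pre&gt;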
&lt;h2&gt;Next Steps&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Connect real data sources&lt;/strong&gt; — replace the &lt;code&gt;feedback_data&lt;/code&gt; folder with federated connections to your actual CRM, support platform, and review system&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Add Fine-Grained Access Control (FGAC)&lt;/strong&gt; — mask customer names or PII in classified results so analysts see sentiment patterns without accessing personal data&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Experiment with Boolean classification&lt;/strong&gt; — use &lt;code&gt;ARRAY[true, false]&lt;/code&gt; for binary decisions like &amp;quot;Is this review about a security concern?&amp;quot; or &amp;quot;Does this ticket mention data loss?&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale with Reflections&lt;/strong&gt; — create Reflections on your materialized classification tables to accelerate dashboard queries&lt;/li&gt;
&lt;/ol&gt;
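&lt;p&gt;The Boolean variant from item 3 might look like this (column names follow the ticket view used earlier; &lt;code&gt;ticket_id&lt;/code&gt; is assumed):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Binary flag: does the ticket mention data loss?
SELECT
  ticket_id,
  AI_CLASSIFY(
    &apos;Does this ticket mention data loss? &apos; || ticket_description,
    ARRAY[true, false]
  ) AS mentions_data_loss
FROM aiclassifyexp.bronze.v_tickets;
&lt;/code&gt;&lt;/pre&gt;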
&lt;p&gt;If you&apos;re running manual classification processes today, whether it&apos;s tagging support tickets, scoring reviews, or categorizing feedback, &lt;code&gt;AI_CLASSIFY&lt;/code&gt; replaces those workflows with a single SQL query. The classification runs inside the same platform where your data lives, governed by the same access controls, and immediately available to every BI tool and AI agent connected to Dremio.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=ai-classify-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and start classifying your data with SQL.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Connect PostgreSQL to Dremio Cloud: Query, Federate, and Accelerate Your Data</title><link>https://iceberglakehouse.com/posts/2026-03-connector-postgresql/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-03-connector-postgresql/</guid><description>
PostgreSQL powers more production applications than almost any other open-source database. It&apos;s where your customer records, transaction logs, produc...</description><pubDate>Sun, 01 Mar 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;PostgreSQL powers more production applications than almost any other open-source database. It&apos;s where your customer records, transaction logs, product catalogs, and operational data live. But running analytics directly against PostgreSQL creates problems: heavy analytical queries compete with transactional workloads, cross-database joins require custom ETL, and your data team can&apos;t access PostgreSQL data alongside data in S3, Snowflake, or other systems without building pipelines.&lt;/p&gt;
&lt;p&gt;Dremio Cloud solves this by connecting directly to PostgreSQL and querying it in place. No data movement, no ETL pipelines, no replica databases. You write SQL in Dremio, and it pushes filtering and aggregation work back to PostgreSQL when possible, fetches only the results, and lets you join that data with any other connected source in the same query.&lt;/p&gt;
&lt;p&gt;This guide walks through connecting PostgreSQL to Dremio Cloud, from prerequisites to your first federated query.&lt;/p&gt;
&lt;h2&gt;Why PostgreSQL Users Need Dremio&lt;/h2&gt;
&lt;p&gt;PostgreSQL is an excellent transactional database, but it wasn&apos;t designed for the analytics patterns that modern teams need. Here are the problems Dremio solves:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-source analytics without pipelines.&lt;/strong&gt; Your customer data is in PostgreSQL, your clickstream data is in S3, and your revenue data is in Snowflake. Without Dremio, joining these datasets requires building ETL pipelines to centralize everything into one system. With Dremio, you connect all three as sources and write a single SQL query that joins across them. Dremio handles the federation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Protect production performance.&lt;/strong&gt; Running heavy &lt;code&gt;GROUP BY&lt;/code&gt; queries or full-table scans against your production PostgreSQL instance can degrade application performance. Dremio&apos;s Reflections solve this by creating pre-computed materializations of your most common analytical queries. After the first query, subsequent queries hit the Reflection instead of PostgreSQL, eliminating load on your production database.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Business context for AI.&lt;/strong&gt; Raw PostgreSQL tables have technical column names like &lt;code&gt;cust_id&lt;/code&gt; and &lt;code&gt;txn_amt&lt;/code&gt;. Dremio&apos;s semantic layer lets you create views that rename and restructure these columns with business logic, then attach wiki descriptions and labels. When your team asks Dremio&apos;s AI Agent &amp;quot;Who are our highest-value customers?&amp;quot;, the Agent understands what &amp;quot;highest-value&amp;quot; means because you&apos;ve defined it in the semantic layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Governance without modifying PostgreSQL.&lt;/strong&gt; Dremio&apos;s Fine-Grained Access Control (FGAC) lets you mask sensitive columns (Social Security numbers, email addresses) and filter rows based on user roles. You don&apos;t need to modify PostgreSQL permissions or create restricted database views — the governance layer lives in Dremio and applies across all tools and users.&lt;/p&gt;
&lt;h2&gt;What You Need Before Connecting&lt;/h2&gt;
&lt;p&gt;Before configuring the connection in Dremio, make sure you have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PostgreSQL hostname or IP address&lt;/strong&gt; — the network address of your database server&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port number&lt;/strong&gt; — PostgreSQL defaults to &lt;code&gt;5432&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database name&lt;/strong&gt; — the specific database you want to connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Username and password&lt;/strong&gt; — credentials for a user with read access to the tables you want to query&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network accessibility&lt;/strong&gt; — Dremio Cloud connects to your PostgreSQL instance over the public internet by default. Ensure port 5432 (or your custom port) is open in your AWS Security Group, Azure NSG, or firewall rules&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If your PostgreSQL instance is in a private subnet (common for production databases), you&apos;ll need to configure networking to allow Dremio Cloud to reach it. Check &lt;a href=&quot;https://docs.dremio.com/dremio-cloud/bring-data/connect/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-postgresql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&apos;s network connectivity documentation&lt;/a&gt; for options.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dremio Cloud account:&lt;/strong&gt; Sign up at &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-postgresql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;dremio.com/get-started&lt;/a&gt; for a free 30-day trial with $400 in compute credits.&lt;/p&gt;
&lt;h2&gt;Step-by-Step: Connect PostgreSQL to Dremio Cloud&lt;/h2&gt;
&lt;h3&gt;1. Add a New Source&lt;/h3&gt;
&lt;p&gt;In the Dremio console, click the &lt;strong&gt;&amp;quot;+&amp;quot;&lt;/strong&gt; button in the left sidebar and select &lt;strong&gt;PostgreSQL&lt;/strong&gt; from the database source types. Alternatively, navigate to &lt;strong&gt;Databases&lt;/strong&gt; in the data panel and click &lt;strong&gt;Add database&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;2. Configure General Settings&lt;/h3&gt;
&lt;p&gt;Fill in the connection details:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name:&lt;/strong&gt; Enter a descriptive name for this source (e.g., &lt;code&gt;production-postgres&lt;/code&gt; or &lt;code&gt;crm-database&lt;/code&gt;). This name will appear in your SQL queries when referencing tables from this source. Note: the name cannot include &lt;code&gt;/&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt;, &lt;code&gt;[&lt;/code&gt;, or &lt;code&gt;]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; Enter your PostgreSQL hostname (e.g., &lt;code&gt;my-db.cluster-abc123.us-east-1.rds.amazonaws.com&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port:&lt;/strong&gt; Enter the port number. The default is &lt;code&gt;5432&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database:&lt;/strong&gt; Enter the database name you want to connect to.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt connection:&lt;/strong&gt; Toggle this on to use SSL encryption between Dremio and PostgreSQL. Recommended for production connections, especially when connecting over the internet.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Set Authentication&lt;/h3&gt;
&lt;p&gt;Choose one of two authentication methods:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Master Authentication (default):&lt;/strong&gt; Provide a username and password directly. This is the simplest option — enter the credentials for a PostgreSQL user that has &lt;code&gt;SELECT&lt;/code&gt; permissions on the tables you want to query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Secret Resource URL:&lt;/strong&gt; Instead of storing the password in Dremio, provide an AWS Secrets Manager ARN (e.g., &lt;code&gt;arn:aws:secretsmanager:us-west-2:123456789012:secret:my-rds-secret-VNenFy&lt;/code&gt;). Dremio fetches the password from Secrets Manager at connection time. This is the preferred option for production deployments because it centralizes credential management and supports rotation.&lt;/p&gt;
&lt;h3&gt;4. Configure Advanced Options (Optional)&lt;/h3&gt;
&lt;p&gt;The advanced options let you fine-tune connection behavior:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Record fetch size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Number of rows Dremio fetches per batch. Set to 0 for automatic.&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maximum Idle Connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How many idle connections Dremio maintains to PostgreSQL.&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Idle Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds before an idle connection is closed.&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encryption Validation Mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When SSL is enabled: validate certificate + hostname, certificate only, or no validation.&lt;/td&gt;
&lt;td&gt;Validate both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom key-value pairs for JDBC connection parameters.&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For most users, the defaults work fine. If you&apos;re connecting to an Amazon RDS or Aurora instance, the default SSL settings are compatible.&lt;/p&gt;
&lt;h3&gt;5. Set Reflection Refresh Schedule&lt;/h3&gt;
&lt;p&gt;This controls how often Dremio refreshes pre-computed Reflections built on PostgreSQL data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Refresh every:&lt;/strong&gt; How often Reflections update (hours, days, or weeks). More frequent refreshes mean fresher data but more queries against PostgreSQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Expire after:&lt;/strong&gt; How long before unused Reflections are automatically removed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For operational PostgreSQL data that changes throughout the day, a refresh interval of 1-4 hours is typical. For historical data that rarely changes, daily or weekly is sufficient.&lt;/p&gt;
&lt;h3&gt;6. Configure Metadata Refresh&lt;/h3&gt;
&lt;p&gt;These settings control how often Dremio checks PostgreSQL for new or changed tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dataset Discovery (Fetch every):&lt;/strong&gt; How often Dremio looks for new tables or schema changes. Default is 1 hour.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dataset Details (Fetch every):&lt;/strong&gt; How often Dremio refreshes detailed metadata for tables you&apos;ve already queried. Default is 1 hour.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;7. Set Privileges and Save&lt;/h3&gt;
&lt;p&gt;Optionally, restrict which Dremio users or roles can access this PostgreSQL source. Click &lt;strong&gt;Save&lt;/strong&gt; to create the connection.&lt;/p&gt;
&lt;h2&gt;Query PostgreSQL Data from Dremio&lt;/h2&gt;
&lt;p&gt;Once connected, your PostgreSQL database appears as a source in Dremio&apos;s SQL Runner. Browse the source to see schemas and tables, then query them directly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT customer_id, first_name, last_name, signup_date
FROM &amp;quot;production-postgres&amp;quot;.public.customers
WHERE signup_date &amp;gt; &apos;2024-01-01&apos;
ORDER BY signup_date DESC
LIMIT 100;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The source name (&lt;code&gt;production-postgres&lt;/code&gt;) is the name you gave the source. PostgreSQL schemas appear as sub-folders, and tables appear within those schemas.&lt;/p&gt;
&lt;h2&gt;Federate PostgreSQL with Other Sources&lt;/h2&gt;
&lt;p&gt;The real value appears when you combine PostgreSQL data with other sources. Here&apos;s an example that joins PostgreSQL customer data with S3 clickstream data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  c.customer_id,
  c.first_name || &apos; &apos; || c.last_name AS customer_name,
  c.segment,
  COUNT(e.event_id) AS total_events,
  SUM(CASE WHEN e.event_type = &apos;purchase&apos; THEN 1 ELSE 0 END) AS purchases
FROM &amp;quot;production-postgres&amp;quot;.public.customers c
LEFT JOIN &amp;quot;s3-clickstream&amp;quot;.events.user_events e
  ON c.customer_id = e.user_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.segment
ORDER BY purchases DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio pushes the filter and projection operations to PostgreSQL (this is called &lt;strong&gt;predicate pushdown&lt;/strong&gt;), fetches only the matching rows, then joins them with the S3 data in Dremio&apos;s query engine. PostgreSQL handles what it&apos;s good at (filtering indexed columns), and Dremio handles the cross-source join.&lt;/p&gt;
&lt;h2&gt;Build a Semantic Layer Over PostgreSQL&lt;/h2&gt;
&lt;p&gt;Create views to give your PostgreSQL data business-friendly names and logic:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW analytics.gold.customer_overview AS
SELECT
  c.customer_id,
  c.first_name || &apos; &apos; || c.last_name AS full_name,
  c.email,
  c.segment AS customer_segment,
  c.signup_date,
  CASE
    WHEN c.segment = &apos;Enterprise&apos; AND c.lifetime_value &amp;gt; 50000 THEN &apos;Strategic&apos;
    WHEN c.lifetime_value &amp;gt; 10000 THEN &apos;High Value&apos;
    ELSE &apos;Standard&apos;
  END AS account_tier
FROM &amp;quot;production-postgres&amp;quot;.public.customers c;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then attach wiki descriptions and labels through the Catalog (edit pencil icon → Details tab → Generate Wiki/Tags) so the AI Agent understands the data when users ask natural language questions.&lt;/p&gt;
&lt;h2&gt;Predicate Pushdown: What Dremio Offloads to PostgreSQL&lt;/h2&gt;
&lt;p&gt;Dremio doesn&apos;t download entire PostgreSQL tables and process them locally. When possible, it pushes operations back to PostgreSQL to minimize data transfer. The PostgreSQL connector supports an extensive set of pushdowns, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Logical operators:&lt;/strong&gt; AND, OR, NOT&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Comparisons:&lt;/strong&gt; =, !=, &amp;lt;, &amp;gt;, &amp;lt;=, &amp;gt;=, BETWEEN, IN, LIKE, IS NULL, IS NOT NULL&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregations:&lt;/strong&gt; SUM, AVG, COUNT, MIN, MAX, STDDEV, MEDIAN, VAR_POP&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Math functions:&lt;/strong&gt; ABS, CEIL, FLOOR, ROUND, MOD, SQRT, POWER, LOG&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;String functions:&lt;/strong&gt; CONCAT, SUBSTR, LENGTH, LOWER, UPPER, TRIM, REPLACE, REVERSE&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Date functions:&lt;/strong&gt; DATE_ADD, DATE_SUB, DATE_TRUNC (day, hour, month, quarter, year), EXTRACT&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means a query like &lt;code&gt;SELECT department, AVG(salary) FROM postgres.hr.employees WHERE hire_date &amp;gt; &apos;2023-01-01&apos; GROUP BY department&lt;/code&gt; runs mostly on PostgreSQL — Dremio sends the filter, aggregation, and grouping to Postgres and only transfers the summarized result.&lt;/p&gt;
&lt;h2&gt;Accelerate PostgreSQL Queries with Reflections&lt;/h2&gt;
&lt;p&gt;For queries that run frequently, create Reflections to avoid hitting PostgreSQL repeatedly:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Build a view over your PostgreSQL data.&lt;/li&gt;
&lt;li&gt;In the Catalog, select the view and create a Reflection.&lt;/li&gt;
&lt;li&gt;Choose the columns and aggregations to include.&lt;/li&gt;
&lt;li&gt;Set the refresh interval (how often Dremio re-queries PostgreSQL to update the Reflection).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After the Reflection is built, Dremio&apos;s query optimizer automatically routes matching queries to the Reflection instead of PostgreSQL. Your analysts see the same tables and write the same SQL — the acceleration is transparent.&lt;/p&gt;
&lt;p&gt;This is particularly valuable for dashboard queries. BI tools like Tableau or Power BI connected to Dremio via Arrow Flight/ODBC get sub-second response times from Reflections, even though the source data lives in PostgreSQL.&lt;/p&gt;
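&lt;p&gt;If you prefer SQL over the UI, the same steps can be sketched with Dremio&apos;s Reflection DDL (syntax assumed here; adjust names to your environment) over the semantic-layer view defined earlier:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Aggregate Reflection over the PostgreSQL-backed view
ALTER TABLE analytics.gold.customer_overview
  CREATE AGGREGATE REFLECTION customer_tiers_agg
  USING DIMENSIONS (customer_segment, account_tier)
  MEASURES (customer_id (COUNT));
&lt;/code&gt;&lt;/pre&gt;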
&lt;h2&gt;Data Type Mapping&lt;/h2&gt;
&lt;p&gt;Dremio automatically maps PostgreSQL types to Dremio types. The key mappings to know:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;Dremio&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BIGINT / BIGSERIAL&lt;/td&gt;
&lt;td&gt;BIGINT&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;INT / SERIAL&lt;/td&gt;
&lt;td&gt;INTEGER&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NUMERIC&lt;/td&gt;
&lt;td&gt;DECIMAL&lt;/td&gt;
&lt;td&gt;Preserves precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VARCHAR / TEXT / CHAR&lt;/td&gt;
&lt;td&gt;VARCHAR&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BOOLEAN / BIT&lt;/td&gt;
&lt;td&gt;BOOLEAN&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;DATE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TIMESTAMP / TIMESTAMPTZ&lt;/td&gt;
&lt;td&gt;TIMESTAMP&lt;/td&gt;
&lt;td&gt;Timezone-aware values are converted to a zone-less TIMESTAMP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FLOAT4 / FLOAT8&lt;/td&gt;
&lt;td&gt;FLOAT / DOUBLE&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BYTEA&lt;/td&gt;
&lt;td&gt;VARBINARY&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MONEY&lt;/td&gt;
&lt;td&gt;DOUBLE&lt;/td&gt;
&lt;td&gt;Converted to numeric&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Most types map directly. PostgreSQL-specific types such as &lt;code&gt;JSONB&lt;/code&gt;, &lt;code&gt;ARRAY&lt;/code&gt;, and &lt;code&gt;HSTORE&lt;/code&gt; are not supported by Dremio&apos;s connector and won&apos;t appear in query results.&lt;/p&gt;
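&lt;p&gt;A common workaround is to expose unsupported columns as text on the PostgreSQL side, for example with a database view (a sketch assuming a hypothetical &lt;code&gt;preferences&lt;/code&gt; JSONB column; run this in PostgreSQL, not Dremio):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- PostgreSQL-side view: cast JSONB to TEXT so Dremio sees a VARCHAR
CREATE VIEW public.customers_flat AS
SELECT
  customer_id,
  preferences::text AS preferences_json
FROM public.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Dremio then maps the &lt;code&gt;text&lt;/code&gt; column to &lt;code&gt;VARCHAR&lt;/code&gt;, which you can parse downstream as needed.&lt;/p&gt;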
&lt;h2&gt;AI-Powered Analytics on PostgreSQL Data&lt;/h2&gt;
&lt;h3&gt;Dremio AI Agent&lt;/h3&gt;
&lt;p&gt;The built-in AI Agent lets users ask questions about PostgreSQL data in plain English. Instead of writing SQL, a business user asks &amp;quot;Who are our highest-value enterprise customers?&amp;quot; and the Agent generates the correct query by reading the wiki descriptions attached to your semantic layer views. The Agent understands that &amp;quot;highest-value&amp;quot; maps to &lt;code&gt;lifetime_value&lt;/code&gt; and &amp;quot;enterprise&amp;quot; maps to &lt;code&gt;segment = &apos;Enterprise&apos;&lt;/code&gt; because you&apos;ve defined it in the view&apos;s wiki.&lt;/p&gt;
&lt;h3&gt;Dremio MCP Server&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/dremio/dremio-mcp&quot;&gt;Dremio MCP Server&lt;/a&gt; connects external AI chat clients — Claude, ChatGPT, and others — to your PostgreSQL data through Dremio. The hosted MCP Server provides OAuth authentication that propagates user identity and authorization for every interaction:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a Native OAuth application in Dremio Cloud&lt;/li&gt;
&lt;li&gt;Configure redirect URLs for your AI client (e.g., &lt;code&gt;https://claude.ai/api/mcp/auth_callback&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Connect using &lt;code&gt;mcp.dremio.cloud/mcp/{project_id}&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A sales director can ask Claude &amp;quot;Show me our strategic account customers who signed up in Q1&amp;quot; and get governed, accurate results from your PostgreSQL data without SQL.&lt;/p&gt;
&lt;h3&gt;AI SQL Functions&lt;/h3&gt;
&lt;p&gt;Use AI directly in queries against PostgreSQL data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Classify customers based on their profile data
SELECT
  full_name,
  customer_segment,
  account_tier,
  AI_CLASSIFY(
    &apos;Based on this customer profile, predict their likely next action&apos;,
    &apos;Customer: &apos; || full_name || &apos;, Segment: &apos; || customer_segment || &apos;, Tier: &apos; || account_tier,
    ARRAY[&apos;Upsell Opportunity&apos;, &apos;Renewal Risk&apos;, &apos;Expansion Ready&apos;, &apos;Stable&apos;]
  ) AS predicted_action
FROM analytics.gold.customer_overview
WHERE account_tier IN (&apos;Strategic&apos;, &apos;High Value&apos;);

-- Generate personalized engagement plans
SELECT
  full_name,
  AI_GENERATE(
    &apos;Write a one-sentence personalized engagement recommendation&apos;,
    &apos;Customer: &apos; || full_name || &apos;, Segment: &apos; || customer_segment || &apos;, Tier: &apos; || account_tier || &apos;, Signup: &apos; || CAST(signup_date AS VARCHAR)
  ) AS engagement_recommendation
FROM analytics.gold.customer_overview
WHERE account_tier = &apos;Strategic&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;AI_CLASSIFY&lt;/code&gt; categorizes data with LLM inference inside SQL. &lt;code&gt;AI_GENERATE&lt;/code&gt; produces text. &lt;code&gt;AI_SIMILARITY&lt;/code&gt; (not shown) finds semantic matches between text fields. All run directly in your query.&lt;/p&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;
&lt;p&gt;If you run PostgreSQL for your application data and want to include it in cross-source analytics, AI-driven queries, or governed dashboards without building ETL pipelines, Dremio Cloud is the fastest path.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=connector-postgresql-dremio-cloud&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt; and connect your PostgreSQL instance in under 5 minutes.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Engineering Best Practices: The Complete Checklist</title><link>https://iceberglakehouse.com/posts/2026-02-debp-de-best-practices-checklist/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-de-best-practices-checklist/</guid><description>
Best practices d...</description><pubDate>Wed, 18 Feb 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/10/de-checklist.png&quot; alt=&quot;Comprehensive data engineering checklist organized by categories with status indicators&quot;&gt;&lt;/p&gt;
&lt;p&gt;Best practices documents are easy to write and hard to use. They list principles without context, advice without prioritization, and rules without explaining when to break them. This one is different. It&apos;s a practical, tool-agnostic checklist organized by the categories that matter most — with each item tied to a specific outcome.&lt;/p&gt;
&lt;p&gt;Use this as a recurring audit. Run through it quarterly. Any unchecked item is either technical debt or a conscious tradeoff. Know which is which.&lt;/p&gt;
&lt;h2&gt;Pipeline Design&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Separate ingestion from transformation.&lt;/strong&gt; Raw data lands unchanged. Business logic runs separately. This lets you replay raw data and isolate failures.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Model pipelines as DAGs.&lt;/strong&gt; Each stage has explicit inputs and outputs. Independent stages run in parallel. Failed stages retry alone.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Make dependencies explicit.&lt;/strong&gt; If pipeline B needs the output of pipeline A, declare that dependency in your orchestrator. Don&apos;t rely on timing assumptions.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Use sensors or triggers for scheduling.&lt;/strong&gt; Wait for data to arrive, not for the clock to hit a certain time. Data-driven triggers are more reliable than cron jobs.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Keep stages single-purpose.&lt;/strong&gt; An ingestion stage should not also validate, transform, and load. Each stage does one thing and does it well.&lt;/li&gt;
&lt;/ul&gt;
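&lt;p&gt;The dependency items above can be sketched in a few lines. This is a hypothetical example using Python&apos;s standard-library &lt;code&gt;graphlib&lt;/code&gt;; the stage names are illustrative, and a real orchestrator (Airflow, Dagster, Prefect) expresses the same idea with its own DAG API.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
stages = {
    'ingest_orders': set(),
    'ingest_customers': set(),
    'validate_orders': {'ingest_orders'},
    'join_orders_customers': {'validate_orders', 'ingest_customers'},
    'publish_gold': {'join_orders_customers'},
}

# One valid serial ordering; the two independent ingest stages
# could also run in parallel.
order = list(TopologicalSorter(stages).static_order())
print(order)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the dependencies are declared, not assumed, &lt;code&gt;publish_gold&lt;/code&gt; can never run before its upstream stages finish.&lt;/p&gt;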
&lt;h2&gt;Data Quality&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Validate schema at ingestion.&lt;/strong&gt; Compare incoming data against expected column names, types, and nullability. Catch schema drift before it propagates.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Check completeness.&lt;/strong&gt; Required fields have no nulls. If &lt;code&gt;customer_id&lt;/code&gt; is nullable in your orders table, downstream joins will silently lose rows.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Enforce uniqueness.&lt;/strong&gt; Primary keys have no duplicates. Run dedup checks after every load. Double-counted records are worse than missing records.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Quarantine bad records.&lt;/strong&gt; Route validation failures to a quarantine table with metadata (which check failed, when, the original record). Never drop records silently.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Track quality metrics.&lt;/strong&gt; Null rates, duplicate rates, and range violations tracked per pipeline, per day. Trend these metrics to catch gradual degradation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/10/quality-checklist.png&quot; alt=&quot;Data quality checklist: schema validation, completeness, uniqueness, quarantine&quot;&gt;&lt;/p&gt;
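&lt;p&gt;A quarantine step does not need much machinery. Here is a minimal Python sketch; the field names and checks are illustrative, not a prescribed schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datetime import datetime, timezone

def validate(record):
    # Illustrative checks for an orders feed; adapt to your schema.
    if record.get('customer_id') is None:
        return 'null_customer_id'
    if record.get('amount', 0) &amp;lt; 0:
        return 'negative_amount'
    return None

def route(records):
    clean, quarantine = [], []
    for rec in records:
        failure = validate(rec)
        if failure is None:
            clean.append(rec)
        else:
            # Keep the original record plus which check failed and when,
            # so nothing is dropped silently.
            quarantine.append({
                'failed_check': failure,
                'quarantined_at': datetime.now(timezone.utc).isoformat(),
                'record': rec,
            })
    return clean, quarantine

clean, bad = route([
    {'customer_id': 1, 'amount': 10.0},
    {'customer_id': None, 'amount': 5.0},
])
&lt;/code&gt;&lt;/pre&gt;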
&lt;h2&gt;Reliability and Idempotency&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Make every pipeline idempotent.&lt;/strong&gt; Running the same job twice produces the same result. Use partition overwrite or MERGE — never blind INSERT.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Implement retry with backoff.&lt;/strong&gt; Transient failures (network, API limits) resolve themselves. Retry 3-5 times with exponential backoff before alerting.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Use dead-letter queues.&lt;/strong&gt; Records that can&apos;t be processed go to a queue for inspection, not to /dev/null.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Checkpoint progress.&lt;/strong&gt; After processing each batch or partition, record what&apos;s done. On failure, resume from the last checkpoint.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Design for failure.&lt;/strong&gt; Every component will fail. Define the expected behavior for each failure mode: retry, skip and log, alert, or halt.&lt;/li&gt;
&lt;/ul&gt;
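&lt;p&gt;Idempotency is easiest to see in code. A minimal sketch using SQLite&apos;s upsert syntax (most warehouses spell this &lt;code&gt;MERGE&lt;/code&gt;; the table and columns are illustrative): running the same load twice leaves the table unchanged.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)')

def load(rows):
    # Keyed upsert: re-running the same batch overwrites instead of
    # duplicating. A blind INSERT would double-count on retry.
    with conn:
        conn.executemany(
            'INSERT INTO orders (order_id, amount) VALUES (?, ?) '
            'ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount',
            rows,
        )

batch = [(1, 10.0), (2, 25.0)]
load(batch)
load(batch)  # simulate a retry of the same batch

count, total = conn.execute('SELECT COUNT(*), SUM(amount) FROM orders').fetchone()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After the retry, the table still holds two rows totaling 35.0, exactly as after the first run.&lt;/p&gt;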
&lt;h2&gt;Schema Management&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Treat your schema as an API.&lt;/strong&gt; Column names are fields. Tables are endpoints. Consumers are clients. Changing the schema without coordination is as bad as changing an API without versioning.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Use additive-only changes.&lt;/strong&gt; Add new columns. Never remove or rename columns without a deprecation period.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Enforce contracts at boundaries.&lt;/strong&gt; Validate that incoming schema matches expectations at ingestion. Validate that outgoing schema matches consumer contracts at serving.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Version breaking changes.&lt;/strong&gt; When a schema must change incompatibly, version it (v1, v2). Let consumers migrate on their own schedule.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Document every column.&lt;/strong&gt; Column name, type, description, source, owner. If an engineer can&apos;t find this information in under 30 seconds, it&apos;s not documented.&lt;/li&gt;
&lt;/ul&gt;
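&lt;p&gt;Treating the schema as an API implies checking it like one. A minimal Python sketch of a contract check at ingestion, with a hypothetical contract that maps each column to its type and nullability:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical contract for an orders table.
CONTRACT = {
    'order_id': ('INTEGER', False),
    'customer_id': ('INTEGER', False),
    'amount': ('REAL', True),
}

def check_schema(actual):
    # Report every deviation from the contract, not just the first.
    problems = []
    for col, spec in CONTRACT.items():
        if col not in actual:
            problems.append('missing column: ' + col)
        elif actual[col] != spec:
            problems.append('changed column: ' + col)
    for col in actual:
        if col not in CONTRACT:
            problems.append('unexpected column: ' + col)
    return problems

drift = check_schema({
    'order_id': ('INTEGER', False),
    'customer_id': ('INTEGER', True),  # nullability changed upstream
    'amt': ('REAL', True),             # column renamed upstream
})
&lt;/code&gt;&lt;/pre&gt;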
&lt;h2&gt;Testing and Validation&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Run schema tests on every pipeline execution.&lt;/strong&gt; Column existence, data types, not-null constraints. These are fast, cheap, and catch the most common problems.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Run uniqueness and null checks on primary keys.&lt;/strong&gt; The two most impactful data quality tests. Add them today.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Compare row counts against baselines.&lt;/strong&gt; Alert when today&apos;s count deviates by more than 20% from the trailing average. Catches missing data and unexpected volume spikes.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Test transformation logic with fixtures.&lt;/strong&gt; Small, known-good input datasets with expected outputs. Run these in CI before deploying pipeline changes.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Add regression tests for key business metrics.&lt;/strong&gt; Total revenue, distinct customer count, and other critical aggregations compared against previous runs.&lt;/li&gt;
&lt;/ul&gt;
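&lt;p&gt;A fixture test can be this small. The transformation below is a made-up example; the pattern is the point: a tiny known input, a hand-computed expected output, and an assertion that runs in CI.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def daily_revenue(orders):
    # Transformation under test: revenue per day, excluding cancelled orders.
    totals = {}
    for o in orders:
        if o['status'] != 'cancelled':
            totals[o['day']] = totals.get(o['day'], 0) + o['amount']
    return totals

# Small, known-good fixture with a hand-computed expected output.
FIXTURE = [
    {'day': '2026-02-01', 'amount': 100, 'status': 'complete'},
    {'day': '2026-02-01', 'amount': 50, 'status': 'cancelled'},
    {'day': '2026-02-02', 'amount': 75, 'status': 'complete'},
]
EXPECTED = {'2026-02-01': 100, '2026-02-02': 75}

assert daily_revenue(FIXTURE) == EXPECTED
&lt;/code&gt;&lt;/pre&gt;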
&lt;h2&gt;Observability and Monitoring&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Track data freshness per table.&lt;/strong&gt; The timestamp of the most recent row. Alert when it exceeds the SLA. This single metric catches more problems than any other.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Alert on business impact, not every error.&lt;/strong&gt; SLA violations, quality regressions, and anomalous volume changes are alerts. Transient retries and expected maintenance are not.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Use structured logging.&lt;/strong&gt; JSON-formatted log entries with pipeline name, stage, batch ID, timestamp, row count, and status. Searchable, parseable, filterable.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Build data lineage.&lt;/strong&gt; Know where each table&apos;s data comes from and where it goes. Column-level lineage turns &amp;quot;the numbers are wrong&amp;quot; from a half-day investigation into a 10-minute graph traversal.&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Review observability quarterly.&lt;/strong&gt; Are alerts still relevant? Are thresholds still accurate? Are dashboards still used? Trim unactionable alerts and update stale baselines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/10/observability-checklist.png&quot; alt=&quot;Observability checklist: freshness tracking, alert severity, structured logs, lineage&quot;&gt;&lt;/p&gt;
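&lt;p&gt;Lineage is what turns impact analysis into graph traversal. A minimal sketch with hypothetical table names: given edges that map each source to its direct consumers, find everything downstream of a broken table.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from collections import deque

# Hypothetical table-level lineage: each source maps to its direct consumers.
LINEAGE = {
    'raw.orders': ['silver.orders_clean'],
    'silver.orders_clean': ['gold.revenue_daily', 'gold.customer_ltv'],
    'gold.revenue_daily': ['dashboard.exec_kpis'],
}

def downstream(table):
    # Breadth-first walk: everything that could be affected when
    # this table changes or breaks.
    seen, queue = set(), deque([table])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

affected = downstream('raw.orders')
&lt;/code&gt;&lt;/pre&gt;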
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Print this checklist. Walk through it with your team in a 30-minute meeting. Check what&apos;s already in place, identify the three highest-impact unchecked items, and schedule them as engineering work — not aspirational goals on a wiki page. Best practices only matter when they&apos;re implemented.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Modeling Best Practices: 7 Mistakes to Avoid</title><link>https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-best-practices/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-best-practices/</guid><description>
![Checklist of data modeling quality markers with warning symbols on common mistakes](/assets/images/data_modeling/10/best-practices-checklist.png)

...</description><pubDate>Wed, 18 Feb 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/10/best-practices-checklist.png&quot; alt=&quot;Checklist of data modeling quality markers with warning symbols on common mistakes&quot;&gt;&lt;/p&gt;
&lt;p&gt;A bad data model doesn&apos;t announce itself. It hides behind slow dashboards, conflicting numbers, confused analysts, and AI agents that generate wrong SQL. By the time someone identifies the model as the root cause, the team has already built dozens of reports on top of it.&lt;/p&gt;
&lt;p&gt;Here are seven modeling mistakes that create downstream pain — and how to avoid each one.&lt;/p&gt;
&lt;h2&gt;Mistake 1: No Defined Grain&lt;/h2&gt;
&lt;p&gt;The grain declares what one row in a fact table represents. &amp;quot;One row per order line item.&amp;quot; &amp;quot;One row per daily user session.&amp;quot; &amp;quot;One row per monthly account balance.&amp;quot;&lt;/p&gt;
&lt;p&gt;Without a declared grain, aggregation produces wrong numbers. If some rows represent individual transactions and others represent daily summaries, a SUM query double-counts or under-counts depending on the mix.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Before designing any fact table, write down the grain in one sentence. Share it with your team. If you can&apos;t state the grain clearly, the table isn&apos;t ready for production.&lt;/p&gt;
&lt;h2&gt;Mistake 2: Cryptic Naming&lt;/h2&gt;
&lt;p&gt;Columns named &lt;code&gt;c1&lt;/code&gt;, &lt;code&gt;dt&lt;/code&gt;, &lt;code&gt;amt&lt;/code&gt;, &lt;code&gt;flg&lt;/code&gt;, and &lt;code&gt;cat_cd&lt;/code&gt; save keystrokes during development but cost hours during analysis. Every analyst who encounters these names must either read the ETL code, ask the engineer, or guess.&lt;/p&gt;
&lt;p&gt;AI agents have the same problem. An agent asked to calculate &amp;quot;total revenue&amp;quot; can&apos;t identify the right column if it&apos;s called &lt;code&gt;amt3&lt;/code&gt; instead of &lt;code&gt;revenue_usd&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use descriptive, business-friendly names. &lt;code&gt;customer_name&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;revenue_usd&lt;/code&gt;, &lt;code&gt;is_active&lt;/code&gt;, &lt;code&gt;product_category&lt;/code&gt;. Include units where ambiguous (&lt;code&gt;weight_kg&lt;/code&gt;, &lt;code&gt;duration_minutes&lt;/code&gt;). Use &lt;code&gt;snake_case&lt;/code&gt; consistently.&lt;/p&gt;
&lt;h2&gt;Mistake 3: Skipping the Conceptual Model&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/10/conceptual-foundation.png&quot; alt=&quot;Conceptual model as the foundation layer that business and technical teams align on&quot;&gt;&lt;/p&gt;
&lt;p&gt;Going straight from a stakeholder request to &lt;code&gt;CREATE TABLE&lt;/code&gt; skips the alignment step. Engineers build what they understand from the request. Stakeholders assumed something different. The gap surfaces weeks or months later when reports don&apos;t match expectations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; For every new business domain, create a conceptual model first. List the entities, name the relationships, and get business stakeholder sign-off before writing any SQL.&lt;/p&gt;
&lt;h2&gt;Mistake 4: Over-Normalizing for Analytics&lt;/h2&gt;
&lt;p&gt;Third Normal Form (3NF) is correct for transactional systems where writes are frequent and consistency matters. Applied to an analytics workload, it creates queries with 10-15 joins that run slowly and break easily.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Separate your transactional model from your analytical model. Keep the OLTP system in 3NF. Build a denormalized star schema (or a set of wide views) for analytics. Different workloads deserve different models.&lt;/p&gt;
&lt;h2&gt;Mistake 5: Under-Documenting&lt;/h2&gt;
&lt;p&gt;A data model without documentation is a puzzle that only its creator can solve. And even they forget the details after a few months.&lt;/p&gt;
&lt;p&gt;Without documentation, every new team member reverse-engineers the model from scratch. Every AI agent generates SQL based on guesses. Every analyst interprets column meanings differently, leading to metric discrepancies that take weeks to reconcile.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Document at three levels:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column level:&lt;/strong&gt; What does each column mean? Where does it come from?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table level:&lt;/strong&gt; What grain does this table use? Who maintains it?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model level:&lt;/strong&gt; How do tables connect? What business process does this model represent?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; make this practical with built-in Wikis for every dataset and Labels for classification (PII, Certified, Raw, Deprecated). The documentation lives next to the data, not in a separate spreadsheet that goes stale.&lt;/p&gt;
&lt;h2&gt;Mistake 6: One Model for Every Workload&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/10/one-model-problem.png&quot; alt=&quot;Single model struggling to serve transactions, analytics, and AI simultaneously&quot;&gt;&lt;/p&gt;
&lt;p&gt;A model designed for a transactional application doesn&apos;t serve analytics well. A model designed for analytics doesn&apos;t serve a machine learning feature store well. Trying to make one model serve every use case leads to compromises that serve no use case well.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Build purpose-specific models layered on top of shared source data. The Medallion Architecture does this naturally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bronze:&lt;/strong&gt; Raw data from sources (shared foundation)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silver:&lt;/strong&gt; Business logic layer (shared across analytics and ML)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gold:&lt;/strong&gt; Purpose-built views (one for dashboards, one for ML features, one for AI agents)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each Gold view is tailored to its consumer without duplicating the transformation logic in Silver.&lt;/p&gt;
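&lt;p&gt;The layering can be sketched with plain SQL views. This example runs SQLite through Python purely for illustration; the table names, columns, and business logic are hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(':memory:')
# Bronze: raw source data landed unchanged (hypothetical schema).
conn.execute('CREATE TABLE bronze_orders (id INTEGER, amt REAL, status TEXT)')
conn.executemany('INSERT INTO bronze_orders VALUES (?, ?, ?)',
                 [(1, 100.0, 'complete'), (2, 50.0, 'cancelled')])

# Silver: shared business logic, readable names, cancelled orders removed.
conn.execute(
    'CREATE VIEW silver_orders AS '
    'SELECT id AS order_id, amt AS revenue_usd '
    &quot;FROM bronze_orders WHERE status != 'cancelled'&quot;
)

# Gold: purpose-built aggregate for dashboards, layered on Silver so the
# business logic is defined exactly once.
conn.execute(
    'CREATE VIEW gold_revenue_daily AS '
    'SELECT SUM(revenue_usd) AS total_revenue FROM silver_orders'
)

total = conn.execute('SELECT total_revenue FROM gold_revenue_daily').fetchone()[0]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A second Gold view for ML features would select different columns from the same &lt;code&gt;silver_orders&lt;/code&gt; view, never re-implementing the cancelled-order filter.&lt;/p&gt;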
&lt;h2&gt;Mistake 7: Ignoring Governance&lt;/h2&gt;
&lt;p&gt;Data models don&apos;t exist in a vacuum. They contain PII, financial data, health records, and other sensitive information. Ignoring governance creates compliance risk and erodes trust.&lt;/p&gt;
&lt;p&gt;Common governance gaps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No access controls (everyone sees everything)&lt;/li&gt;
&lt;li&gt;No classification (no one knows which columns contain PII)&lt;/li&gt;
&lt;li&gt;No ownership (no one knows who to ask about table X)&lt;/li&gt;
&lt;li&gt;No lineage (no one knows where the data came from)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Integrate governance from day one:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tag columns by sensitivity (PII, financial, public)&lt;/li&gt;
&lt;li&gt;Assign ownership per table or domain&lt;/li&gt;
&lt;li&gt;Apply row and column-level access policies&lt;/li&gt;
&lt;li&gt;Document data lineage from source to consumption&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In Dremio, Fine-Grained Access Control enforces row and column-level policies, Labels classify datasets, and the Open Catalog tracks lineage. Governance is part of the platform, not an afterthought.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/10/modeling-cycle.png&quot; alt=&quot;Iterative data modeling cycle: design, document, measure, improve&quot;&gt;&lt;/p&gt;
&lt;p&gt;Pick one of these seven mistakes. Check whether your current data model has it. Fix it. Then move to the next one. Data modeling is iterative — no team gets it perfect on the first pass. The goal is not perfection but continuous improvement: clearer names, better documentation, tighter governance, and models that match what your consumers actually need.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Semantic Layer Best Practices: 7 Mistakes to Avoid</title><link>https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-best-practices/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-best-practices/</guid><description>
![Semantic layer best practices checklist — checks and mistakes](/assets/images/semantic_layer/10/best-practices.png)

Semantic layers don&apos;t fail bec...</description><pubDate>Wed, 18 Feb 2026 18:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/10/best-practices.png&quot; alt=&quot;Semantic layer best practices checklist — checks and mistakes&quot;&gt;&lt;/p&gt;
&lt;p&gt;Semantic layers don&apos;t fail because the technology is wrong. They fail because of design decisions made in the first two weeks — choices that seem reasonable at the time and create compounding problems for months afterward.&lt;/p&gt;
&lt;p&gt;Here are the seven mistakes that kill semantic layer projects, and how to avoid each one.&lt;/p&gt;
&lt;h2&gt;Mistake 1: Defining Metrics in Multiple Places&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Revenue is defined in a Tableau calculated field, a Power BI DAX measure, a dbt model, and a SQL view. Four sources of truth. None of them agree.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Teams adopt new tools without migrating metric definitions. Each tool gets its own model. Over time, the definitions drift.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Every metric gets exactly one canonical definition in the semantic layer. All downstream tools query that definition. No exceptions. When someone needs Revenue, they query &lt;code&gt;business.revenue&lt;/code&gt;, not their own formula.&lt;/p&gt;
&lt;p&gt;This principle extends to AI agents. If your AI generates its own metric formulas instead of referencing the semantic layer, you&apos;ve just added another source of truth — the least trustworthy one.&lt;/p&gt;
&lt;h2&gt;Mistake 2: Skipping the Bronze Layer&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: A data engineer creates a Silver view that joins raw source tables directly, mixing data cleanup (type casting, column renaming) with business logic (filters, calculations) in a single query. When the source schema changes — a column is renamed, a type is modified — the Silver view breaks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: The Bronze layer feels redundant. It&apos;s just a 1:1 mapping of the source. Why add a layer that doesn&apos;t change anything?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: The Bronze layer absorbs schema changes. When a source renames &lt;code&gt;col_7&lt;/code&gt; to &lt;code&gt;order_date_utc&lt;/code&gt;, you update one Bronze view. The Silver and Gold views above it don&apos;t change. This insulation is worth the tiny overhead of maintaining passthrough views.&lt;/p&gt;
&lt;p&gt;Bronze views also standardize data formats. Timestamps normalized to UTC. Strings cast to consistent encodings. Column names made human-readable. This cleanup happens once, at the bottom of the stack, and every view above benefits.&lt;/p&gt;
&lt;h2&gt;Mistake 3: Using SQL Reserved Words as Column Names&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/10/naming-conventions.png&quot; alt=&quot;Bad vs. good naming conventions — cryptic abbreviations vs. clear business names&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: A Bronze view exposes a column called &lt;code&gt;Date&lt;/code&gt;. Now every downstream query must reference &lt;code&gt;&amp;quot;Date&amp;quot;&lt;/code&gt; with double quotes. Analysts forget. AI agents don&apos;t quote it at all. Queries break intermittently. Debugging is frustrating because the error messages are cryptic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Source systems often use generic names. &lt;code&gt;Date&lt;/code&gt;, &lt;code&gt;Timestamp&lt;/code&gt;, &lt;code&gt;Order&lt;/code&gt;, &lt;code&gt;Group&lt;/code&gt;, &lt;code&gt;Role&lt;/code&gt; — all are SQL reserved words. Bronze views that don&apos;t rename them propagate the problem to every consumer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Rename early. In the Bronze layer, map &lt;code&gt;Date&lt;/code&gt; to &lt;code&gt;TransactionDate&lt;/code&gt;, &lt;code&gt;Timestamp&lt;/code&gt; to &lt;code&gt;EventTimestamp&lt;/code&gt;, &lt;code&gt;Order&lt;/code&gt; to &lt;code&gt;CustomerOrder&lt;/code&gt;. Use domain-specific prefixes that are unambiguous and never conflict with SQL keywords.&lt;/p&gt;
&lt;p&gt;This small decision saves hundreds of hours of debugging across the life of the semantic layer. It also dramatically improves AI agent accuracy, since language models generating SQL rarely add appropriate quoting for reserved words.&lt;/p&gt;
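&lt;p&gt;The rename is a one-line fix in the Bronze view. A small sketch using SQLite through Python (the source schema and names are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3

conn = sqlite3.connect(':memory:')
# The source column is named with a SQL reserved word, so every raw
# query must double-quote it (hypothetical source schema).
conn.execute('CREATE TABLE raw_transactions (&quot;Date&quot; TEXT, amount REAL)')
conn.execute(&quot;INSERT INTO raw_transactions VALUES ('2026-02-18', 42.0)&quot;)

# The Bronze view renames it once; every downstream consumer
# uses the safe, unquoted name.
conn.execute(
    'CREATE VIEW bronze_transactions AS '
    'SELECT &quot;Date&quot; AS TransactionDate, amount FROM raw_transactions'
)
row = conn.execute(
    'SELECT TransactionDate, amount FROM bronze_transactions'
).fetchone()
&lt;/code&gt;&lt;/pre&gt;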
&lt;h2&gt;Mistake 4: Building Without Stakeholder Input&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: A data engineering team builds 50 Silver views based on the database schema. They expose every table, every column, every possible metric. Business users look at the result, don&apos;t recognize any of the terms, and go back to their spreadsheets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Data engineers understand the schema. They assume the schema structure maps to business needs. It usually doesn&apos;t.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Start with a metric glossary co-created with stakeholders from Sales, Finance, Marketing, and Product. Ask them: What are your top 5 metrics? How do you calculate them? What decisions do they drive? Build the Silver layer around those answers, not around the database schema.&lt;/p&gt;
&lt;p&gt;This step feels slow, but it&apos;s the fastest path to adoption. A semantic layer that uses business language and models business concepts gets adopted. A semantic layer that mirrors the database schema gets ignored.&lt;/p&gt;
&lt;h2&gt;Mistake 5: Treating Documentation as Optional&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Views are created with no Wikis, no column descriptions, no Labels. The semantic layer works for the person who built it. Everyone else — analysts, AI agents, new team members — can&apos;t figure out what the views mean.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Documentation takes time. Deadlines are tight. Teams plan to &amp;quot;add documentation later.&amp;quot; Later never comes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Make documentation part of the view creation process, not a follow-up task. At minimum, every view gets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A one-sentence description of what it represents&lt;/li&gt;
&lt;li&gt;Labels for governance (PII, Finance, Certified)&lt;/li&gt;
&lt;li&gt;Column descriptions for any non-obvious field&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Modern platforms reduce this burden with AI-generated documentation. &lt;a href=&quot;https://www.dremio.com/blog/5-powerful-dremio-ai-features-you-should-be-using/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&apos;s generative AI&lt;/a&gt; samples table data and auto-generates Wiki descriptions and Label suggestions. The AI provides a 70% first draft. The data team adds domain context for the other 30%.&lt;/p&gt;
&lt;p&gt;Undocumented views are invisible to AI agents. If the Wiki is empty, the AI agent has no context to generate accurate SQL. Documentation isn&apos;t just nice to have. It&apos;s an accuracy requirement.&lt;/p&gt;
&lt;h2&gt;Mistake 6: Applying Security at the BI Tool Level Only&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: Row-level security is configured in Tableau so regional managers only see their region. Then an analyst opens a SQL client, queries the underlying table directly, and sees all regions. The security was enforced in the dashboard, not in the data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: BI tools make it easy to apply filters and security rules. Data platforms require more setup. Teams take the easy path.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Enforce access policies at the semantic layer, not the BI layer. Row-level security and column masking should be applied on the virtual datasets (views). Every query path — dashboard, notebook, API, AI agent — inherits the same rules.&lt;/p&gt;
&lt;p&gt;Dremio implements this through Fine-Grained Access Control (FGAC): policies defined as UDFs at the view level. A regional manager queries &lt;code&gt;business.revenue&lt;/code&gt; and automatically sees only their region, regardless of how they access the data. No security gaps between tools.&lt;/p&gt;
&lt;h2&gt;Mistake 7: Trying to Model Everything at Once&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/10/incremental-growth.png&quot; alt=&quot;Incremental growth — from a small core to a comprehensive semantic layer&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: The team commits to building a complete semantic layer covering every source, every table, and every metric. The project takes six months. By the time it launches, requirements have changed, stakeholder interest has waned, and half the views are out of date.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why it&apos;s common&lt;/strong&gt;: Ambitious leaders want a &amp;quot;complete&amp;quot; solution. Data teams want to avoid rework. Neither wants to ship an incomplete layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Start with 3-5 core metrics that the organization actively debates (usually Revenue, Active Users, Churn). Build one Bronze → Silver → Gold pipeline per metric. Validate that the same question produces the same answer across two different tools.&lt;/p&gt;
&lt;p&gt;Once those metrics are stable, expand incrementally. Add new sources, new views, new metrics — one at a time. Each addition is low-risk because the layered architecture isolates changes. A new Gold view doesn&apos;t affect existing Silver views.&lt;/p&gt;
&lt;p&gt;The fastest semantic layers reach 80% organizational coverage not by modeling everything up front, but by proving value quickly and expanding from momentum.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick one mistake from this list. Check whether your semantic layer (or your plan for one) is making it. Fix that one thing this week. Then come back for the next one.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Pipeline Observability: Know When Things Break</title><link>https://iceberglakehouse.com/posts/2026-02-debp-observability-monitoring/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-observability-monitoring/</guid><description>
![Pipeline observability dashboard showing metrics, logs, and data lineage](/assets/images/debp/09/observability-dashboard.png)

An analyst messages ...</description><pubDate>Wed, 18 Feb 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/09/observability-dashboard.png&quot; alt=&quot;Pipeline observability dashboard showing metrics, logs, and data lineage&quot;&gt;&lt;/p&gt;
&lt;p&gt;An analyst messages you on Slack: &amp;quot;The revenue numbers look wrong. Is the pipeline broken?&amp;quot; You check the orchestrator — all green. You check the target table — data loaded this morning. You check the row count — looks normal. Forty-five minutes later, you discover that a source API returned empty responses for one region, and the pipeline happily loaded zero rows for that region without alerting anyone.&lt;/p&gt;
&lt;p&gt;The pipeline succeeded. The data was wrong. No one knew until a human noticed.&lt;/p&gt;
&lt;p&gt;This is the cost of monitoring pipeline execution without monitoring pipeline output.&lt;/p&gt;
&lt;h2&gt;You Can&apos;t Fix What You Can&apos;t See&lt;/h2&gt;
&lt;p&gt;Traditional monitoring answers: did the job run? Did it succeed? How long did it take? These questions cover infrastructure health, not data health. A pipeline can execute perfectly — no errors, no retries, no timeouts — and still produce incorrect or incomplete data.&lt;/p&gt;
&lt;p&gt;Observability goes further. It answers: what did the pipeline process? How much? Was the data complete and correct? Is the output fresh? And when something is wrong, it provides enough context to diagnose the root cause without hunting through logs manually.&lt;/p&gt;
&lt;p&gt;The distinction matters. Monitoring tells you the pipeline ran. Observability tells you the pipeline worked.&lt;/p&gt;
&lt;h2&gt;The Three Pillars of Pipeline Observability&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Metrics.&lt;/strong&gt; Quantitative measurements collected at every pipeline stage: row counts, processing time, error rates, data freshness, resource utilization. Metrics are cheap to collect, easy to aggregate, and essential for dashboards and alerting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Logs.&lt;/strong&gt; Structured, timestamped records of what happened during execution. A useful log entry includes: pipeline name, stage name, batch ID, timestamp, action (started/completed/failed), row count, and any error message. Structured logs (JSON format) are searchable and parseable. Unstructured logs (&amp;quot;Processing data...&amp;quot;) are noise.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lineage.&lt;/strong&gt; The path data takes from source to destination, at the table or column level. Lineage answers: where did this number come from? If the source changes, what downstream tables and dashboards are affected? Lineage turns debugging from archaeology into graph traversal.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/09/three-pillars.png&quot; alt=&quot;Three pillars: metrics tracking counts and timing, logs recording execution details, lineage mapping data flow&quot;&gt;&lt;/p&gt;
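&lt;p&gt;A structured log entry is just a JSON object with consistent fields. A minimal Python sketch; the field set mirrors the list above, and the pipeline and stage names are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
from datetime import datetime, timezone

def log_event(pipeline, stage, batch_id, action, row_count=None, error=None):
    # One JSON object per event: searchable, parseable, filterable.
    entry = {
        'ts': datetime.now(timezone.utc).isoformat(),
        'pipeline': pipeline,
        'stage': stage,
        'batch_id': batch_id,
        'action': action,  # started / completed / failed
        'row_count': row_count,
        'error': error,
    }
    print(json.dumps(entry))
    return entry

entry = log_event('orders_daily', 'transform', '2026-02-18', 'completed',
                  row_count=120000)
&lt;/code&gt;&lt;/pre&gt;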
&lt;h2&gt;What to Measure&lt;/h2&gt;
&lt;p&gt;Not everything needs a metric. Measure what helps you answer these questions:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the data fresh?&lt;/strong&gt; Track the timestamp of the most recent row in each target table. Compare it to the expected freshness (e.g., less than 2 hours old). A freshness metric that exceeds its SLA triggers an alert before anyone opens a dashboard.&lt;/p&gt;
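&lt;p&gt;A freshness check is a subtraction and a comparison. A minimal Python sketch with an illustrative 2-hour SLA:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=2)  # illustrative SLA

def freshness_alert(latest_row_ts, now):
    # Lag between the newest loaded row and now; alert past the SLA.
    lag = now - latest_row_ts
    return lag &amp;gt; FRESHNESS_SLA, lag

now = datetime(2026, 2, 18, 12, 0, tzinfo=timezone.utc)
stale, lag = freshness_alert(
    datetime(2026, 2, 18, 8, 30, tzinfo=timezone.utc), now
)
&lt;/code&gt;&lt;/pre&gt;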
&lt;p&gt;&lt;strong&gt;Is the data complete?&lt;/strong&gt; Track row counts in vs. row counts out at each stage. A significant drop (e.g., input: 100,000 rows, output: 90,000 rows) means records were filtered, rejected, or lost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the data correct?&lt;/strong&gt; Track quality metrics: null rates, duplicate rates, range violation counts. Trend these over time. A gradual increase in null rates indicates a deteriorating source.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the pipeline healthy?&lt;/strong&gt; Track execution time per stage. A stage that normally takes 5 minutes but now takes 50 minutes may indicate data volume growth, resource contention, or a bad query plan.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Is the pipeline meeting SLAs?&lt;/strong&gt; Define when data must be available (e.g., daily tables loaded by 6 AM). Track SLA compliance as a percentage. 95% compliance sounds healthy until you restate it: for a daily pipeline, that is one failed delivery to consumers every 20 days.&lt;/p&gt;
&lt;h2&gt;Alerting Without Alert Fatigue&lt;/h2&gt;
&lt;p&gt;Alert fatigue is the most common reason observability fails. Too many alerts and the on-call engineer starts ignoring them. Too few and real problems go unnoticed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Alert on business impact, not on every error.&lt;/strong&gt; A transient retry is not an alert. A pipeline that misses its SLA by an hour is. A single null row is not an alert. A null rate jumping from 0.1% to 15% is.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use severity levels.&lt;/strong&gt; Critical: data consumers are affected now (missed SLA, empty output). Warning: something is degrading but not yet impacting consumers (execution time increasing, row count declining). Info: notable but non-actionable (successful backfill, schema migration completed).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Set thresholds dynamically.&lt;/strong&gt; Static thresholds (&amp;quot;alert if row count &amp;lt; 10,000&amp;quot;) break when data naturally grows or shrinks. Use rolling baselines: alert if today&apos;s row count deviates by more than 20% from the 7-day average.&lt;/p&gt;
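&lt;p&gt;A rolling baseline is straightforward to express in SQL. A sketch (table, column, and date arithmetic are illustrative and dialect-dependent):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Returns a row when today deviates more than 20% from the 7-day average
WITH daily AS (
    SELECT load_date, COUNT(*) AS row_count
    FROM fct_orders
    GROUP BY load_date
),
baseline AS (
    SELECT AVG(row_count) AS avg_rows
    FROM daily
    WHERE load_date BETWEEN CURRENT_DATE - 7 AND CURRENT_DATE - 1
)
SELECT d.row_count, b.avg_rows
FROM daily d
CROSS JOIN baseline b
WHERE d.load_date = CURRENT_DATE
  AND ABS(d.row_count - b.avg_rows) &amp;gt; 0.20 * b.avg_rows;
&lt;/code&gt;&lt;/pre&gt;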
&lt;p&gt;&lt;strong&gt;Route alerts effectively.&lt;/strong&gt; Critical alerts go to PagerDuty or on-call channels. Warnings go to team Slack channels. Info goes to logs-only. Don&apos;t send everything to the same channel.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/09/alert-levels.png&quot; alt=&quot;Alert severity levels: critical triggers pages, warning goes to channel, info logged&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Data Lineage for Impact Analysis&lt;/h2&gt;
&lt;p&gt;When a problem occurs, the first question is: what&apos;s affected? Lineage answers this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Upstream analysis.&lt;/strong&gt; A dashboard shows wrong numbers. Lineage traces the dashboard back through the serving table, the transformation, the staging table, and the raw source. The break is visible in the graph.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Downstream impact analysis.&lt;/strong&gt; A source system announces a schema change. Lineage shows every table, model, and dashboard that depends on that source. You know the blast radius before making any changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column-level lineage.&lt;/strong&gt; Table-level lineage shows connections between tables. Column-level lineage shows which source column feeds which target column. This level of detail cuts a &amp;quot;the revenue is wrong&amp;quot; investigation from hours to minutes.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Add freshness tracking to your three most critical tables: record the max event timestamp after each load and alert when it exceeds the SLA. This single metric — data freshness — catches more problems than any other observability signal.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Vault Modeling: Hubs, Links, and Satellites</title><link>https://iceberglakehouse.com/posts/2026-02-dm-data-vault-modeling/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-data-vault-modeling/</guid><description>
![Data Vault model showing Hubs, Links, and Satellites as interconnected components](/assets/images/data_modeling/09/data-vault-overview.png)

Dimens...</description><pubDate>Wed, 18 Feb 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/09/data-vault-overview.png&quot; alt=&quot;Data Vault model showing Hubs, Links, and Satellites as interconnected components&quot;&gt;&lt;/p&gt;
&lt;p&gt;Dimensional modeling works well when your source systems are stable and your business questions are predictable. But what happens when sources change constantly, new systems get added every quarter, and regulatory requirements demand a full audit trail of every attribute change?&lt;/p&gt;
&lt;p&gt;Data Vault modeling was designed for exactly this scenario. Created by Dan Linstedt, it separates data into three distinct table types — Hubs, Links, and Satellites — each handling a different concern: identity, relationships, and descriptive context.&lt;/p&gt;
&lt;h2&gt;What Problem Data Vault Solves&lt;/h2&gt;
&lt;p&gt;Traditional dimensional models embed everything about a business entity in one dimension table. A &lt;code&gt;dim_customers&lt;/code&gt; table contains the customer ID, name, address, segment, acquisition channel, and lifetime value. When a new source system provides additional customer attributes, you add columns to &lt;code&gt;dim_customers&lt;/code&gt;. When business rules change how &amp;quot;segment&amp;quot; is calculated, you update the ETL pipeline that populates that table.&lt;/p&gt;
&lt;p&gt;Over time, these dimension tables become fragile. They depend on multiple source systems. A change in one source breaks the ETL. Schema changes require coordinated updates across pipelines, tables, and downstream reports.&lt;/p&gt;
&lt;p&gt;Data Vault solves this by decomposing entities into independent components that can evolve separately.&lt;/p&gt;
&lt;h2&gt;The Three Building Blocks&lt;/h2&gt;
&lt;h3&gt;Hubs: Business Identity&lt;/h3&gt;
&lt;p&gt;A Hub stores unique business keys — the identifiers that define a business entity regardless of which source system provides them.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE hub_customer (
    customer_hash_key BINARY(32),  -- Hash of the business key
    customer_id VARCHAR(50),        -- Natural business key
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Hubs are immutable. Once a business key is loaded, it never changes. A customer who has &lt;code&gt;customer_id = &apos;C-1042&apos;&lt;/code&gt; always has that key. Hubs answer the question: &lt;em&gt;What business concepts exist?&lt;/em&gt;&lt;/p&gt;
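&lt;p&gt;Loading a Hub is an insert-if-new operation. A sketch of one common pattern (the &lt;code&gt;SHA2&lt;/code&gt; function and the staging table name are illustrative and platform-specific):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Insert only business keys not already present in the Hub
INSERT INTO hub_customer (customer_hash_key, customer_id, load_date, record_source)
SELECT DISTINCT
    SHA2(stg.customer_id, 256),
    stg.customer_id,
    CURRENT_TIMESTAMP,
    &apos;crm_system&apos;
FROM stg_customers stg
LEFT JOIN hub_customer h ON h.customer_id = stg.customer_id
WHERE h.customer_id IS NULL;
&lt;/code&gt;&lt;/pre&gt;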
&lt;h3&gt;Links: Relationships&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/09/hub-link-relationship.png&quot; alt=&quot;Hubs connected by Link tables representing relationships between business entities&quot;&gt;&lt;/p&gt;
&lt;p&gt;A Link stores relationships between Hubs. Every relationship — customer-to-order, order-to-product, employee-to-department — gets its own Link table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE link_customer_order (
    link_hash_key BINARY(32),
    customer_hash_key BINARY(32),
    order_hash_key BINARY(32),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Links are also immutable. Once a relationship is recorded, it stays. Links support many-to-many relationships by default. They answer the question: &lt;em&gt;How are business concepts related?&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Satellites: Descriptive Context&lt;/h3&gt;
&lt;p&gt;Satellites store the descriptive attributes of a Hub or Link, along with their change history.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE sat_customer_details (
    customer_hash_key BINARY(32),
    effective_date TIMESTAMP,
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city VARCHAR(100),
    segment VARCHAR(50),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every time an attribute changes, a new Satellite row is inserted. This is equivalent to SCD Type 2 — full history is preserved without modifying existing rows. Different source systems can feed different Satellites for the same Hub, allowing attributes to arrive independently.&lt;/p&gt;
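&lt;p&gt;Change detection compares the incoming attributes against the most recent Satellite row (production implementations usually compare a single hash of the attributes, called a &amp;quot;hashdiff&amp;quot;). A simplified sketch with illustrative staging table and functions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Insert a new Satellite row only for new or changed customers
INSERT INTO sat_customer_details
SELECT stg.customer_hash_key, CURRENT_TIMESTAMP, stg.customer_name,
       stg.email, stg.city, stg.segment, CURRENT_TIMESTAMP, &apos;crm_system&apos;
FROM stg_customers stg
LEFT JOIN (
    SELECT customer_hash_key, customer_name, email, city, segment,
           ROW_NUMBER() OVER (PARTITION BY customer_hash_key
                              ORDER BY effective_date DESC) AS rn
    FROM sat_customer_details
) cur
  ON cur.customer_hash_key = stg.customer_hash_key AND cur.rn = 1
WHERE cur.customer_hash_key IS NULL  -- brand-new customer
   OR CONCAT_WS(&apos;|&apos;, stg.customer_name, stg.email, stg.city, stg.segment)
      &amp;lt;&amp;gt; CONCAT_WS(&apos;|&apos;, cur.customer_name, cur.email, cur.city, cur.segment);
&lt;/code&gt;&lt;/pre&gt;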
&lt;h2&gt;How a Data Vault Query Works&lt;/h2&gt;
&lt;p&gt;To reconstruct a business entity (like a current customer profile), you join the Hub to its current Satellite rows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
    h.customer_id,
    s.customer_name,
    s.email,
    s.city,
    s.segment
FROM hub_customer h
JOIN sat_customer_details s ON h.customer_hash_key = s.customer_hash_key
WHERE s.effective_date = (
    SELECT MAX(effective_date)
    FROM sat_customer_details s2
    WHERE s2.customer_hash_key = s.customer_hash_key
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is more complex than querying &lt;code&gt;dim_customers&lt;/code&gt; directly. That complexity is the primary criticism of Data Vault. In practice, teams build a presentation layer — star schema views on top of the vault — for business users and BI tools.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; make this practical. The raw vault tables live in the Bronze layer. Silver-layer views reconstruct business entities by joining Hubs, Links, and Satellites. Gold-layer views present dimensional star schemas for dashboards and AI agents. Users never query the vault tables directly.&lt;/p&gt;
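&lt;p&gt;A presentation-layer view can wrap exactly the query shown earlier, so BI tools see a familiar dimensional shape:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Star-schema style dimension built on top of the vault
CREATE VIEW dim_customers AS
SELECT h.customer_id, s.customer_name, s.email, s.city, s.segment
FROM hub_customer h
JOIN sat_customer_details s ON h.customer_hash_key = s.customer_hash_key
WHERE s.effective_date = (
    SELECT MAX(effective_date)
    FROM sat_customer_details s2
    WHERE s2.customer_hash_key = s.customer_hash_key
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Consumers query &lt;code&gt;dim_customers&lt;/code&gt; like any dimension table while the vault underneath retains full history.&lt;/p&gt;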
&lt;h2&gt;When Data Vault Fits&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Multiple source systems that change frequently.&lt;/strong&gt; Adding a new source means adding new Satellites — not redesigning existing tables. The Hub and Link structure remains stable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Regulated industries requiring full audit trails.&lt;/strong&gt; Financial services, healthcare, and government often need to prove what data looked like at any point in time. Satellites provide that out of the box.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Large enterprises with parallel development teams.&lt;/strong&gt; Hubs, Links, and Satellites can be loaded independently, enabling parallel ETL development without pipeline conflicts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Long-term data warehouses with decades of history.&lt;/strong&gt; The separation of structure (Hubs, Links) from content (Satellites) makes the vault resilient to business changes over time.&lt;/p&gt;
&lt;h2&gt;When Data Vault Doesn&apos;t Fit&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Small teams or simple source environments.&lt;/strong&gt; If you have five source tables and one BI tool, Data Vault adds complexity without proportional benefit. A star schema is faster to build and easier to maintain.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Direct BI tool access.&lt;/strong&gt; BI tools don&apos;t speak Data Vault natively. You always need a presentation layer on top, which means building two models instead of one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Speed-to-value projects.&lt;/strong&gt; When the goal is &amp;quot;get a dashboard live this sprint,&amp;quot; Data Vault&apos;s up-front design work slows you down.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Data Vault&lt;/th&gt;
&lt;th&gt;Dimensional Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Source flexibility&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Optional (SCDs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query simplicity&lt;/td&gt;
&lt;td&gt;Low (needs presentation layer)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning curve&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adding new sources&lt;/td&gt;
&lt;td&gt;Easy (new satellites)&lt;/td&gt;
&lt;td&gt;Harder (redesign dimensions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BI tool compatibility&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/09/data-vault-presentation.png&quot; alt=&quot;Presentation layer of star schema views built on top of a Data Vault foundation&quot;&gt;&lt;/p&gt;
&lt;p&gt;If you&apos;re evaluating Data Vault, start by counting your source systems and estimating how often they change schema. If the answer is &amp;quot;more than five sources&amp;quot; and &amp;quot;at least once a quarter,&amp;quot; Data Vault&apos;s separation of concerns will likely save you from painful redesign cycles. If your environment is simpler than that, a well-designed dimensional model will get you to production faster.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How a Self-Documenting Semantic Layer Reduces Data Team Toil</title><link>https://iceberglakehouse.com/posts/2026-02-sl-self-documenting-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-self-documenting-semantic-layer/</guid><description>
![Self-documenting semantic layer — AI generating descriptions and labels automatically](/assets/images/semantic_layer/09/self-documenting.png)

Ever...</description><pubDate>Wed, 18 Feb 2026 17:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/09/self-documenting.png&quot; alt=&quot;Self-documenting semantic layer — AI generating descriptions and labels automatically&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every data team knows documentation is important. And almost every data team has a backlog of undocumented tables, unlabeled columns, and outdated descriptions that nobody has time to fix. The problem isn&apos;t motivation. It&apos;s that manual documentation doesn&apos;t scale.&lt;/p&gt;
&lt;p&gt;A self-documenting semantic layer changes the equation. Instead of asking humans to describe every column in every table, the platform generates descriptions automatically, suggests governance labels from data patterns, and propagates context through the view chain. Documentation becomes a byproduct of building the semantic layer, not a separate project.&lt;/p&gt;
&lt;h2&gt;The Documentation Problem Nobody Solves&lt;/h2&gt;
&lt;p&gt;Industry surveys consistently find that 70% or more of enterprise data assets are undocumented or poorly documented. The result: analysts spend 30-40% of their time searching for data and trying to understand what it means before they can start analyzing it.&lt;/p&gt;
&lt;p&gt;This isn&apos;t just a productivity problem. Undocumented data is a governance risk. A column named &lt;code&gt;status&lt;/code&gt; with values 0, 1, 2, and 3 could mean anything. An analyst guesses. An AI agent guesses worse. Nobody verifies. The wrong assumptions get baked into dashboards that drive business decisions.&lt;/p&gt;
&lt;p&gt;Data teams respond with documentation sprints. They burn a week writing wiki pages for their top 50 tables. Two months later, half the descriptions are outdated because schemas have changed. The cycle repeats.&lt;/p&gt;
&lt;h2&gt;What Self-Documenting Actually Means&lt;/h2&gt;
&lt;p&gt;A self-documenting semantic layer generates and maintains documentation with minimal human effort. Three mechanisms work together:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI-generated descriptions&lt;/strong&gt;: The platform samples data in a table and generates human-readable descriptions for each column and the table itself.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automated label suggestions&lt;/strong&gt;: The platform analyzes column names, data types, and value patterns to suggest governance labels (PII, Finance, Certified).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata propagation&lt;/strong&gt;: When a Silver view references a Bronze view, column descriptions flow downstream automatically. Documentation written once at the Bronze level appears everywhere the column is used.&lt;/p&gt;
&lt;p&gt;Human oversight is still essential. AI provides a 70% first draft. Data engineers add the domain-specific context that only they know: business rules, edge cases, known data quality issues. The point isn&apos;t to eliminate human documentation. It&apos;s to eliminate the blank page.&lt;/p&gt;
&lt;h2&gt;AI-Generated Descriptions&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/09/ai-doc-generation.png&quot; alt=&quot;AI scanning data tables and generating documentation automatically&quot;&gt;&lt;/p&gt;
&lt;p&gt;Modern semantic layer platforms can sample a table&apos;s data and generate meaningful descriptions automatically.&lt;/p&gt;
&lt;p&gt;Consider a column named &lt;code&gt;cltv&lt;/code&gt; in a table called &lt;code&gt;customers&lt;/code&gt;. The AI samples values (1200.50, 3400.00, 780.25), examines the column name and table context, and generates:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;cltv&lt;/strong&gt;: Customer Lifetime Value in USD. Represents the total revenue attributed to this customer from their first purchase to the current date, excluding refunded transactions.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Not every generated description will be this precise. But most are useful enough to replace the current state: an empty description that tells the analyst nothing.&lt;/p&gt;
&lt;p&gt;More examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A column with values &amp;quot;US&amp;quot;, &amp;quot;UK&amp;quot;, &amp;quot;DE&amp;quot; → &amp;quot;ISO 3166 alpha-2 country code for the customer&apos;s billing address&amp;quot;&lt;/li&gt;
&lt;li&gt;A DATE column named &lt;code&gt;created_at&lt;/code&gt; in a &lt;code&gt;subscriptions&lt;/code&gt; table → &amp;quot;Date the subscription was created&amp;quot;&lt;/li&gt;
&lt;li&gt;A FLOAT column named &lt;code&gt;mrr&lt;/code&gt; → &amp;quot;Monthly Recurring Revenue in the account&apos;s base currency&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Automated Label Suggestions&lt;/h2&gt;
&lt;p&gt;Labels categorize data for governance and discovery. Manually tagging every column in a data warehouse with hundreds of tables is impractical. AI-based label suggestion makes it manageable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Columns containing email-like patterns (text with @ symbols) → suggested label: &lt;strong&gt;PII&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Columns with phone number patterns → suggested label: &lt;strong&gt;PII&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Columns named &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;total&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, &lt;code&gt;revenue&lt;/code&gt; → suggested label: &lt;strong&gt;Finance&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Columns in tables marked &amp;quot;Certified&amp;quot; → suggested label propagated to downstream views&lt;/li&gt;
&lt;/ul&gt;
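&lt;p&gt;Pattern-based suggestion can be as simple as measuring how many sampled values match a shape. A sketch for email-like detection (table and column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Share of sampled values that look like email addresses
SELECT SUM(CASE WHEN contact_value LIKE &apos;%_@_%._%&apos; THEN 1 ELSE 0 END) * 100.0
       / COUNT(*) AS pct_email_like
FROM customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A high share (say, above 90%) is a strong signal to suggest the &lt;strong&gt;PII&lt;/strong&gt; label for human review.&lt;/p&gt;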
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/5-powerful-dremio-ai-features-you-should-be-using/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&apos;s approach&lt;/a&gt; combines these suggestions with human approval. The AI proposes labels. A data engineer reviews and accepts or rejects. Over time, the catalog fills up with accurate, useful labels without dedicated labeling sprints.&lt;/p&gt;
&lt;h2&gt;Metadata Propagation Through Views&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/09/metadata-propagation.png&quot; alt=&quot;Metadata flowing through Bronze, Silver, and Gold view layers&quot;&gt;&lt;/p&gt;
&lt;p&gt;In a well-designed semantic layer, documentation shouldn&apos;t need to be written more than once. The Bronze-Silver-Gold view architecture creates a natural propagation path:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Bronze layer&lt;/strong&gt;: Document the &lt;code&gt;CustomerID&lt;/code&gt; column as &amp;quot;Unique identifier for the customer, sourced from the CRM system.&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silver layer&lt;/strong&gt;: A Silver view references &lt;code&gt;CustomerID&lt;/code&gt;. The description propagates automatically. No re-documentation needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gold layer&lt;/strong&gt;: An aggregated Gold view groups by &lt;code&gt;CustomerID&lt;/code&gt;. The description carries through.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This propagation is especially valuable for join columns, filter columns, and commonly used dimensions that appear in dozens of views. Write the description once at the source, and it follows the column everywhere.&lt;/p&gt;
&lt;h2&gt;How This Reduces Toil&lt;/h2&gt;
&lt;p&gt;The impact on data team productivity is measurable:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Documentation Task&lt;/th&gt;
&lt;th&gt;Manual Approach&lt;/th&gt;
&lt;th&gt;Self-Documenting&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Column descriptions&lt;/td&gt;
&lt;td&gt;Write each by hand&lt;/td&gt;
&lt;td&gt;AI generates draft, human refines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance labels&lt;/td&gt;
&lt;td&gt;Manual tagging sprint&lt;/td&gt;
&lt;td&gt;AI suggests from data patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Downstream view docs&lt;/td&gt;
&lt;td&gt;Re-write for each view&lt;/td&gt;
&lt;td&gt;Propagated from upstream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema change updates&lt;/td&gt;
&lt;td&gt;Manually check and update&lt;/td&gt;
&lt;td&gt;AI re-scans and flags changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New table onboarding&lt;/td&gt;
&lt;td&gt;Create from scratch&lt;/td&gt;
&lt;td&gt;AI generates baseline immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The net effect: documentation coverage goes from 30% (what the team could manage manually) to 80-90% (AI baseline + human refinement). The team spends hours instead of weeks on documentation. And the documentation stays current because the AI can re-scan when schemas change — flagging outdated descriptions instead of waiting for someone to notice.&lt;/p&gt;
&lt;p&gt;For AI agents, this improvement is material. A richer, more accurate semantic layer means the AI generates better SQL, hallucinates less, and requires fewer corrections. Self-documentation isn&apos;t just a productivity feature. It&apos;s an AI accuracy feature.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick your most-used table. Open it in your data platform. How many columns have descriptions? How many have governance labels? If the answer is &amp;quot;not many,&amp;quot; calculate how long it would take to document the entire table manually. Then consider a platform that does 70% of that work for you.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Testing Data Pipelines: What to Validate and When</title><link>https://iceberglakehouse.com/posts/2026-02-debp-testing-data-pipelines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-testing-data-pipelines/</guid><description>
![Data pipeline testing pyramid with schema tests at the base, contract tests in the middle, and regression tests at the top](/assets/images/debp/08/...</description><pubDate>Wed, 18 Feb 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/08/testing-pyramid.png&quot; alt=&quot;Data pipeline testing pyramid with schema tests at the base, contract tests in the middle, and regression tests at the top&quot;&gt;&lt;/p&gt;
&lt;p&gt;Ask an application developer how they test their code and they&apos;ll describe unit tests, integration tests, CI/CD pipelines, and coverage metrics. Ask a data engineer the same question and the most common answer is: &amp;quot;we check the dashboard.&amp;quot;&lt;/p&gt;
&lt;p&gt;Data pipelines are software. They have inputs, logic, and outputs. They can have bugs. They can break silently. And unlike application bugs that trigger error pages, data bugs produce numbers that look plausible — until someone makes a business decision based on them.&lt;/p&gt;
&lt;h2&gt;Pipelines Are Software — They Need Tests&lt;/h2&gt;
&lt;p&gt;The bar for data pipeline testing shouldn&apos;t be lower than for application code. If anything, it should be higher. Application bugs are usually visible (broken UI, failed request). Data bugs are invisible (wrong aggregation, missing rows, stale values) and their impact compounds over time.&lt;/p&gt;
&lt;p&gt;Yet most data teams have no automated tests. They rely on manual spot-checks, analyst complaints, and hope. Testing a pipeline means catching problems before they reach consumers, not after.&lt;/p&gt;
&lt;h2&gt;The Testing Pyramid for Data&lt;/h2&gt;
&lt;p&gt;Borrow the testing pyramid from software engineering and adapt it for data:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Base: Schema and contract tests.&lt;/strong&gt; Fast, cheap, run on every pipeline execution. Does the output schema match what consumers expect? Do required columns exist? Are data types correct? These tests catch structural problems (dropped columns, type changes) immediately.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Middle: Data validation tests.&lt;/strong&gt; Check the values in the output. Are primary keys unique? Are required columns non-null? Do amounts, dates, and counts fall within valid ranges? These tests catch quality problems (duplicates, nulls, outliers) before they propagate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Top: Regression and integration tests.&lt;/strong&gt; Compare today&apos;s output to historical patterns. Did the row count change dramatically? Did the total revenue shift by more than 10%? These tests catch subtle logic errors and upstream data changes.&lt;/p&gt;
&lt;p&gt;Run more tests at the base (they&apos;re cheap and fast) and fewer at the top (they&apos;re expensive but comprehensive).&lt;/p&gt;
&lt;h2&gt;Schema and Contract Tests&lt;/h2&gt;
&lt;p&gt;Schema tests are the simplest and most impactful place to start. After every pipeline run, verify:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column existence.&lt;/strong&gt; Every expected column is present in the output. If a transformation accidentally drops a column, you want to know immediately — not when a downstream query fails.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data types.&lt;/strong&gt; Columns have their expected types. A revenue column that silently became a string will pass a NULL check but break calculations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Not-null constraints.&lt;/strong&gt; Required columns contain no nulls. An order table where &lt;code&gt;customer_id&lt;/code&gt; is null means the join to the customer table will silently lose rows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Uniqueness.&lt;/strong&gt; Primary key columns have no duplicates. Duplicate order IDs mean double-counted revenue.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Example schema and contract tests
-- Check for unexpected nulls
SELECT COUNT(*) AS null_count
FROM orders
WHERE order_id IS NULL OR customer_id IS NULL;

-- Check for duplicates
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) &amp;gt; 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/08/schema-tests.png&quot; alt=&quot;Schema test examples: column existence, type validation, null checks, uniqueness checks&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Runtime Data Validation&lt;/h2&gt;
&lt;p&gt;Schema tests verify structure. Data validation tests verify content. Run these after every pipeline execution, before marking the job as successful:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Range checks.&lt;/strong&gt; Numeric values fall within expected bounds. An order total of -$500 or $999,999,999 is likely a bug. Define acceptable ranges per column and flag outliers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Referential integrity.&lt;/strong&gt; Foreign keys reference existing records. An order with &lt;code&gt;product_id = 12345&lt;/code&gt; should correspond to a row in the products table. Missing references indicate either missing data or a pipeline timing issue.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Freshness checks.&lt;/strong&gt; The most recent event timestamp is within the expected window. If a daily pipeline&apos;s output contains no events from today, something went wrong — even if the job succeeded.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Volume checks.&lt;/strong&gt; Row counts fall within historical norms. A daily feed that normally produces 50,000 rows but arrives with 500 should trigger an alert. Use percentage thresholds (±20% from the trailing 7-day average) to avoid false positives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Custom business rules.&lt;/strong&gt; Domain-specific assertions. &amp;quot;Every invoice must have at least one line item.&amp;quot; &amp;quot;No employee should have a start date in the future.&amp;quot; These rules encode business knowledge that generic tests can&apos;t capture.&lt;/p&gt;
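&lt;p&gt;Several of these checks fit in one query each. Two sketches with illustrative table and column names:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Referential integrity: orders pointing at products that don&apos;t exist
SELECT o.order_id, o.product_id
FROM orders o
LEFT JOIN products p ON p.product_id = o.product_id
WHERE p.product_id IS NULL;

-- Freshness: returns a row if no events from today arrived
SELECT COUNT(*) AS todays_rows
FROM orders
WHERE event_date = CURRENT_DATE
HAVING COUNT(*) = 0;
&lt;/code&gt;&lt;/pre&gt;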
&lt;h2&gt;Regression and Anomaly Detection&lt;/h2&gt;
&lt;p&gt;Regression tests compare today&apos;s output to historical baselines:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Aggregate comparison.&lt;/strong&gt; Compare key metrics (total revenue, row count, distinct customer count) against the previous run. Deviations beyond a threshold (e.g., ±15%) may indicate an upstream change, a bug in new transformation logic, or missing source data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Distribution checks.&lt;/strong&gt; Compare the distribution of categorical columns (status values, country codes) against historical norms. A sudden spike in &amp;quot;unknown&amp;quot; status may indicate a schema change in the source.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Trend analysis.&lt;/strong&gt; Track metrics over time. A gradual decline in row count over weeks may indicate a leak that daily checks miss.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/08/regression-testing.png&quot; alt=&quot;Regression testing: comparing aggregates, distributions, and trends over time&quot;&gt;&lt;/p&gt;
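&lt;p&gt;An aggregate comparison can be a single query against the previous day&apos;s run (names, thresholds, and date arithmetic are illustrative and dialect-dependent):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Returns a row when day-over-day revenue shifts by more than 15%
WITH daily AS (
    SELECT load_date, SUM(order_total) AS revenue
    FROM orders
    GROUP BY load_date
)
SELECT t.revenue AS today, y.revenue AS yesterday
FROM daily t
JOIN daily y ON y.load_date = t.load_date - 1
WHERE t.load_date = CURRENT_DATE
  AND ABS(t.revenue - y.revenue) &amp;gt; 0.15 * y.revenue;
&lt;/code&gt;&lt;/pre&gt;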
&lt;p&gt;Regression tests are more expensive to maintain because they require historical baselines and threshold tuning. Start simple (row count ± 20%) and refine as you learn what normal looks like.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Add three tests to your most critical pipeline today: a uniqueness check on the primary key, a null check on required columns, and a row count comparison against yesterday&apos;s output. Run them after every pipeline execution. These three tests alone will catch the majority of data problems before they reach consumers.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Denormalization: When and Why to Flatten Your Data</title><link>https://iceberglakehouse.com/posts/2026-02-dm-denormalization-when-why/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-denormalization-when-why/</guid><description>
![Normalized model with many interconnected tables vs. denormalized wide flat table](/assets/images/data_modeling/08/denormalization-overview.png)

N...</description><pubDate>Wed, 18 Feb 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/08/denormalization-overview.png&quot; alt=&quot;Normalized model with many interconnected tables vs. denormalized wide flat table&quot;&gt;&lt;/p&gt;
&lt;p&gt;Normalization is the first rule taught in database design. Eliminate redundancy. Store each fact once. Use foreign keys. It&apos;s the right rule for transactional systems. And it&apos;s the wrong rule for most analytics workloads.&lt;/p&gt;
&lt;p&gt;Denormalization is the deliberate introduction of redundancy into your data model to reduce joins and speed up queries. Done poorly, it creates a maintenance nightmare. Done well, it turns slow dashboards into fast ones and makes your data accessible to analysts and AI agents who can&apos;t write 12-table joins.&lt;/p&gt;
&lt;h2&gt;What Normalization Gives You (and What It Costs)&lt;/h2&gt;
&lt;p&gt;Normalization (Third Normal Form and beyond) organizes data so that each piece of information exists in exactly one place. A customer&apos;s city lives in the customers table. An order&apos;s product lives in the order_items table joined to the products table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What normalization gives you:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No update anomalies (change a city in one row, not thousands)&lt;/li&gt;
&lt;li&gt;Smaller storage footprint (no duplicated data)&lt;/li&gt;
&lt;li&gt;Strong data integrity (constraints enforced at the schema level)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;What normalization costs you:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;More joins per query (a report might join 10-15 tables)&lt;/li&gt;
&lt;li&gt;Slower read performance (each join adds latency)&lt;/li&gt;
&lt;li&gt;More complex SQL (longer queries, more error-prone)&lt;/li&gt;
&lt;li&gt;Harder self-service (analysts struggle with multi-join queries)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For an OLTP system processing 10,000 inserts per second, normalization is correct. For an OLAP system answering &amp;quot;revenue by region by quarter,&amp;quot; it&apos;s a performance bottleneck.&lt;/p&gt;
&lt;h2&gt;What Denormalization Actually Means&lt;/h2&gt;
&lt;p&gt;Denormalization takes several forms:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Embedding dimension attributes in fact tables.&lt;/strong&gt; Instead of joining &lt;code&gt;orders → customers&lt;/code&gt; to get the customer name, include &lt;code&gt;customer_name&lt;/code&gt; directly in the orders table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pre-joining lookup tables.&lt;/strong&gt; Instead of maintaining separate &lt;code&gt;cities&lt;/code&gt;, &lt;code&gt;states&lt;/code&gt;, and &lt;code&gt;countries&lt;/code&gt; tables, resolve the lookups once and store the results as columns such as &lt;code&gt;customer_city&lt;/code&gt;, &lt;code&gt;customer_state&lt;/code&gt;, and &lt;code&gt;customer_country&lt;/code&gt; directly on the customer record.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Adding calculated columns.&lt;/strong&gt; Instead of computing &lt;code&gt;quantity × price × (1 - discount)&lt;/code&gt; in every query, store &lt;code&gt;net_revenue&lt;/code&gt; as a pre-computed column.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Creating wide summary tables.&lt;/strong&gt; Instead of joining across 8 tables for a monthly report, create a &lt;code&gt;monthly_summary&lt;/code&gt; table with all needed columns in one place.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/08/denormalization-techniques.png&quot; alt=&quot;Denormalization techniques: embedding, pre-joining, calculating, and flattening into wide tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;The key insight: denormalization trades write-time simplicity for read-time simplicity. Updating a customer&apos;s city now requires updating it in multiple places. But querying revenue by city no longer requires a join.&lt;/p&gt;
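&lt;p&gt;These techniques often combine into a single build statement. A sketch with illustrative names, embedding a dimension attribute and storing a pre-computed measure:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE fact_orders_wide AS
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,           -- embedded dimension attribute
    c.city AS customer_city,   -- pre-joined lookup value
    o.quantity * o.unit_price * (1 - o.discount) AS net_revenue  -- pre-computed column
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
&lt;/code&gt;&lt;/pre&gt;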
&lt;h2&gt;When to Denormalize&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Analytics and reporting workloads.&lt;/strong&gt; If your model primarily serves dashboards, reports, and ad-hoc queries, denormalization reduces query time and complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-service environments.&lt;/strong&gt; Business users selecting fields in a BI tool get better results from a wide, flat table than from a web of normalized tables they don&apos;t understand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI-driven queries.&lt;/strong&gt; When an AI agent generates SQL, fewer tables and fewer joins reduce the chance of wrong join conditions and hallucinated relationships.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Read-heavy, write-light patterns.&lt;/strong&gt; If your data loads once a day (batch ETL) and gets queried thousands of times, optimizing for reads makes sense.&lt;/p&gt;
&lt;h2&gt;When NOT to Denormalize&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;High-frequency transactional writes.&lt;/strong&gt; If your system processes real-time inserts and updates, denormalized redundancy creates update anomalies. A customer moving to a new city means updating hundreds of order rows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When consistency matters more than speed.&lt;/strong&gt; Financial systems with audit requirements often need the strict integrity that normalization provides.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Small datasets.&lt;/strong&gt; If the query joins 5 tables with 1,000 rows each, denormalization won&apos;t improve performance noticeably. The overhead of redundancy isn&apos;t worth the marginal speed gain.&lt;/p&gt;
&lt;h2&gt;The Tradeoffs&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fewer joins per query&lt;/td&gt;
&lt;td&gt;Update anomalies (same data in multiple places)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Faster read performance&lt;/td&gt;
&lt;td&gt;Larger storage footprint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simpler SQL for analysts&lt;/td&gt;
&lt;td&gt;Pipeline complexity (keeping redundant data in sync)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Better BI tool compatibility&lt;/td&gt;
&lt;td&gt;Risk of inconsistency if pipelines fail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agents write more accurate SQL&lt;/td&gt;
&lt;td&gt;More effort to maintain data quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Virtual Denormalization: The Middle Path&lt;/h2&gt;
&lt;p&gt;There&apos;s a way to get the query benefits of denormalization without the physical redundancy: SQL views.&lt;/p&gt;
&lt;p&gt;A view can join and flatten multiple normalized tables into a single logical table. Consumers query the view as if it&apos;s one wide table — simple SQL, no joins required. But the underlying data stays normalized. Update a customer&apos;s city in the customers table, and the view reflects the change automatically.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW v_orders_enriched AS
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    c.city AS customer_city,
    p.product_name,
    p.category AS product_category,
    o.quantity * o.unit_price AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Analysts query &lt;code&gt;v_orders_enriched&lt;/code&gt; without knowing the underlying structure. The join logic is defined once and reused by everyone.&lt;/p&gt;
&lt;p&gt;The tradeoff: views execute the joins at query time. For very large datasets, this can be slow. Platforms like &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-reflections-outsmart-traditional-materialized-views/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; solve this with Reflections — which physically materialize the view&apos;s results in an optimized format, updated automatically. Users still query the logical view, but the engine substitutes the pre-computed Reflection for performance. You get the simplicity of denormalization, the consistency of normalization, and the speed of materialization.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/08/virtual-denormalization.png&quot; alt=&quot;Virtual view acting as a denormalized layer over normalized source tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;Identify your most-queried report or dashboard. Count the joins in the underlying SQL. If there are more than five, create a denormalized view that flattens the data. Compare query performance before and after. If the view is still too slow for your SLA, adding a materialized acceleration layer (like Reflections) closes the gap.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Headless BI: How a Universal Semantic Layer Replaces Tool-Specific Models</title><link>https://iceberglakehouse.com/posts/2026-02-sl-headless-bi-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-headless-bi-semantic-layer/</guid><description>
![Headless BI — one semantic layer serving all consumers](/assets/images/semantic_layer/08/headless-bi.png)

Your organization uses Tableau for execu...</description><pubDate>Wed, 18 Feb 2026 16:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/08/headless-bi.png&quot; alt=&quot;Headless BI — one semantic layer serving all consumers&quot;&gt;&lt;/p&gt;
&lt;p&gt;Your organization uses Tableau for executive dashboards, Power BI for operational reports, and Python notebooks for data science. Revenue is defined in Tableau&apos;s calculated field, Power BI&apos;s DAX measure, and a SQL query inside a Jupyter notebook. Three tools. Three definitions. None of them match.&lt;/p&gt;
&lt;p&gt;This is what happens when semantic models are locked inside BI tools. Headless BI fixes it by pulling the definitions out.&lt;/p&gt;
&lt;h2&gt;The Problem with Tool-Specific Semantic Models&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/08/tool-lock-in.png&quot; alt=&quot;BI tool lock-in — metrics trapped in isolated silos&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every major BI tool comes with its own modeling layer. Looker has LookML. Tableau has the Data Model. Power BI has DAX and the tabular model. Each one defines metrics, relationships, and calculated fields in a proprietary format.&lt;/p&gt;
&lt;p&gt;This creates three problems:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Definition duplication.&lt;/strong&gt; Every metric must be defined in every tool. Revenue in Tableau. Revenue in Power BI. Revenue in the data science notebook. When the formula changes (say, a new exclusion rule is added), you update it in three places. Or you forget one, and your dashboards disagree.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tool lock-in.&lt;/strong&gt; Your metric definitions are trapped inside the tool&apos;s proprietary format. Switching from Tableau to a different visualization layer means rebuilding every metric from scratch. The data model doesn&apos;t migrate.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI agent exclusion.&lt;/strong&gt; When you add an AI agent to your stack, it can&apos;t access the Looker LookML definitions or the Power BI DAX measures. It has no semantic model to work with. It generates SQL based on raw table schemas and gets the formulas wrong.&lt;/p&gt;
&lt;h2&gt;What Headless BI Means&lt;/h2&gt;
&lt;p&gt;Headless BI is an architecture pattern where metric definitions and business logic are decoupled from the visualization layer. The &amp;quot;head&amp;quot; (the dashboard or chart) is separate from the &amp;quot;body&amp;quot; (the semantic definitions).&lt;/p&gt;
&lt;p&gt;In a headless architecture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Metrics are defined once in a platform-neutral semantic layer&lt;/li&gt;
&lt;li&gt;Definitions are exposed via standard interfaces: SQL, JDBC, ODBC, Arrow Flight, REST&lt;/li&gt;
&lt;li&gt;Any tool — Tableau, Power BI, Python, an AI agent, a custom app — connects to the same definitions&lt;/li&gt;
&lt;li&gt;Adding a new visualization tool requires zero metric migration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The semantic layer becomes a shared service. Visualization tools consume it. They don&apos;t own it.&lt;/p&gt;
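&lt;p&gt;Concretely, a metric defined once in the shared layer is just a governed SQL object. A sketch (table names and the exclusion rule are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Defined once; Tableau, Power BI, notebooks, and AI agents all query this view
CREATE VIEW metrics_monthly_revenue AS
SELECT
    DATE_TRUNC(&apos;month&apos;, order_date) AS revenue_month,
    SUM(quantity * unit_price * (1 - discount)) AS revenue
FROM orders
WHERE status &amp;lt;&amp;gt; &apos;cancelled&apos;   -- the exclusion rule lives here, not in each tool
GROUP BY DATE_TRUNC(&apos;month&apos;, order_date);
&lt;/code&gt;&lt;/pre&gt;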
&lt;h2&gt;Tool-Specific vs. Universal Semantic Layer&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Tool-Specific Model&lt;/th&gt;
&lt;th&gt;Universal Semantic Layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Where metrics are defined&lt;/td&gt;
&lt;td&gt;Inside each BI tool&lt;/td&gt;
&lt;td&gt;Centralized, tool-independent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Number of Revenue definitions&lt;/td&gt;
&lt;td&gt;One per tool&lt;/td&gt;
&lt;td&gt;One total&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Formula change process&lt;/td&gt;
&lt;td&gt;Update every tool&lt;/td&gt;
&lt;td&gt;Update once, propagates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New tool onboarding&lt;/td&gt;
&lt;td&gt;Rebuild all definitions&lt;/td&gt;
&lt;td&gt;Connect and query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent access&lt;/td&gt;
&lt;td&gt;No (locked in BI format)&lt;/td&gt;
&lt;td&gt;Yes (standard SQL interface)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portability&lt;/td&gt;
&lt;td&gt;Vendor-locked&lt;/td&gt;
&lt;td&gt;Open and interoperable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;What Composable Analytics Looks Like&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/08/composable-analytics.png&quot; alt=&quot;Composable analytics — modular blocks snapping together&quot;&gt;&lt;/p&gt;
&lt;p&gt;Headless BI is one piece of a broader shift called &lt;strong&gt;composable analytics&lt;/strong&gt;. Instead of buying a monolithic BI platform that bundles data modeling, metric definitions, and visualizations together, you assemble your analytics stack from modular, interchangeable components.&lt;/p&gt;
&lt;p&gt;The semantic layer is the metric module. Choose any visualization tool on top. Choose any data storage underneath. Swap components without rebuilding definitions.&lt;/p&gt;
&lt;p&gt;This modularity matters most for AI. An AI agent becomes a first-class consumer of the semantic layer, alongside dashboards and notebooks. It connects to the same interface, reads the same metric definitions, and gets the same answers. No special integration needed.&lt;/p&gt;
&lt;h2&gt;How This Works in Practice&lt;/h2&gt;
&lt;p&gt;Dremio functions as a universal semantic layer that any tool can consume. The architecture:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Virtual datasets (SQL views)&lt;/strong&gt; define business logic and metric calculations once&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wikis and Labels&lt;/strong&gt; document business context for human and AI consumers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-Grained Access Control&lt;/strong&gt; enforces security policies at the query level&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflections&lt;/strong&gt; optimize performance automatically for any consumer&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Connection options include ODBC, JDBC, Arrow Flight (for columnar high-speed clients), and REST API. A Tableau dashboard connects via ODBC. A Python notebook connects via Arrow Flight. Dremio&apos;s AI Agent &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;reads the Wikis and Labels&lt;/a&gt; to generate accurate SQL from natural language. All three hit the same virtual datasets. All three get the same answers.&lt;/p&gt;
&lt;p&gt;Because the entire semantic layer is built on open standards (Apache Iceberg for data, Apache Polaris for the catalog), the definitions aren&apos;t locked to Dremio&apos;s format. You can inspect, export, and query the same data with any Iceberg-compatible engine.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Count the number of places your organization defines its top metric (probably Revenue or Monthly Active Users). If that number is greater than one, you&apos;re paying a consistency tax every time someone changes the formula. A universal semantic layer eliminates that tax by defining it once and serving it everywhere.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Partition and Organize Data for Performance</title><link>https://iceberglakehouse.com/posts/2026-02-debp-partition-and-organize/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-partition-and-organize/</guid><description>
![Table data split into partitions by date with query scanning only the relevant partition](/assets/images/debp/07/partition-overview.png)

A table w...</description><pubDate>Wed, 18 Feb 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/07/partition-overview.png&quot; alt=&quot;Table data split into partitions by date with query scanning only the relevant partition&quot;&gt;&lt;/p&gt;
&lt;p&gt;A table with 500 million rows takes 45 seconds to query. After partitioning it by date, the same query — filtering on a single day — returns in 2 seconds. The SQL didn&apos;t change. The data didn&apos;t change. The only thing that changed was how the data was organized on disk.&lt;/p&gt;
&lt;p&gt;Performance in analytical workloads is almost never about faster hardware. It&apos;s about reading less data.&lt;/p&gt;
&lt;h2&gt;Read Less Data, Run Faster Queries&lt;/h2&gt;
&lt;p&gt;Analytical query engines scan data to answer queries. A full table scan reads every row, every column. But most queries only need a fraction of the data: this week&apos;s transactions, this region&apos;s customers, this product category&apos;s sales.&lt;/p&gt;
&lt;p&gt;Partitioning and data organization let the engine skip irrelevant data entirely:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Partition pruning.&lt;/strong&gt; The engine reads only the partitions that match the query&apos;s WHERE clause.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column pruning.&lt;/strong&gt; Columnar formats (Parquet, ORC) read only the requested columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Predicate pushdown.&lt;/strong&gt; Min/max statistics in file metadata let the engine skip files whose value ranges don&apos;t match the filter.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Combined, these techniques can reduce the data scanned from terabytes to megabytes. The fastest query is the one that reads the least data.&lt;/p&gt;
&lt;h2&gt;Partitioning Strategies&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Time-based partitioning.&lt;/strong&gt; Partition by date, hour, or month. This is the most common strategy because most analytical queries filter by time. A daily partition structure means a query for &amp;quot;last week&amp;quot; reads 7 partitions instead of scanning the entire table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Value-based partitioning.&lt;/strong&gt; Partition by a categorical column: region, source system, customer tier. This works when queries consistently filter on that column. A multi-tenant application might partition by tenant ID so each tenant&apos;s queries touch only their data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hash-based partitioning.&lt;/strong&gt; Distribute data evenly across N buckets using a hash function on a key column. This is useful for join-heavy workloads: two tables hashed on the same join key can be joined partition-to-partition without shuffling data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Composite partitioning.&lt;/strong&gt; Combine strategies: partition by date, then bucket by customer ID within each date. This handles queries that filter on date and join on customer ID.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choosing the right strategy:&lt;/strong&gt; Look at your most frequent queries. What columns appear in WHERE clauses and JOIN conditions? Those are your partition candidates. If 90% of queries filter by date, partition by date.&lt;/p&gt;
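&lt;p&gt;DDL syntax varies by engine and table format; in Spark SQL with Apache Iceberg, the composite strategy above can be sketched as (names illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Daily partitions plus 16 hash buckets on the join key
CREATE TABLE catalog.db.events (
    event_id    BIGINT,
    customer_id BIGINT,
    event_ts    TIMESTAMP,
    payload     STRING
)
USING iceberg
PARTITIONED BY (days(event_ts), bucket(16, customer_id));
&lt;/code&gt;&lt;/pre&gt;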
&lt;h2&gt;File-Level Organization&lt;/h2&gt;
&lt;p&gt;Partitioning controls which directory the query engine reads. File-level organization controls how efficiently it reads within that directory.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sorting.&lt;/strong&gt; Sort rows within each file by a frequently filtered column. If queries often filter &lt;code&gt;WHERE status = &apos;active&apos;&lt;/code&gt;, sorting by status clusters active rows together. The engine reads min/max metadata, sees that a file&apos;s status range is only &apos;active&apos;, and skips files that don&apos;t match.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/07/sorted-files.png&quot; alt=&quot;Sorted data within partitions enabling file-level skip based on min/max metadata&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Z-ordering.&lt;/strong&gt; When queries filter on multiple columns, linear sorting optimizes for only one. Z-ordering interleaves the sort order across multiple columns, enabling predicate pushdown on any combination of the Z-ordered columns. It&apos;s especially effective for 2-3 column filter combinations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;File sizing.&lt;/strong&gt; Target file sizes between 128 MB and 1 GB. Files too small (&amp;lt; 10 MB) create metadata overhead and excessive file-open operations. Files too large (&amp;gt; 2 GB) reduce parallelism and waste I/O when only a fraction of the file is needed.&lt;/p&gt;
&lt;h2&gt;Compaction: The Maintenance Task You Can&apos;t Skip&lt;/h2&gt;
&lt;p&gt;Streaming writes and frequent small batch appends create many small files. A partition with 10,000 files of 1 MB each is dramatically slower to query than the same data in 10 files of 1 GB each.&lt;/p&gt;
&lt;p&gt;Compaction merges small files into a smaller number of optimally sized ones. It&apos;s the data equivalent of defragmenting a disk.&lt;/p&gt;
&lt;p&gt;Run compaction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After streaming writes accumulate small files&lt;/li&gt;
&lt;li&gt;After many small batch appends&lt;/li&gt;
&lt;li&gt;On a regular schedule (daily or weekly) for active partitions&lt;/li&gt;
&lt;li&gt;Targeted at partitions where file counts exceed a threshold&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compaction also provides an opportunity to re-sort data within files, clean up deleted records (in formats that use soft deletes like Iceberg and Delta), and update file-level statistics.&lt;/p&gt;
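&lt;p&gt;How you trigger compaction depends on the table format and engine. With Apache Iceberg on Spark, for example, the &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure merges small files and can re-sort rows in the same pass (catalog and table names illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL catalog.system.rewrite_data_files(
  table =&amp;gt; &apos;db.events&apos;,
  strategy =&amp;gt; &apos;sort&apos;,
  sort_order =&amp;gt; &apos;status&apos;,
  options =&amp;gt; map(&apos;target-file-size-bytes&apos;, &apos;536870912&apos;)  -- ~512 MB targets
);
&lt;/code&gt;&lt;/pre&gt;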
&lt;h2&gt;Common Partitioning Mistakes&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Over-partitioning.&lt;/strong&gt; Partitioning by a high-cardinality column (user ID, transaction ID) creates millions of partitions, each with a few rows. The engine spends more time listing and opening files than reading data. Rule of thumb: keep individual partition sizes above 100 MB.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Under-partitioning.&lt;/strong&gt; A single partition for the entire table means every query scans everything. If your table has billions of rows and no partitions, even simple queries are slow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Misaligned partitions.&lt;/strong&gt; Partitioning by month when every query filters by day means the engine reads an entire month&apos;s data for a single-day query. Align partition granularity with query granularity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ignoring compaction.&lt;/strong&gt; Streaming into a table without compacting creates the small-file problem. Query performance degrades gradually until someone notices. Schedule compaction as part of pipeline maintenance.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/07/partition-mistakes.png&quot; alt=&quot;Common mistakes: too many partitions, wrong partition key, no compaction&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Identify your slowest analytical query. Check the table&apos;s partitioning strategy. If the table has no partitions, add one aligned with the query&apos;s most common WHERE clause. If it&apos;s already partitioned, check file sizes — if the average file is under 10 MB, run compaction. Measure before and after.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Modeling for Analytics: Optimize for Queries, Not Transactions</title><link>https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-for-analytics/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-for-analytics/</guid><description>
![OLTP normalized model vs. OLAP denormalized model side by side](/assets/images/data_modeling/07/analytics-data-modeling.png)

The data model that r...</description><pubDate>Wed, 18 Feb 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/07/analytics-data-modeling.png&quot; alt=&quot;OLTP normalized model vs. OLAP denormalized model side by side&quot;&gt;&lt;/p&gt;
&lt;p&gt;The data model that runs your production application is almost never the right model for analytics. Transactional systems are designed for fast writes — inserting orders, updating inventory, processing payments. Analytics systems are designed for fast reads — scanning millions of rows, aggregating across dimensions, filtering by date ranges.&lt;/p&gt;
&lt;p&gt;Using a transactional model for analytics is like using a filing cabinet when you need a search engine. The data is there, but finding answers takes too long.&lt;/p&gt;
&lt;h2&gt;Transactions vs. Analytics: Two Different Problems&lt;/h2&gt;
&lt;p&gt;Transactional (OLTP) workloads process many small operations: insert one order, update one account balance, delete one expired session. These models are normalized to Third Normal Form (3NF) or beyond — every piece of data stored once, redundancy eliminated, consistency enforced through constraints.&lt;/p&gt;
&lt;p&gt;Analytical (OLAP) workloads process few large operations: scan all orders for the last year, aggregate revenue by region and product category, calculate year-over-year growth. These models are denormalized — data is pre-joined, attributes are flattened, and the structure is optimized for scans rather than updates.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;OLTP Model&lt;/th&gt;
&lt;th&gt;OLAP Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Optimization target&lt;/td&gt;
&lt;td&gt;Write speed&lt;/td&gt;
&lt;td&gt;Read speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normalization&lt;/td&gt;
&lt;td&gt;3NF or higher&lt;/td&gt;
&lt;td&gt;Denormalized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table structure&lt;/td&gt;
&lt;td&gt;Narrow and many&lt;/td&gt;
&lt;td&gt;Wide and few&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joins per query&lt;/td&gt;
&lt;td&gt;Many (10-20)&lt;/td&gt;
&lt;td&gt;Few (3-5)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage format&lt;/td&gt;
&lt;td&gt;Row-oriented&lt;/td&gt;
&lt;td&gt;Columnar&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical query&lt;/td&gt;
&lt;td&gt;UPDATE one row&lt;/td&gt;
&lt;td&gt;SUM across millions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Why Normalized Models Slow Down Analytics&lt;/h2&gt;
&lt;p&gt;A normalized 3NF model might have 15 tables involved in answering &amp;quot;What was revenue by product category by month?&amp;quot; The query engine must join orders to order_items to products to categories to dates, applying filters and aggregations across each join.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/07/normalized-vs-denormalized-query.png&quot; alt=&quot;Chain of joins through normalized tables versus one wide scan through a denormalized table&quot;&gt;&lt;/p&gt;
&lt;p&gt;Each join adds latency. Each join also adds a point of failure — wrong join condition, missing foreign key, ambiguous column name. An AI agent generating SQL against a 15-table normalized model has far more opportunities to make a mistake than against a 4-table star schema.&lt;/p&gt;
&lt;p&gt;The fix is not to abandon normalization. Keep your OLTP model normalized for your application. But create a separate analytical model — denormalized, structured for queries, with pre-built joins and business-friendly column names — for reporting and analytics.&lt;/p&gt;
&lt;h2&gt;Designing for Read Performance&lt;/h2&gt;
&lt;p&gt;Analytical data models follow several patterns that optimize for read performance:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wide tables reduce joins.&lt;/strong&gt; Instead of &lt;code&gt;orders → customers → addresses → cities → states&lt;/code&gt;, create a single &lt;code&gt;fact_orders&lt;/code&gt; view with &lt;code&gt;customer_name&lt;/code&gt;, &lt;code&gt;customer_city&lt;/code&gt;, &lt;code&gt;customer_state&lt;/code&gt; included. Every join you eliminate saves query time and reduces complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pre-computed columns reduce repeated calculations.&lt;/strong&gt; If every report calculates &lt;code&gt;quantity * unit_price * (1 - discount)&lt;/code&gt; as &amp;quot;net revenue,&amp;quot; compute it once in the model and expose it as a column. This eliminates repeated formula definitions and ensures consistency.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistent naming improves discoverability.&lt;/strong&gt; Use &lt;code&gt;order_date&lt;/code&gt; instead of &lt;code&gt;dt&lt;/code&gt;. Use &lt;code&gt;customer_email&lt;/code&gt; instead of &lt;code&gt;email&lt;/code&gt;. When column names are self-explanatory, analysts find the right data faster, and AI agents generate more accurate SQL.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Date dimensions enable time-based analysis.&lt;/strong&gt; A date dimension with &lt;code&gt;fiscal_quarter&lt;/code&gt;, &lt;code&gt;is_weekend&lt;/code&gt;, &lt;code&gt;is_holiday&lt;/code&gt;, and &lt;code&gt;week_of_year&lt;/code&gt; makes time-based filtering trivial. Without it, every analyst writes a different &lt;code&gt;CASE WHEN MONTH(date) IN (1,2,3) THEN &apos;Q1&apos;&lt;/code&gt; expression.&lt;/p&gt;
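&lt;p&gt;A date dimension can be built once as a view over a calendar table; a sketch with illustrative names (day-of-week and week-of-year function names vary by engine):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW dim_date AS
SELECT
    calendar_date,
    EXTRACT(YEAR FROM calendar_date)    AS calendar_year,
    EXTRACT(QUARTER FROM calendar_date) AS calendar_quarter,
    EXTRACT(WEEK FROM calendar_date)    AS week_of_year,
    EXTRACT(DOW FROM calendar_date) IN (0, 6) AS is_weekend  -- PostgreSQL-style day-of-week
FROM calendar_days;
&lt;/code&gt;&lt;/pre&gt;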
&lt;h2&gt;Pre-Aggregation and Summary Tables&lt;/h2&gt;
&lt;p&gt;Not every query needs to scan raw data. For frequently run aggregations, pre-aggregated summary tables reduce query time from minutes to milliseconds.&lt;/p&gt;
&lt;p&gt;Common patterns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Daily summary&lt;/strong&gt;: Total revenue, order count, average order value per day per product category&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monthly snapshot&lt;/strong&gt;: Active customers, churned customers, MRR per segment&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rolling window&lt;/strong&gt;: 7-day and 30-day moving averages for key metrics&lt;/li&gt;
&lt;/ul&gt;
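&lt;p&gt;A daily summary along these lines might be built as follows (a sketch; the &lt;code&gt;fact_orders&lt;/code&gt; source table and its column names are assumptions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Pre-aggregated daily summary, rebuilt or refreshed on a schedule
CREATE TABLE summary_daily_sales AS
SELECT
    order_date,
    product_category,
    SUM(net_revenue) AS total_revenue,
    COUNT(*) AS order_count,
    AVG(net_revenue) AS avg_order_value
FROM fact_orders
GROUP BY order_date, product_category;
&lt;/code&gt;&lt;/pre&gt;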
&lt;p&gt;The tradeoff is maintenance. Every summary table needs a refresh pipeline, and stale summaries produce outdated numbers.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-reflections-outsmart-traditional-materialized-views/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; handle this automatically with Reflections — pre-computed aggregations and materializations that the query optimizer uses transparently. Users query the logical views; Dremio substitutes the fastest Reflection without the user knowing. No manual summary table management required.&lt;/p&gt;
&lt;h2&gt;Columnar Storage and Physical Layout&lt;/h2&gt;
&lt;p&gt;Analytics models benefit from columnar storage formats like Parquet:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column pruning&lt;/strong&gt;: Queries that touch 5 of 50 columns only read those 5 columns from disk&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compression&lt;/strong&gt;: Repeated values in a column (category names, status codes) compress efficiently&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vectorized processing&lt;/strong&gt;: Engines like Dremio (built on Apache Arrow) process columnar data in CPU-cache-friendly batches&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Physical layout decisions that matter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Partition by time&lt;/strong&gt;: Most analytics queries filter by date range. Partitioning by month or day lets the engine skip irrelevant data files entirely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sort by high-cardinality filters&lt;/strong&gt;: If queries frequently filter by &lt;code&gt;customer_id&lt;/code&gt; or &lt;code&gt;region&lt;/code&gt;, sorting data within partitions enables min/max pruning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compact regularly&lt;/strong&gt;: Small files from streaming inserts slow down scan performance. Compaction rewrites small files into larger, optimized ones.&lt;/li&gt;
&lt;/ul&gt;
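&lt;p&gt;For an Apache Iceberg table, these layout decisions can be expressed directly in DDL. A sketch using Spark SQL syntax (catalog and table names are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Partition by month so date-range filters can skip whole data files
CREATE TABLE lake.sales.fact_orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_date DATE,
    net_revenue DECIMAL(12, 2)
) USING iceberg
PARTITIONED BY (months(order_date));

-- Sort within partitions to enable min/max pruning on customer_id
ALTER TABLE lake.sales.fact_orders WRITE ORDERED BY customer_id;

-- Compact small files left behind by streaming inserts
CALL lake.system.rewrite_data_files(table =&gt; &apos;sales.fact_orders&apos;);
&lt;/code&gt;&lt;/pre&gt;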
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/07/analytics-architecture.png&quot; alt=&quot;Analytics model with wide tables, pre-aggregations, and columnar storage feeding dashboards&quot;&gt;&lt;/p&gt;
&lt;p&gt;Find your slowest dashboard. Look at the queries behind it. Count the joins, measure the scan size, and check whether the model is a normalized 3NF schema or one denormalized for analytics. If it&apos;s still using the transactional model, create an analytical view layer on top — a denormalized star schema with pre-computed columns, clear naming, and a date dimension. The dashboard performance improvement is usually immediate and significant.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Virtualization and the Semantic Layer: Query Without Copying</title><link>https://iceberglakehouse.com/posts/2026-02-sl-data-virtualization-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-data-virtualization-semantic-layer/</guid><description>
![Data virtualization — connecting sources to a unified semantic layer without copying](/assets/images/semantic_layer/07/data-virtualization.png)

Ev...</description><pubDate>Wed, 18 Feb 2026 15:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/07/data-virtualization.png&quot; alt=&quot;Data virtualization — connecting sources to a unified semantic layer without copying&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every data pipeline you build to move data from one system to another costs you three things: time to build it, money to run it, and freshness you lose while waiting for the next sync. Most analytics architectures accept this cost as unavoidable. It isn&apos;t.&lt;/p&gt;
&lt;p&gt;Data virtualization eliminates the movement. A semantic layer adds meaning and governance on top. Together, they give you a complete analytics layer over distributed data without copying a single table.&lt;/p&gt;
&lt;h2&gt;The Data Movement Tax&lt;/h2&gt;
&lt;p&gt;Traditional analytics architecture looks like this: data lives in operational databases, SaaS tools, and cloud storage. To analyze it, you extract it, transform it, and load it into a central warehouse. Every source gets an ETL pipeline. Every pipeline needs monitoring, error handling, and scheduling.&lt;/p&gt;
&lt;p&gt;The result: your analytics are always behind your operational data. The warehouse reflects what happened as of the last sync, not what&apos;s happening now. You pay for storage in both the source and the warehouse. And when you add a new source, you add a new pipeline.&lt;/p&gt;
&lt;p&gt;This model made sense when compute was expensive and storage was local. In a cloud-native world where compute is elastic and storage is cheap, the calculus changes.&lt;/p&gt;
&lt;h2&gt;What Data Virtualization Does&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/07/etl-vs-virtual.png&quot; alt=&quot;ETL pipelines vs. data virtualization — physical movement vs. lightweight connections&quot;&gt;&lt;/p&gt;
&lt;p&gt;Data virtualization lets you query data where it lives. Instead of copying data to a central location, you connect to each source and issue queries directly. A virtualization engine translates your SQL into the source&apos;s native protocol (JDBC for databases, S3 API for object storage, REST for SaaS), retrieves the data, and combines results from multiple sources into a single result set.&lt;/p&gt;
&lt;p&gt;From the user&apos;s perspective, all data appears in one unified namespace. A PostgreSQL production database, an S3 data lake full of Parquet files, and a Snowflake analytics warehouse all look like tables in the same catalog.&lt;/p&gt;
&lt;p&gt;The key principle is &amp;quot;no replication.&amp;quot; The data stays where it is. The queries go to the data, not the other way around.&lt;/p&gt;
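&lt;p&gt;In practice, a federated query spans sources in one statement. A sketch, with hypothetical source and table names:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- One query joining a PostgreSQL source and an S3/Parquet source
SELECT c.region, SUM(o.amount) AS total_sales
FROM postgres_prod.public.customers c
JOIN s3_lake.sales.orders o ON o.customer_id = c.customer_id
WHERE o.order_date &gt;= DATE &apos;2026-01-01&apos;
GROUP BY c.region;
&lt;/code&gt;&lt;/pre&gt;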
&lt;h2&gt;What a Semantic Layer Adds on Top&lt;/h2&gt;
&lt;p&gt;Virtualization solves the access problem. But access without context is dangerous. Raw access to 50 federated sources means 50 sources where analysts can write conflicting metric formulas, join tables incorrectly, and query sensitive columns without authorization.&lt;/p&gt;
&lt;p&gt;A semantic layer added on top of virtualization provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Metric definitions&lt;/strong&gt;: &amp;quot;Revenue&amp;quot; is calculated the same way regardless of which source the data comes from&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: Wikis describe what each federated table and column represents in business terms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Join paths&lt;/strong&gt;: Pre-defined relationships prevent analysts from guessing how tables connect&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access policies&lt;/strong&gt;: Row-level security and column masking enforced at the view level, even for sources that have no fine-grained access controls of their own&lt;/li&gt;
&lt;/ul&gt;
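&lt;p&gt;A governed view can express several of these guarantees at once: one metric formula, one fixed join path, and a masked column. A sketch (tables, columns, and the masking expression are illustrative, not a specific product&apos;s API):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW business.orders_governed AS
SELECT
    o.order_id,
    o.order_date,
    o.amount - o.refund_amount AS revenue,             -- single metric definition
    CONCAT(&apos;***&apos;, RIGHT(c.email, 4)) AS customer_email -- masked column
FROM postgres_prod.orders o
JOIN postgres_prod.customers c                         -- pre-defined join path
    ON o.customer_id = c.customer_id;
&lt;/code&gt;&lt;/pre&gt;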
&lt;p&gt;The combination is powerful: you get real-time access to all your data (virtualization) with consistent meaning and governance (semantic layer), and without data movement (no ETL).&lt;/p&gt;
&lt;h2&gt;Why They&apos;re Stronger Together&lt;/h2&gt;
&lt;p&gt;Each technology is useful alone. Together, they cover gaps neither can fill individually:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Virtualization Only&lt;/th&gt;
&lt;th&gt;Semantic Layer Only&lt;/th&gt;
&lt;th&gt;Both Together&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Access distributed data&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (limited to centralized data)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business definitions&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance enforcement&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero data movement&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time access&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends on data freshness&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unified namespace&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Virtualization without a semantic layer gives you raw SQL access to everything. Powerful for engineers. Risky for an organization. No metric consistency, no governance, no documentation.&lt;/p&gt;
&lt;p&gt;A semantic layer without virtualization covers only the data that&apos;s been moved to the platform&apos;s native storage. Every source that hasn&apos;t been ETL&apos;d is invisible to the layer. You get great governance over a subset of your data, and no governance over the rest.&lt;/p&gt;
&lt;h2&gt;How It Works in Practice&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-agentic-analytics-requires-federation-virtualization-and-the-lakehouse-how-dremio-delivers/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; is built on this architecture natively. It combines a high-performance virtualization engine (supporting 30+ source types including S3, ADLS, PostgreSQL, MySQL, MongoDB, Snowflake, and Redshift) with a full semantic layer (virtual datasets, Wikis, Labels, Fine-Grained Access Control).&lt;/p&gt;
&lt;p&gt;A practical query flow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;An analyst queries &lt;code&gt;business.revenue_by_region&lt;/code&gt; — a virtual dataset (view)&lt;/li&gt;
&lt;li&gt;Dremio&apos;s optimizer determines that this view joins data from PostgreSQL (customer records) and S3/Iceberg (order transactions)&lt;/li&gt;
&lt;li&gt;Predicate pushdown sends filter logic to each source (e.g., date-range filters are applied at the source)&lt;/li&gt;
&lt;li&gt;Results are combined using Apache Arrow&apos;s columnar format (zero serialization overhead)&lt;/li&gt;
&lt;li&gt;Row-level security filters the results based on the analyst&apos;s role&lt;/li&gt;
&lt;li&gt;If a Reflection (pre-computed copy) exists, Dremio substitutes it transparently for faster performance&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The analyst sees one table. Behind it, two sources, one semantic layer, and automatic performance optimization.&lt;/p&gt;
&lt;h2&gt;When to Virtualize vs. When to Materialize&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/07/virtualize-materialize.png&quot; alt=&quot;Virtualize vs. materialize decision framework&quot;&gt;&lt;/p&gt;
&lt;p&gt;Not every query should hit the source directly. The right architecture uses both strategies:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Virtualize when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The data changes frequently and freshness matters&lt;/li&gt;
&lt;li&gt;The dataset is queried infrequently (monthly reports, ad-hoc exploration)&lt;/li&gt;
&lt;li&gt;Compliance requires data to stay in its source system&lt;/li&gt;
&lt;li&gt;You&apos;re evaluating a new source before committing to a pipeline&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Materialize when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Multiple dashboards query the same dataset hundreds of times daily&lt;/li&gt;
&lt;li&gt;Joins across sources are slow because of network latency&lt;/li&gt;
&lt;li&gt;Table-level optimizations (compaction, partitioning, clustering) would improve performance&lt;/li&gt;
&lt;li&gt;AI workloads need scan-heavy access to large datasets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The practical strategy: start every source as a federated (virtual) connection. Monitor query frequency and performance. When a dataset crosses the line into &amp;quot;queried daily by multiple teams,&amp;quot; materialize it as an Apache Iceberg table. Dremio&apos;s Reflections automate this for the most common query patterns, creating materialized copies that the optimizer uses transparently.&lt;/p&gt;
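&lt;p&gt;Manually promoting a hot virtual dataset can be as simple as a CTAS into Iceberg. A sketch (catalog and view names are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Materialize a frequently queried view as an Apache Iceberg table
CREATE TABLE lake.analytics.revenue_by_region AS
SELECT * FROM business.revenue_by_region;
&lt;/code&gt;&lt;/pre&gt;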
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Count your current ETL pipelines. For each one, ask: does the destination system need a physical copy of this data, or does it just need to query it? Every pipeline that exists purely for query access is a candidate for virtualization. Replace the pipeline with a federated connection, add a semantic layer for context, and watch your infrastructure costs drop.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Batch vs. Streaming: Choose the Right Processing Model</title><link>https://iceberglakehouse.com/posts/2026-02-debp-batch-vs-streaming/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-batch-vs-streaming/</guid><description>
![Batch processing in scheduled groups vs streaming in continuous flow](/assets/images/debp/06/batch-vs-streaming.png)

&quot;We need real-time data.&quot; Thi...</description><pubDate>Wed, 18 Feb 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/06/batch-vs-streaming.png&quot; alt=&quot;Batch processing in scheduled groups vs streaming in continuous flow&quot;&gt;&lt;/p&gt;
&lt;p&gt;&amp;quot;We need real-time data.&amp;quot; This is one of the most expensive sentences in data engineering — because it&apos;s rarely true, and implementing it when it&apos;s not needed multiplies complexity, cost, and operational burden.&lt;/p&gt;
&lt;p&gt;The question isn&apos;t &amp;quot;should we use streaming?&amp;quot; The question is &amp;quot;how fresh does the data actually need to be, and what are we willing to pay for that freshness?&amp;quot;&lt;/p&gt;
&lt;h2&gt;The Question Isn&apos;t &amp;quot;Real-Time or Not&amp;quot; — It&apos;s &amp;quot;How Fresh?&amp;quot;&lt;/h2&gt;
&lt;p&gt;Freshness requirements exist on a spectrum:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Daily&lt;/strong&gt; (24-hour latency): Fine for financial reporting, historical trend analysis, ML training datasets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hourly&lt;/strong&gt; (1-hour latency): Adequate for operational dashboards, inventory tracking, marketing attribution&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Near-real-time&lt;/strong&gt; (1-15 minutes): Sufficient for user activity feeds, recommendation updates, alerting&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time&lt;/strong&gt; (sub-second): Required for fraud detection, stock trading, IoT safety systems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Most &amp;quot;we need real-time&amp;quot; requests are actually &amp;quot;we need hourly&amp;quot; or &amp;quot;we need 5-minute&amp;quot; requests. Clarifying the actual latency requirement before choosing an architecture prevents overengineering.&lt;/p&gt;
&lt;h2&gt;When Batch Wins&lt;/h2&gt;
&lt;p&gt;Batch processing is the default choice. Choose it unless you have a specific, justified reason to stream.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Simpler failure recovery.&lt;/strong&gt; A batch job fails at 3 AM. You fix the bug, rerun the job, and it reprocesses the same bounded dataset. Recovery is predictable and testable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Easier testing.&lt;/strong&gt; Given input dataset X, the output should be Y. You can version test datasets, run them locally, and assert exact outputs. Streaming test scenarios require simulating time, ordering, and late-arriving events — dramatically harder.&lt;/p&gt;
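&lt;p&gt;One common way to express the &amp;quot;given input X, expect output Y&amp;quot; property is a SQL assertion that returns rows only on failure (a sketch; table and column names are assumptions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Test: net revenue should never be negative after the transform
SELECT order_id, net_revenue
FROM analytics.fact_orders
WHERE net_revenue &lt; 0;
-- An empty result set means the assertion passed
&lt;/code&gt;&lt;/pre&gt;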
&lt;p&gt;&lt;strong&gt;Lower operational cost.&lt;/strong&gt; Batch jobs run on schedule, consume resources during execution, and release them when done. Streaming jobs run continuously, consuming resources 24/7 even during low-volume periods.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Better tooling maturity.&lt;/strong&gt; SQL-based transformations, orchestrators with DAG visualization, version-controlled dbt models — the batch ecosystem is deeper and more mature for most data warehouse workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Daily/hourly analytics, data warehouse loading, ML training data, compliance reporting, historical backfills.&lt;/p&gt;
&lt;h2&gt;When Streaming Wins&lt;/h2&gt;
&lt;p&gt;Streaming processing is the right choice when latency is measured in seconds and the cost of stale data is high.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fraud detection.&lt;/strong&gt; You can&apos;t batch-process credit card transactions once an hour. By the time you detect a fraudulent pattern, thousands of dollars are already gone. Fraud detection needs event-by-event evaluation in real time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IoT and safety systems.&lt;/strong&gt; A temperature sensor in a chemical plant detecting an abnormal reading can&apos;t wait for the next hourly batch. Alerting must happen in seconds.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real-time personalization.&lt;/strong&gt; Showing a user recommendations based on what they did 30 seconds ago requires streaming user events through a recommendation engine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational systems.&lt;/strong&gt; Inventory management, ride-sharing pricing, and live logistics tracking all need sub-minute data freshness to function correctly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Event-driven business logic, sub-second alerting, real-time user-facing features.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/06/latency-spectrum.png&quot; alt=&quot;Spectrum from batch to streaming with example use cases at each latency level&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Micro-Batch Middle Ground&lt;/h2&gt;
&lt;p&gt;Micro-batch processing runs batch jobs at very short intervals — every 1, 5, or 15 minutes. It captures most of the value of streaming with the simplicity of batch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Same tools, shorter intervals.&lt;/strong&gt; Your existing batch infrastructure (SQL transformations, orchestrators, testing frameworks) works unchanged. You just schedule runs more frequently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Most use cases are satisfied.&lt;/strong&gt; An operational dashboard refreshing every 5 minutes feels &amp;quot;real-time&amp;quot; to most business users. Marketing attribution updating every 15 minutes is fresh enough for campaign optimization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Significantly lower complexity.&lt;/strong&gt; No stream processing framework to learn. No state management. No watermark configuration. No event ordering challenges.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The tradeoff:&lt;/strong&gt; Micro-batch cannot achieve sub-second latency. If you genuinely need event-by-event processing under one second, you need a streaming framework.&lt;/p&gt;
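&lt;p&gt;Operationally, a micro-batch run is just an incremental load on a short schedule. A sketch using a high-water-mark upsert in Spark-style &lt;code&gt;MERGE&lt;/code&gt; syntax (table names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Scheduled every 5 minutes: upsert only rows changed since the last run
MERGE INTO analytics.orders t
USING (
    SELECT * FROM staging.orders
    WHERE updated_at &gt; (SELECT MAX(updated_at) FROM analytics.orders)
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
&lt;/code&gt;&lt;/pre&gt;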
&lt;h2&gt;A Decision Framework&lt;/h2&gt;
&lt;p&gt;Before choosing between batch, micro-batch, and streaming, answer these questions:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Batch&lt;/th&gt;
&lt;th&gt;Micro-batch&lt;/th&gt;
&lt;th&gt;Streaming&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Required latency&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost of stale data&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team streaming expertise&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Operational budget&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recovery complexity&lt;/td&gt;
&lt;td&gt;Simple rerun&lt;/td&gt;
&lt;td&gt;Simple rerun&lt;/td&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Start with batch.&lt;/strong&gt; If stakeholders say &amp;quot;we need real-time,&amp;quot; ask &amp;quot;what&apos;s the cost of a 15-minute delay?&amp;quot; If the answer is &amp;quot;that&apos;s fine,&amp;quot; micro-batch gives you near-real-time at batch-level complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Upgrade to streaming only when justified.&lt;/strong&gt; Sub-second latency requirements, event-driven business logic, and high-volume event processing are legitimate streaming use cases. &amp;quot;I want the dashboard to update faster&amp;quot; is usually not.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/06/decision-framework.png&quot; alt=&quot;Decision framework: start batch, upgrade to micro-batch, stream only when sub-second needed&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;List every pipeline in your platform and categorize it by actual (not requested) latency requirement. You&apos;ll likely find that 80% or more of your workloads are well-served by batch or micro-batch. Focus streaming investment on the 20% that genuinely needs it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Slowly Changing Dimensions: Types 1-3 with Examples</title><link>https://iceberglakehouse.com/posts/2026-02-dm-slowly-changing-dimensions/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-slowly-changing-dimensions/</guid><description>
![Dimension timeline showing attribute values changing across time periods](/assets/images/data_modeling/06/slowly-changing-dimensions.png)

Dimensio...</description><pubDate>Wed, 18 Feb 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/06/slowly-changing-dimensions.png&quot; alt=&quot;Dimension timeline showing attribute values changing across time periods&quot;&gt;&lt;/p&gt;
&lt;p&gt;Dimensions change. A customer moves cities. A product gets reclassified. An employee changes departments. How your data model handles these changes determines whether your historical reports are accurate or misleading.&lt;/p&gt;
&lt;p&gt;Slowly Changing Dimensions (SCDs) are design patterns for managing dimension attribute changes over time. The three most common types — overwrite, track history, and track one change — each make a different tradeoff between simplicity and historical accuracy.&lt;/p&gt;
&lt;h2&gt;Why Dimensions Change&lt;/h2&gt;
&lt;p&gt;Dimension tables store descriptive attributes: customer addresses, product categories, employee titles. These attributes don&apos;t stay constant. A customer who was in &amp;quot;New York&amp;quot; last quarter is now in &amp;quot;Chicago.&amp;quot; A product that was in &amp;quot;Accessories&amp;quot; is now in &amp;quot;Electronics.&amp;quot;&lt;/p&gt;
&lt;p&gt;If your fact table recorded sales tied to that customer, do last quarter&apos;s reports show &amp;quot;New York&amp;quot; (where the customer was at the time of the sale) or &amp;quot;Chicago&amp;quot; (where the customer is now)? The answer depends on your SCD type.&lt;/p&gt;
&lt;h2&gt;Type 1: Overwrite the Old Value&lt;/h2&gt;
&lt;p&gt;Type 1 updates the dimension row in place. The old value is gone.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE dim_customers
SET city = &apos;Chicago&apos;
WHERE customer_id = 1042;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After this update, every historical fact associated with customer 1042 now appears under &amp;quot;Chicago&amp;quot; — including sales that happened when the customer was in New York.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use Type 1:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Correcting errors (fixing a misspelled name)&lt;/li&gt;
&lt;li&gt;When historical accuracy for that attribute doesn&apos;t matter&lt;/li&gt;
&lt;li&gt;When the attribute rarely changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; No history. If someone asks &amp;quot;How much revenue came from New York customers last quarter?&amp;quot; they get the wrong answer because customer 1042 is now labeled Chicago.&lt;/p&gt;
&lt;h2&gt;Type 2: Track Full History&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/06/scd-type-2.png&quot; alt=&quot;SCD Type 2 showing multiple rows for the same entity with effective and expiry dates&quot;&gt;&lt;/p&gt;
&lt;p&gt;Type 2 inserts a new row for each change. The original row is marked as expired, and the new row becomes the current version.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Original row (now expired)
-- customer_key: 1042, customer_id: 1042, city: New York, effective_date: 2023-01-15, expiry_date: 2025-03-01, is_current: FALSE

-- New row
INSERT INTO dim_customers (customer_key, customer_id, city, effective_date, expiry_date, is_current)
VALUES (5001, 1042, &apos;Chicago&apos;, &apos;2025-03-01&apos;, &apos;9999-12-31&apos;, TRUE);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now each fact row references a specific version of the customer dimension. Sales from Q1 2024 reference customer_key 1042 (New York). Sales from Q2 2025 reference customer_key 5001 (Chicago). Historical reports are accurate.&lt;/p&gt;
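&lt;p&gt;The expired row shown in the comment comes from a companion update. The full Type 2 change is an UPDATE to close out the old version plus the INSERT of the new one (a sketch matching the example above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Step 1: expire the current version
UPDATE dim_customers
SET expiry_date = &apos;2025-03-01&apos;, is_current = FALSE
WHERE customer_id = 1042 AND is_current = TRUE;

-- Step 2: insert the new version with a fresh surrogate key
&lt;/code&gt;&lt;/pre&gt;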
&lt;p&gt;&lt;strong&gt;When to use Type 2:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When historical accuracy matters (most analytics use cases)&lt;/li&gt;
&lt;li&gt;When you need to analyze trends by attribute value over time&lt;/li&gt;
&lt;li&gt;When regulatory or audit requirements demand change tracking&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; The dimension table grows. A customer who changes city three times has three rows. Queries must filter on &lt;code&gt;is_current = TRUE&lt;/code&gt; for current-state analysis, or join on date ranges for point-in-time analysis. This adds complexity to every query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Surrogate keys are essential.&lt;/strong&gt; The natural business key (customer_id = 1042) appears in multiple rows. A surrogate key (customer_key, auto-incremented) uniquely identifies each version. Fact tables reference the surrogate key, not the natural key.&lt;/p&gt;
&lt;h2&gt;Type 3: Track One Change&lt;/h2&gt;
&lt;p&gt;Type 3 adds a column for the previous value instead of adding a row.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE dim_customers ADD COLUMN previous_city VARCHAR(100);

UPDATE dim_customers
SET previous_city = city, city = &apos;Chicago&apos;
WHERE customer_id = 1042;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The table now has both &lt;code&gt;city = &apos;Chicago&apos;&lt;/code&gt; and &lt;code&gt;previous_city = &apos;New York&apos;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use Type 3:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When you need quick access to both the current and immediately prior value&lt;/li&gt;
&lt;li&gt;When only one level of history matters&lt;/li&gt;
&lt;li&gt;When the dimension changes infrequently&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; You only track one change deep. If the customer moves again, the previous value is overwritten. Type 3 is rarely used in practice because most use cases require either no history (Type 1) or full history (Type 2).&lt;/p&gt;
&lt;h2&gt;Choosing the Right Type&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Type 1 (Overwrite)&lt;/th&gt;
&lt;th&gt;Type 2 (New Row)&lt;/th&gt;
&lt;th&gt;Type 3 (New Column)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;History preserved&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;One level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dimension growth&lt;/td&gt;
&lt;td&gt;No growth&lt;/td&gt;
&lt;td&gt;Grows over time&lt;/td&gt;
&lt;td&gt;No growth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query complexity&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Moderate (date filtering)&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Error corrections&lt;/td&gt;
&lt;td&gt;Trend analysis&lt;/td&gt;
&lt;td&gt;Before/after comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage impact&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementation effort&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Most analytics organizations use &lt;strong&gt;Type 2 as the default&lt;/strong&gt; and Type 1 for error corrections. Type 3 is a niche choice for specific before/after reporting needs.&lt;/p&gt;
&lt;p&gt;In a lakehouse environment, Iceberg&apos;s time-travel feature provides an implicit form of historical tracking at the table level. You can query any past snapshot of a table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM dim_customers FOR SYSTEM_TIME AS OF &apos;2024-06-15T00:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This doesn&apos;t replace SCD Type 2 (which tracks attribute-level changes with effective dates), but it provides a safety net for point-in-time analysis.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; support both approaches. SQL views can present a current-state view (filtering &lt;code&gt;WHERE is_current = TRUE&lt;/code&gt;) or an as-of view (joining on effective dates). Wikis document which SCD type each dimension uses, giving AI agents and analysts the context they need to write correct queries.&lt;/p&gt;
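&lt;p&gt;The two view patterns described here can be sketched as follows (view and column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Current-state view: exactly one row per customer
CREATE VIEW dim_customers_current AS
SELECT * FROM dim_customers WHERE is_current = TRUE;

-- As-of logic: the version in effect on a given date
SELECT *
FROM dim_customers
WHERE DATE &apos;2024-06-15&apos; BETWEEN effective_date AND expiry_date;
&lt;/code&gt;&lt;/pre&gt;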
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/06/scd-decision-guide.png&quot; alt=&quot;Choosing between SCD types based on reporting requirements and complexity tolerance&quot;&gt;&lt;/p&gt;
&lt;p&gt;Audit your dimension tables. For each one, decide: Does historical accuracy matter for this attribute? If yes, implement Type 2. If the attribute changes rarely and history doesn&apos;t matter, Type 1 is sufficient. Document your choice — when the next engineer encounters the dimension, they need to know whether they&apos;re looking at current state or historical versions.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Role of the Semantic Layer in Data Governance</title><link>https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-data-governance/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-data-governance/</guid><description>
![Data governance through a semantic layer — centralized policies and documentation](/assets/images/semantic_layer/06/governance-semantic.png)

Most ...</description><pubDate>Wed, 18 Feb 2026 14:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/06/governance-semantic.png&quot; alt=&quot;Data governance through a semantic layer — centralized policies and documentation&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most organizations have a data governance policy. It lives in a Confluence page. It defines who owns what data, what terms mean, and who should have access. And almost nobody follows it, because it&apos;s not enforced where queries actually run.&lt;/p&gt;
&lt;p&gt;A semantic layer changes that. It moves governance from a document into the query path, where every rule is applied automatically, for every user, through every tool.&lt;/p&gt;
&lt;h2&gt;Governance on Paper vs. Governance in Practice&lt;/h2&gt;
&lt;p&gt;Data governance fails when it depends on people doing the right thing manually. A policy says &amp;quot;Revenue means completed orders minus refunds.&amp;quot; An analyst writes a slightly different formula. A dashboard uses the wrong table. An AI agent invents its own definition. The governance policy exists. Nobody follows it. And the organization makes decisions on inconsistent data.&lt;/p&gt;
&lt;p&gt;The root cause isn&apos;t that people are careless. It&apos;s that governance is separated from the systems people actually use to query data. Enforcement happens in a side channel — documentation, review processes, audit logs — not in the query itself.&lt;/p&gt;
&lt;h2&gt;Centralized Definitions Eliminate Conflicting Metrics&lt;/h2&gt;
&lt;p&gt;A semantic layer solves the definition problem by making the governance policy code.&lt;/p&gt;
&lt;p&gt;Revenue isn&apos;t a paragraph in a wiki. It&apos;s a SQL view:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW business.revenue AS
SELECT
    OrderDate,
    Region,
    SUM(OrderTotal) AS Revenue
FROM silver.orders_enriched
WHERE Status = &apos;completed&apos; AND Refunded = FALSE
GROUP BY OrderDate, Region;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every dashboard, notebook, and AI agent that needs Revenue queries this view. There&apos;s no alternative formula to use. The semantic layer IS the governance for this metric.&lt;/p&gt;
&lt;p&gt;When the definition changes (say, a new refund category is added), the view is updated once, and every consumer gets the new logic automatically. No rollout. No migration. No &amp;quot;did everyone update their dashboard?&amp;quot;&lt;/p&gt;
&lt;h2&gt;Access Policies Enforced at Query Time&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/06/governance-enforcement.png&quot; alt=&quot;All query paths routing through a single governance enforcement gate&quot;&gt;&lt;/p&gt;
&lt;p&gt;The second governance gap: access control. Most organizations enforce security at the BI tool level. Tableau restricts who sees which dashboard. Power BI applies row-level filters. But if someone opens a SQL client and queries the underlying table directly, those filters don&apos;t apply.&lt;/p&gt;
&lt;p&gt;A semantic layer enforces policies at a lower level. When access control exists in the semantic layer, it applies to every query path:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Path&lt;/th&gt;
&lt;th&gt;BI-Level Security&lt;/th&gt;
&lt;th&gt;Semantic Layer Security&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL notebook&lt;/td&gt;
&lt;td&gt;Not enforced&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent&lt;/td&gt;
&lt;td&gt;Not enforced&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API/programmatic access&lt;/td&gt;
&lt;td&gt;Not enforced&lt;/td&gt;
&lt;td&gt;Enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dremio implements this through &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Fine-Grained Access Control (FGAC)&lt;/a&gt;: policies defined as UDFs that filter rows and mask columns based on the querying user&apos;s role. These policies are applied at the virtual dataset (view) level. A regional manager queries &lt;code&gt;business.revenue&lt;/code&gt; and sees only their region. A data engineer sees all regions. Same view, same SQL, different results based on identity.&lt;/p&gt;
&lt;p&gt;This approach eliminates the &amp;quot;security gap&amp;quot; that appears when users bypass BI tools. Every route to the data flows through the semantic layer. Every route inherits the policies.&lt;/p&gt;
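&lt;p&gt;The pattern can be sketched as a boolean UDF attached to the view. The syntax below is illustrative and varies by platform, and &lt;code&gt;lookup_region&lt;/code&gt; is a hypothetical helper mapping a user to their region:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Returns TRUE for rows the querying user is allowed to see
CREATE FUNCTION region_filter(region VARCHAR)
RETURNS BOOLEAN
RETURN is_member(&apos;data-engineers&apos;) OR region = lookup_region(query_user());

-- Attach the policy to the view; every query path now inherits it
ALTER VIEW business.revenue ADD ROW ACCESS POLICY region_filter(Region);
&lt;/code&gt;&lt;/pre&gt;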
&lt;h2&gt;Lineage and Accountability Through Views&lt;/h2&gt;
&lt;p&gt;The layered view architecture (Bronze → Silver → Gold) that a semantic layer uses is inherently traceable. Every Gold metric traces back to its Silver business logic, which traces back to the Bronze source mapping, which traces back to raw data.&lt;/p&gt;
&lt;p&gt;This traceability matters for compliance. When an auditor asks &amp;quot;Where does your Revenue number come from?&amp;quot;, you don&apos;t search through dashboards and notebooks. You follow the view chain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gold.monthly_revenue_by_region&lt;/code&gt; → references &lt;code&gt;silver.orders_enriched&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;silver.orders_enriched&lt;/code&gt; → joins &lt;code&gt;bronze.orders_raw&lt;/code&gt; with &lt;code&gt;bronze.customers_raw&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bronze.orders_raw&lt;/code&gt; → maps to &lt;code&gt;production.public.orders&lt;/code&gt; in PostgreSQL&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every step is documented. Every transformation is visible. The lineage isn&apos;t reconstructed after the fact — it&apos;s structural.&lt;/p&gt;
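&lt;p&gt;Structurally, that chain is nothing more than view definitions referencing each other (join keys and column lists here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW silver.orders_enriched AS
SELECT o.*, c.Region
FROM bronze.orders_raw o
JOIN bronze.customers_raw c ON o.CustomerId = c.CustomerId;

CREATE VIEW gold.monthly_revenue_by_region AS
SELECT DATE_TRUNC(&apos;month&apos;, OrderDate) AS Month, Region, SUM(OrderTotal) AS Revenue
FROM silver.orders_enriched
WHERE Status = &apos;completed&apos; AND Refunded = FALSE
GROUP BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;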
&lt;h2&gt;Documentation as a Governance Tool&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/06/governance-labels.png&quot; alt=&quot;Data governance labels and tags applied to tables for compliance&quot;&gt;&lt;/p&gt;
&lt;p&gt;Governance is also about discoverability. Can someone find the right dataset without messaging five people? Can they tell whether a view is production-ready or experimental?&lt;/p&gt;
&lt;p&gt;Two mechanisms handle this in a semantic layer:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wikis&lt;/strong&gt; attach human-readable (and AI-readable) descriptions to tables, columns, and views. They explain what data represents, where it comes from, and any caveats. A column named &lt;code&gt;cltv&lt;/code&gt; gets a description: &amp;quot;Customer Lifetime Value, calculated as total revenue from first purchase to current date, excluding refunds.&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Labels&lt;/strong&gt; categorize data for governance workflows. A label like &amp;quot;PII&amp;quot; triggers automatic column masking. A label like &amp;quot;Certified&amp;quot; indicates the view has been reviewed and approved for production use. A label like &amp;quot;Deprecated&amp;quot; warns consumers to migrate to the replacement.&lt;/p&gt;
&lt;p&gt;For organizations with thousands of datasets, manual documentation is impractical. Dremio&apos;s generative AI auto-generates Wiki descriptions by sampling table data and suggests Labels based on column content. This bootstraps documentation to 70% coverage automatically. The data team fills in what the AI misses.&lt;/p&gt;
&lt;h2&gt;Certification and Change Management&lt;/h2&gt;
&lt;p&gt;Not all views are equal. A semantic layer should distinguish between views that are experimental, under review, and production-ready.&lt;/p&gt;
&lt;p&gt;A practical certification workflow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Draft&lt;/strong&gt;: New view created by an analyst. Not yet reviewed. Labeled &amp;quot;Draft.&amp;quot;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reviewed&lt;/strong&gt;: View reviewed by the data team. Business logic validated. Documentation complete.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Certified&lt;/strong&gt;: View approved for production use. Labeled &amp;quot;Certified.&amp;quot; Available in production dashboards and to AI agents.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each Certified view should have a documented owner — the person accountable for its accuracy and freshness. When business requirements change, the owner updates the view and documentation together. Changes are reviewed before the &amp;quot;Certified&amp;quot; label is reapplied.&lt;/p&gt;
&lt;p&gt;This workflow doesn&apos;t require advanced tooling. Labels, Wikis, and a team agreement on the process are sufficient. What matters is that governance is visible inside the semantic layer, not tracked in a separate system.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Audit your top 10 business metrics. For each one, ask three questions: Is the formula defined in one place? Is access control enforced at the query level (not just the BI tool)? Can you trace the number back to its raw source in under 60 seconds? Every &amp;quot;no&amp;quot; is a governance gap that a semantic layer closes.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Schema Evolution Without Breaking Consumers</title><link>https://iceberglakehouse.com/posts/2026-02-debp-schema-evolution/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-schema-evolution/</guid><description>
![Schema as a contract between producers and consumers with version tracking](/assets/images/debp/05/schema-contract.png)

A source team renames a co...</description><pubDate>Wed, 18 Feb 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/05/schema-contract.png&quot; alt=&quot;Schema as a contract between producers and consumers with version tracking&quot;&gt;&lt;/p&gt;
&lt;p&gt;A source team renames a column from &lt;code&gt;user_id&lt;/code&gt; to &lt;code&gt;customer_id&lt;/code&gt;. Twelve hours later, five dashboards show blank values, two ML pipelines fail, and the data engineering team spends the morning tracing a problem that could have been prevented with one rule: treat your schema like an API.&lt;/p&gt;
&lt;p&gt;Schema evolution is the practice of changing data structures without breaking the systems that depend on them. Get it right, and your data platform stays flexible. Get it wrong, and every schema change becomes an emergency.&lt;/p&gt;
&lt;h2&gt;Your Schema Is an API&lt;/h2&gt;
&lt;p&gt;When an application team changes a REST API endpoint, they version it. They deprecate the old version. They give consumers time to migrate. They don&apos;t silently rename fields and hope nobody notices.&lt;/p&gt;
&lt;p&gt;Data schemas deserve the same discipline. Your columns are fields. Your tables are endpoints. Your downstream consumers — dashboards, ML pipelines, reports, other pipelines — are API clients. When you change the schema, you change the contract.&lt;/p&gt;
&lt;p&gt;The difference: API changes are usually intentional and reviewed. Schema changes often happen accidentally — a source system updates its export format, an engineer renames a column for readability, a new data type is introduced. Without guardrails, these changes propagate downstream silently.&lt;/p&gt;
&lt;h2&gt;Safe vs. Breaking Changes&lt;/h2&gt;
&lt;p&gt;Not all schema changes carry the same risk:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Backward-compatible (safe) changes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adding a new optional column with a default value&lt;/li&gt;
&lt;li&gt;Widening a data type (INT to BIGINT, FLOAT to DOUBLE)&lt;/li&gt;
&lt;li&gt;Adding documentation or metadata to columns&lt;/li&gt;
&lt;li&gt;Reordering columns (if consumers reference by name, not position)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Breaking changes:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Removing a column that consumers reference&lt;/li&gt;
&lt;li&gt;Renaming a column without maintaining the old name&lt;/li&gt;
&lt;li&gt;Narrowing a data type (BIGINT to INT — values may overflow)&lt;/li&gt;
&lt;li&gt;Changing the semantic meaning of a column (e.g., &lt;code&gt;revenue&lt;/code&gt; from gross to net)&lt;/li&gt;
&lt;li&gt;Changing nullability (nullable to non-nullable breaks inserts with nulls)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The rule: backward-compatible changes can be deployed without coordination. Breaking changes require a migration plan.&lt;/p&gt;
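&lt;p&gt;In DDL terms, the two categories look like this (syntax varies by engine; names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Backward-compatible: existing queries keep working
ALTER TABLE customers ADD COLUMN loyalty_tier VARCHAR DEFAULT &apos;none&apos;;
ALTER TABLE orders ALTER COLUMN item_count SET DATA TYPE BIGINT;  -- widen INT to BIGINT

-- Breaking: requires a migration plan first
ALTER TABLE customers DROP COLUMN user_id;                   -- referencing queries fail
ALTER TABLE customers RENAME COLUMN user_id TO customer_id;  -- same effect as a drop
&lt;/code&gt;&lt;/pre&gt;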
&lt;h2&gt;The Additive-Only Pattern&lt;/h2&gt;
&lt;p&gt;The simplest schema evolution strategy: never remove or rename columns. Only add new ones.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/05/additive-evolution.png&quot; alt=&quot;Additive schema evolution: columns only added, never removed or renamed&quot;&gt;&lt;/p&gt;
&lt;p&gt;When a column needs to be replaced:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Add the new column alongside the old one&lt;/li&gt;
&lt;li&gt;Update producers to populate both columns&lt;/li&gt;
&lt;li&gt;Migrate consumers to the new column one at a time&lt;/li&gt;
&lt;li&gt;Once all consumers have migrated, mark the old column as deprecated&lt;/li&gt;
&lt;li&gt;Remove the old column only after a deprecation period (e.g., 90 days)&lt;/li&gt;
&lt;/ol&gt;
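&lt;p&gt;The replacement steps above, sketched in SQL for the &lt;code&gt;user_id&lt;/code&gt; to &lt;code&gt;customer_id&lt;/code&gt; rename (syntax illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Step 1: add the new column alongside the old one
ALTER TABLE customers ADD COLUMN customer_id BIGINT;

-- Step 2: backfill, then have producers write both columns going forward
UPDATE customers SET customer_id = user_id WHERE customer_id IS NULL;

-- Step 5: only after every consumer has migrated and the deprecation period has passed
ALTER TABLE customers DROP COLUMN user_id;
&lt;/code&gt;&lt;/pre&gt;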
&lt;p&gt;This pattern is boring — and that&apos;s the point. Boring is reliable. Adding a column never breaks existing queries. Consumers that don&apos;t need the new column ignore it. Consumers that do need it can adopt it on their own schedule.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Table width grows over time. Schemas accumulate deprecated columns. This is an acceptable cost compared to production outages.&lt;/p&gt;
&lt;h2&gt;Schema Versioning and Migration&lt;/h2&gt;
&lt;p&gt;For changes that can&apos;t be additive (fundamental restructuring, data model migrations), use explicit versioning:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version in the table name.&lt;/strong&gt; &lt;code&gt;customers_v1&lt;/code&gt;, &lt;code&gt;customers_v2&lt;/code&gt; coexist. Consumers migrate from v1 to v2 at their own pace. A view named &lt;code&gt;customers&lt;/code&gt; points to the current version.&lt;/p&gt;
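&lt;p&gt;For example (a sketch; the view is swapped when v2 becomes the default):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- customers_v1 and customers_v2 coexist; the unversioned name tracks the current version
CREATE OR REPLACE VIEW customers AS
SELECT * FROM customers_v2;
&lt;/code&gt;&lt;/pre&gt;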
&lt;p&gt;&lt;strong&gt;Version in metadata.&lt;/strong&gt; Store a schema version field in each record or partition. Consumers check the version and apply the appropriate parsing logic.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema registries.&lt;/strong&gt; Centralized systems that store and validate schemas. Producers register their schema. Consumers declare their expected schema. The registry checks compatibility and rejects breaking changes.&lt;/p&gt;
&lt;p&gt;Schema registries enforce rules automatically:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;BACKWARD compatible: new schema can read data written by old schema&lt;/li&gt;
&lt;li&gt;FORWARD compatible: old schema can read data written by new schema&lt;/li&gt;
&lt;li&gt;FULL compatible: both backward and forward compatible&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Contract Enforcement at Pipeline Boundaries&lt;/h2&gt;
&lt;p&gt;Don&apos;t rely on conventions (&amp;quot;we don&apos;t rename columns&amp;quot;). Enforce contracts programmatically at pipeline boundaries:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At ingestion.&lt;/strong&gt; Compare the incoming data schema against the expected schema. If columns are missing, added, or retyped, log the difference and alert. For safe changes, proceed and notify. For breaking changes, halt and quarantine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At transformation.&lt;/strong&gt; Validate that every column referenced in SQL or transformation logic exists in the input schema. Catch missing-column errors at validation time, not at runtime.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At serving.&lt;/strong&gt; Validate that output schemas match the contracts expected by consumers. If a downstream dashboard expects column &lt;code&gt;revenue&lt;/code&gt;, verify it exists and has the correct type before the pipeline marks the job as successful.&lt;/p&gt;
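&lt;p&gt;A serving-side check can be as simple as a query against the catalog. This is a sketch; &lt;code&gt;information_schema&lt;/code&gt; details and the &lt;code&gt;VALUES&lt;/code&gt; syntax vary by engine, and the table and column names are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Any row returned is a contract violation: fail the job before marking it successful
WITH expected(column_name, data_type) AS (
    VALUES (&apos;revenue&apos;, &apos;DECIMAL&apos;),
           (&apos;order_date&apos;, &apos;DATE&apos;)
)
SELECT e.column_name
FROM expected e
LEFT JOIN information_schema.columns c
  ON c.table_name = &apos;monthly_revenue&apos;
 AND c.column_name = e.column_name
WHERE c.column_name IS NULL OR c.data_type &lt;&gt; e.data_type;
&lt;/code&gt;&lt;/pre&gt;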
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/05/contract-enforcement.png&quot; alt=&quot;Contract enforcement at pipeline boundaries: ingestion, transformation, serving&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Document the schema of your five most critical tables: column names, types, nullability, and a one-line description. That&apos;s your version 1 contract. Set up an automated check that compares incoming data against this contract and alerts on any deviation. You&apos;ll catch the next breaking change before it breaks anything.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Dimensional Modeling: Facts, Dimensions, and Grains</title><link>https://iceberglakehouse.com/posts/2026-02-dm-dimensional-modeling/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-dimensional-modeling/</guid><description>
![Dimensional model showing a central fact table connected to surrounding dimension tables](/assets/images/data_modeling/05/dimensional-modeling.png)...</description><pubDate>Wed, 18 Feb 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/05/dimensional-modeling.png&quot; alt=&quot;Dimensional model showing a central fact table connected to surrounding dimension tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;Dimensional modeling is the most widely used approach for organizing analytics data. Developed by Ralph Kimball, it structures data into two types of tables: facts (what happened) and dimensions (the context around what happened). The technique optimizes for query speed and business readability, not for storage efficiency or transactional integrity.&lt;/p&gt;
&lt;p&gt;If your goal is to answer business questions quickly and consistently, dimensional modeling is where you start.&lt;/p&gt;
&lt;h2&gt;Facts and Dimensions: The Two Building Blocks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt; store measurable events. Each row represents something that happened: a sale, a click, a shipment, a login. Fact tables are narrow (a few foreign keys and numeric measures) and deep (millions or billions of rows).&lt;/p&gt;
&lt;p&gt;A typical sales fact table might look like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE fact_sales (
    sale_id BIGINT,
    date_key INT,
    customer_key INT,
    product_key INT,
    store_key INT,
    quantity INT,
    unit_price DECIMAL(10,2),
    total_amount DECIMAL(12,2)
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt; provide context. They describe the &amp;quot;who, what, where, when, and how&amp;quot; behind each fact. Dimension tables are wide (many descriptive columns) and shallow (thousands to millions of rows).&lt;/p&gt;
&lt;p&gt;A customer dimension might include: customer_name, email, signup_date, city, state, country, segment, lifetime_value, acquisition_channel.&lt;/p&gt;
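&lt;p&gt;As a sketch, that dimension is wide and descriptive where the fact table is narrow:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE dim_customers (
    customer_key INT,             -- surrogate key referenced by fact tables
    customer_id VARCHAR,          -- natural business key from the source system
    customer_name VARCHAR,
    email VARCHAR,
    signup_date DATE,
    city VARCHAR,
    state VARCHAR,
    country VARCHAR,
    segment VARCHAR,
    lifetime_value DECIMAL(12,2),
    acquisition_channel VARCHAR
);
&lt;/code&gt;&lt;/pre&gt;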
&lt;p&gt;Every analysis query joins a fact table to one or more dimension tables. &amp;quot;Revenue by region&amp;quot; joins the sales fact to the geography dimension. &amp;quot;Revenue by product category&amp;quot; joins the sales fact to the product dimension. The fact table provides the number; the dimensions provide the labels.&lt;/p&gt;
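&lt;p&gt;For example, assuming a &lt;code&gt;dim_customers&lt;/code&gt; dimension keyed on &lt;code&gt;customer_key&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- The fact table provides the number; the dimension provides the label
SELECT d.country, SUM(f.total_amount) AS revenue
FROM fact_sales f
JOIN dim_customers d ON f.customer_key = d.customer_key
GROUP BY d.country;
&lt;/code&gt;&lt;/pre&gt;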
&lt;h2&gt;Declaring the Grain&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/05/grain-declaration.png&quot; alt=&quot;Grain declaration as the foundation — one row per transaction per line item&quot;&gt;&lt;/p&gt;
&lt;p&gt;The grain is the most important decision in dimensional modeling. It declares what one row in your fact table represents.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;One row per order line item&amp;quot; — each product within an order gets its own row&lt;/li&gt;
&lt;li&gt;&amp;quot;One row per daily customer session&amp;quot; — each customer&apos;s daily activity is aggregated into one row&lt;/li&gt;
&lt;li&gt;&amp;quot;One row per monthly account balance&amp;quot; — snapshot taken once per month&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Getting the grain right matters because:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Too coarse: You lose detail. If your grain is &amp;quot;one row per order,&amp;quot; you can&apos;t analyze individual line items.&lt;/li&gt;
&lt;li&gt;Too fine: You create an enormous table that&apos;s expensive to query. If your grain is &amp;quot;one row per page view&amp;quot; in a high-traffic application, the table grows by billions of rows per month.&lt;/li&gt;
&lt;li&gt;Inconsistent: If some rows represent individual items and others represent aggregated totals, every calculation produces wrong results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Declare the grain first. Then identify which dimensions apply at that grain, and which numeric measures belong in the fact table. This order is not optional — skip it, and the model breaks down.&lt;/p&gt;
&lt;h2&gt;Designing Fact Tables&lt;/h2&gt;
&lt;p&gt;Three types of fact tables handle different analytical patterns:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Transaction facts&lt;/strong&gt; record individual events. One row per sale, one row per click. This is the most common type. It supports the most detailed analysis but produces the largest tables.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Periodic snapshot facts&lt;/strong&gt; capture the state at regular intervals. One row per account per month. Useful for balance-tracking, inventory levels, and any measure that accumulates over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Accumulating snapshot facts&lt;/strong&gt; track the lifecycle of a process. One row per order, with date columns for each milestone (order_placed, payment_received, shipped, delivered). Useful for analyzing process efficiency and identifying bottlenecks.&lt;/p&gt;
&lt;p&gt;Best practices for fact tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keep facts additive when possible (SUM-able across dimensions)&lt;/li&gt;
&lt;li&gt;Avoid storing text in fact tables — that belongs in dimensions&lt;/li&gt;
&lt;li&gt;Use surrogate keys (integers) for dimension references, not natural keys&lt;/li&gt;
&lt;li&gt;Never mix grains in one fact table&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Designing Dimension Tables&lt;/h2&gt;
&lt;p&gt;Well-designed dimensions follow predictable patterns:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Denormalize.&lt;/strong&gt; Include all descriptive attributes in one table. Product name, category, subcategory, brand, manufacturer, department — all in dim_products. This eliminates joins and makes queries readable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Use surrogate keys.&lt;/strong&gt; Assign an integer key (product_key) that acts as the primary key. Keep the natural business key (product_sku) as a regular attribute. Surrogate keys insulate your model from source system key changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Add audit columns.&lt;/strong&gt; Include effective_date, expiry_date, and is_current flag for tracking changes over time (Slowly Changing Dimensions — covered in a separate article).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Include &amp;quot;junk&amp;quot; dimensions.&lt;/strong&gt; Low-cardinality flags and indicators (is_promotional, is_online, payment_type) can be combined into a single &amp;quot;junk dimension&amp;quot; instead of cluttering the fact table.&lt;/p&gt;
&lt;h2&gt;Conformed Dimensions&lt;/h2&gt;
&lt;p&gt;A conformed dimension is shared across multiple fact tables. The best example is the Date dimension — every fact table references dates, and they should all use the same date dimension to ensure consistent filtering and grouping.&lt;/p&gt;
&lt;p&gt;Other conformed dimensions: Customer, Product, Employee, Geography. When Sales and Support both reference the same dim_customers table, you can analyze customer behavior across both domains without reconciling different customer definitions.&lt;/p&gt;
&lt;p&gt;Conformed dimensions are the connective tissue of a dimensional model. Without them, each fact table exists in isolation.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; support dimensional modeling through virtual datasets. Fact and dimension views live in the Silver layer of a Medallion Architecture. Conformed dimensions are defined once and referenced by multiple fact views. Wikis document what each dimension attribute means, and AI agents use that documentation to generate accurate queries.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/05/conformed-dimensions.png&quot; alt=&quot;Conformed dimensions shared across multiple fact tables in a unified model&quot;&gt;&lt;/p&gt;
&lt;p&gt;Start your dimensional model with one business process — the one your team queries most. Declare the grain. Identify the dimensions. Build the fact table. Then expand: pick the next business process, reuse the conformed dimensions, and add new ones as needed.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Why Your AI Initiatives Fail Without a Semantic Layer</title><link>https://iceberglakehouse.com/posts/2026-02-sl-why-ai-fails-without-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-why-ai-fails-without-semantic-layer/</guid><description>
![AI with vs without a semantic layer — failure modes and fixes](/assets/images/semantic_layer/05/ai-semantic-layer.png)

Your team builds an AI agen...</description><pubDate>Wed, 18 Feb 2026 13:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/05/ai-semantic-layer.png&quot; alt=&quot;AI with vs without a semantic layer — failure modes and fixes&quot;&gt;&lt;/p&gt;
&lt;p&gt;Your team builds an AI agent. It connects to your data warehouse. A product manager types &amp;quot;What was revenue last quarter?&amp;quot; and gets a number. The number is wrong. Nobody knows it&apos;s wrong until Finance runs the same query manually and gets a different result.&lt;/p&gt;
&lt;p&gt;This happens constantly. And the problem isn&apos;t the AI model. It&apos;s the missing layer between the model and your data.&lt;/p&gt;
&lt;h2&gt;The Promise vs. the Reality&lt;/h2&gt;
&lt;p&gt;Natural language analytics is the most requested feature in every data platform survey. Business users want to ask questions in plain English and get accurate answers. No SQL. No tickets. No waiting.&lt;/p&gt;
&lt;p&gt;The technology exists. Large language models can generate SQL from natural language with impressive accuracy. But accuracy on syntax isn&apos;t accuracy on meaning. An LLM can write grammatically correct SQL that returns the wrong answer because it doesn&apos;t understand your business definitions.&lt;/p&gt;
&lt;p&gt;A semantic layer provides those definitions. Without one, AI analytics is a demonstration that works in a meeting but fails in production.&lt;/p&gt;
&lt;h2&gt;Five Ways AI Goes Wrong Without a Semantic Layer&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/05/ai-hallucination.png&quot; alt=&quot;AI agent confused by raw data — hallucinating metrics and joins&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Metric Hallucination&lt;/h3&gt;
&lt;p&gt;Your LLM decides that Revenue = &lt;code&gt;SUM(amount)&lt;/code&gt; from the &lt;code&gt;transactions&lt;/code&gt; table. But your actual Revenue formula is &lt;code&gt;SUM(order_total) WHERE status = &apos;completed&apos; AND refunded = FALSE&lt;/code&gt; from the &lt;code&gt;orders&lt;/code&gt; table. The AI&apos;s number is plausible. It&apos;s also wrong by 15%. Nobody catches it because it looks reasonable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Canonical metric definitions in virtual datasets. The AI references the view, not its own invented formula.&lt;/p&gt;
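&lt;p&gt;Concretely, the canonical formula lives in a view the agent is pointed at (a sketch built from the definition above):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- The agent queries this view; it never reconstructs the formula itself
CREATE VIEW business.revenue AS
SELECT SUM(order_total) AS revenue
FROM orders
WHERE status = &apos;completed&apos; AND refunded = FALSE;
&lt;/code&gt;&lt;/pre&gt;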
&lt;h3&gt;Join Confusion&lt;/h3&gt;
&lt;p&gt;There are three paths from &lt;code&gt;orders&lt;/code&gt; to &lt;code&gt;customers&lt;/code&gt;: via &lt;code&gt;customer_id&lt;/code&gt;, via &lt;code&gt;billing_address_id&lt;/code&gt;, and via &lt;code&gt;shipping_address_id&lt;/code&gt;. For revenue analysis, you want &lt;code&gt;customer_id&lt;/code&gt;. The LLM picks &lt;code&gt;billing_address_id&lt;/code&gt; because it seems logical. The numbers are close enough that the mistake survives review.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Pre-defined join relationships in the semantic model. The AI follows the approved path.&lt;/p&gt;
&lt;h3&gt;Column Misinterpretation&lt;/h3&gt;
&lt;p&gt;A column called &lt;code&gt;date&lt;/code&gt; appears in the &lt;code&gt;orders&lt;/code&gt; table. Is it the order date, ship date, or invoice date? The LLM guesses order date. It&apos;s actually the ship date. Every time-based query is off by 2-5 days.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Wiki descriptions on every column. The semantic layer tells the AI that &lt;code&gt;date&lt;/code&gt; is &lt;code&gt;ShipDate&lt;/code&gt; and &lt;code&gt;OrderDate&lt;/code&gt; is the field to use for time-based revenue analysis.&lt;/p&gt;
&lt;h3&gt;Security Bypass&lt;/h3&gt;
&lt;p&gt;Your BI dashboard applies row-level security so regional managers only see their region&apos;s data. The AI agent queries the raw table directly, bypassing the BI layer. A regional manager asks about &amp;quot;their&amp;quot; revenue and sees the entire company&apos;s numbers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Fine-Grained Access Control enforced at the semantic layer. The AI queries views, not raw tables. Security policies travel with the data regardless of the access path.&lt;/p&gt;
&lt;h3&gt;Inconsistent Results&lt;/h3&gt;
&lt;p&gt;The same question asked twice generates different SQL because the LLM&apos;s output is probabilistic. Monday&apos;s answer: $4.2M. Wednesday&apos;s answer: $4.5M. Both are &amp;quot;correct&amp;quot; given the SQL generated. Neither matches Finance&apos;s number.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: Deterministic definitions in the semantic layer. The same question always resolves to the same view, the same formula, the same result.&lt;/p&gt;
&lt;h2&gt;How a Semantic Layer Grounds AI&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/05/ai-with-context.png&quot; alt=&quot;AI agent successfully using a semantic layer to produce accurate results&quot;&gt;&lt;/p&gt;
&lt;p&gt;Each failure maps to a specific semantic layer component:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Failure Mode&lt;/th&gt;
&lt;th&gt;Semantic Layer Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metric hallucination&lt;/td&gt;
&lt;td&gt;Virtual datasets with canonical formulas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Join confusion&lt;/td&gt;
&lt;td&gt;Pre-defined join relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Column misinterpretation&lt;/td&gt;
&lt;td&gt;Wiki descriptions on every field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security bypass&lt;/td&gt;
&lt;td&gt;Access policies enforced at the view level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inconsistent results&lt;/td&gt;
&lt;td&gt;Deterministic definitions (same question = same SQL)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This is why platforms that take AI analytics seriously embed the semantic layer directly into the query engine. &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&apos;s approach&lt;/a&gt; combines virtual datasets, Wikis, Labels, and Fine-Grained Access Control into a single layer that both humans and AI agents consume. The AI doesn&apos;t just generate SQL. It consults the semantic layer to understand what the data means, which formulas to use, and what the querying user is allowed to see.&lt;/p&gt;
&lt;h2&gt;What AI-Ready Architecture Looks Like&lt;/h2&gt;
&lt;p&gt;An AI-ready data platform doesn&apos;t just connect an LLM to a database. It puts a structured context layer in between:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Semantic layer&lt;/strong&gt; defines metrics, documents columns, and enforces security&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI agent&lt;/strong&gt; reads the semantic layer to understand business context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query engine&lt;/strong&gt; executes the AI-generated SQL with full optimization (caching, reflections, pushdowns)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Results&lt;/strong&gt; are returned in business terms through the same interface humans use&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Without step 1, the AI is just a SQL autocomplete tool with no business understanding. It writes syntactically valid queries that produce semantically wrong answers. The semantic layer is the difference between a toy demo and a production-grade AI analytics system.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;If your AI analytics initiative is producing unreliable results, don&apos;t upgrade the model. Audit the context the model has access to. Can it read your metric definitions? Column descriptions? Security policies? If the answer is no, the fix isn&apos;t a better LLM. It&apos;s a semantic layer.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Idempotent Pipelines: Build Once, Run Safely Forever</title><link>https://iceberglakehouse.com/posts/2026-02-debp-idempotent-pipelines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-idempotent-pipelines/</guid><description>
![Pipeline running multiple times and converging to the same result](/assets/images/debp/04/idempotent-pipeline.png)

A pipeline runs, processes 100,...</description><pubDate>Wed, 18 Feb 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/04/idempotent-pipeline.png&quot; alt=&quot;Pipeline running multiple times and converging to the same result&quot;&gt;&lt;/p&gt;
&lt;p&gt;A pipeline runs, processes 100,000 records, and loads them into the target table. Then it fails on a downstream step. The orchestrator retries the entire job. Now the table has 200,000 records — 100,000 of them duplicates. Revenue reports double. Dashboards misfire. Someone spends the next four hours manually deduplicating records and explaining to stakeholders why the numbers were wrong.&lt;/p&gt;
&lt;p&gt;This is the cost of not building idempotent pipelines.&lt;/p&gt;
&lt;h2&gt;What Idempotency Means for Pipelines&lt;/h2&gt;
&lt;p&gt;An idempotent operation produces the same result no matter how many times you execute it. For data pipelines, that means: running the same job twice — or ten times — leaves the target data in the exact same state as running it once.&lt;/p&gt;
&lt;p&gt;This property matters because retries are inevitable. Orchestrators retry failed tasks. Backfill jobs reprocess historical data. Network glitches cause at-least-once delivery. Engineers manually rerun jobs during debugging. Without idempotency, every one of these events risks data corruption.&lt;/p&gt;
&lt;p&gt;Idempotency is not about preventing retries. It&apos;s about making retries safe.&lt;/p&gt;
&lt;h2&gt;The Partition Overwrite Pattern&lt;/h2&gt;
&lt;p&gt;The simplest and most reliable idempotency pattern for batch pipelines: overwrite the entire partition.&lt;/p&gt;
&lt;p&gt;Instead of appending rows, your pipeline replaces the complete partition for the time period being processed. For a daily pipeline processing January 15th:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Delete existing data for this partition
DELETE FROM target_table WHERE event_date = &apos;2024-01-15&apos;;

-- Insert fresh data for this partition
INSERT INTO target_table
SELECT * FROM staging_table WHERE event_date = &apos;2024-01-15&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the job reruns, it deletes and recreates the same partition — resulting in the same data. Many table formats support INSERT OVERWRITE or REPLACE PARTITION as an atomic operation, which is even safer because it avoids a window where the partition is empty.&lt;/p&gt;
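&lt;p&gt;With an engine that supports atomic overwrite, the two-step delete-and-insert collapses into a single statement. A sketch in Spark-style SQL, with illustrative table names:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Atomically replace the partition: no window where it sits empty
INSERT OVERWRITE target_table
SELECT * FROM staging_table WHERE event_date = &apos;2024-01-15&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With dynamic partition overwrite enabled, only the partitions present in the incoming data are replaced, so a rerun converges to the same state.&lt;/p&gt;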
&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; Daily, hourly, or other time-partitioned batch pipelines. This covers the majority of data warehouse loading patterns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; You need a clear partitioning key. For non-time-series data, partition overwrite may not apply.&lt;/p&gt;
&lt;h2&gt;The Upsert/MERGE Pattern&lt;/h2&gt;
&lt;p&gt;For data that doesn&apos;t partition cleanly — or for change data capture (CDC) workloads — use MERGE (also called upsert):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE INTO target_table t
USING staging_table s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  t.status = s.status,
  t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
VALUES (s.order_id, s.status, s.updated_at);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/04/merge-pattern.png&quot; alt=&quot;Merge pattern: staging records matched against target by business key, updating or inserting&quot;&gt;&lt;/p&gt;
&lt;p&gt;If the merge runs twice with the same staging data, the result is identical. Existing records update to the same values, and rows inserted on the first run simply match on the second run instead of being inserted again.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; CDC pipelines, entity-centric data (customers, products, accounts), and slowly changing dimensions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Requirement:&lt;/strong&gt; A reliable business key that uniquely identifies each record. Without one, merges produce inconsistent results.&lt;/p&gt;
&lt;h2&gt;Event Deduplication for Streaming&lt;/h2&gt;
&lt;p&gt;Streaming systems typically guarantee at-least-once delivery, which means the same event can be delivered and processed multiple times. Your pipeline needs to handle this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Process-level deduplication.&lt;/strong&gt; Maintain a set (in-memory, in a key-value store, or in the target database) of recently processed event IDs. Before processing each event, check if its ID has been seen. Skip duplicates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Write-level deduplication.&lt;/strong&gt; Use MERGE or conditional INSERT:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO events (event_id, payload, processed_at)
SELECT event_id, payload, NOW()
FROM incoming_events
WHERE event_id NOT IN (SELECT event_id FROM events);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Windowed deduplication.&lt;/strong&gt; For high-volume streams, maintain dedup state only for a window (e.g., last 24 hours). Events outside the window are assumed to be unique — a practical tradeoff between memory usage and dedup completeness.&lt;/p&gt;
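&lt;p&gt;A write-level version of windowed deduplication can be sketched in SQL — only the last 24 hours of target rows are consulted, which keeps the anti-join cheap (column and table names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Insert only events not seen within the dedup window
INSERT INTO events (event_id, payload, processed_at)
SELECT i.event_id, i.payload, NOW()
FROM incoming_events i
WHERE NOT EXISTS (
  SELECT 1 FROM events e
  WHERE e.event_id = i.event_id
    AND e.processed_at &gt;= NOW() - INTERVAL &apos;24&apos; HOUR
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using NOT EXISTS rather than NOT IN also sidesteps the NULL-handling pitfalls of NOT IN subqueries.&lt;/p&gt;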
&lt;h2&gt;Anti-Patterns That Break Idempotency&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Blind INSERT/APPEND.&lt;/strong&gt; Every retry adds duplicate rows. This is the default behavior in most systems and the most common cause of data inflation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Auto-incrementing surrogate keys.&lt;/strong&gt; If your pipeline generates IDs at processing time (not from the source data), duplicates get different IDs and look like distinct records.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Timestamps as dedup keys.&lt;/strong&gt; Using &lt;code&gt;processed_at&lt;/code&gt; as part of the primary key means the same source record processed at different times produces different target records.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;We&apos;ll dedup later.&amp;quot;&lt;/strong&gt; Deferring deduplication to a cleanup job means every consumer between the load and the cleanup sees dirty data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/04/idempotency-antipatterns.png&quot; alt=&quot;Anti-patterns: blind append creating duplicates, timestamp-based keys, deferred dedup&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Identify your five most frequently retried or backfilled pipelines. Check whether they use INSERT or MERGE. If they use INSERT, switch to partition overwrite or MERGE. Run the pipeline twice intentionally and verify the target table has the same row count both times. That&apos;s your idempotency test.&lt;/p&gt;
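&lt;p&gt;A quick audit to run after the intentional double execution — if the pipeline is idempotent, this returns zero rows (swap in your own business key):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Any business key appearing more than once signals a non-idempotent load
SELECT order_id, COUNT(*) AS copies
FROM target_table
GROUP BY order_id
HAVING COUNT(*) &gt; 1;
&lt;/code&gt;&lt;/pre&gt;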
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Modeling for the Lakehouse: What Changes</title><link>https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-lakehouse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-data-modeling-lakehouse/</guid><description>
![Traditional data warehouse model vs. open lakehouse model with flexible schema and views](/assets/images/data_modeling/04/lakehouse-data-modeling.p...</description><pubDate>Wed, 18 Feb 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/04/lakehouse-data-modeling.png&quot; alt=&quot;Traditional data warehouse model vs. open lakehouse model with flexible schema and views&quot;&gt;&lt;/p&gt;
&lt;p&gt;Traditional data modeling assumed you controlled the database. You defined schemas up front, enforced foreign keys at write time, and optimized with indexes. The lakehouse changes every one of those assumptions.&lt;/p&gt;
&lt;p&gt;Data lives in open file formats on object storage. Schemas evolve without rewriting data. Queries run through engines that may not enforce relational constraints. The modeling discipline is the same, but the mechanics are different.&lt;/p&gt;
&lt;h2&gt;What&apos;s Different About a Lakehouse&lt;/h2&gt;
&lt;p&gt;A lakehouse stores data as files — typically Parquet — on object storage like S3 or Azure Blob. An open table format like Apache Iceberg adds structure: schema definitions, partition metadata, snapshot history, and transactional guarantees.&lt;/p&gt;
&lt;p&gt;This architecture gives you more flexibility than a traditional RDBMS, but also more responsibility. There are no foreign key constraints enforced at write time. No triggers. No stored procedures. Referential integrity is your problem to solve in pipelines and views, not something the storage engine handles for you.&lt;/p&gt;
&lt;p&gt;The tradeoff is worth it: open formats, engine portability, cheap storage, and the ability to run multiple compute engines (Spark, Dremio, Flink, Trino) against the same data.&lt;/p&gt;
&lt;h2&gt;Schema-on-Read Changes the Rules&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/04/schema-on-read.png&quot; alt=&quot;Schema-on-write rigid table vs. schema-on-read flexible view layer&quot;&gt;&lt;/p&gt;
&lt;p&gt;In a traditional warehouse, you define the schema before writing data (schema-on-write). Every row must conform to the schema or the insert fails. This guarantees consistency but makes changes expensive. Adding a column means an ALTER TABLE. Changing a data type might require rewriting the entire table.&lt;/p&gt;
&lt;p&gt;In a lakehouse, you can also store data first and apply structure at query time (schema-on-read). Iceberg supports schema evolution natively — add columns, rename columns, widen data types, and reorder fields without rewriting underlying files.&lt;/p&gt;
&lt;p&gt;This flexibility changes how you model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bronze layer&lt;/strong&gt;: Accept data as-is from sources. Apply minimal typing. Don&apos;t reject records that don&apos;t match a rigid schema.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silver layer&lt;/strong&gt;: Apply business logic, joins, and type enforcement through SQL views.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gold layer&lt;/strong&gt;: Serve consumption-ready datasets with stable, documented schemas.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The model evolves at the view layer, not the storage layer. This makes iteration faster and migration cheaper.&lt;/p&gt;
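&lt;p&gt;A minimal sketch of that layering as views — the schema, table, and column names are hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Bronze: rename and type raw source columns, reject nothing
CREATE VIEW bronze.orders AS
SELECT CAST(ord_id AS BIGINT)      AS order_id,
       CAST(ord_dt AS DATE)        AS order_date,
       CAST(amt AS DECIMAL(12, 2)) AS amount,
       status
FROM raw.src_orders;

-- Silver: apply business rules on top of Bronze
CREATE VIEW silver.completed_orders AS
SELECT order_id, order_date, amount
FROM bronze.orders
WHERE status = &apos;completed&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Changing a business rule means editing the Silver view, not rewriting stored files — that is what makes iteration cheap.&lt;/p&gt;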
&lt;h2&gt;The Medallion Architecture as a Modeling Pattern&lt;/h2&gt;
&lt;p&gt;The Medallion Architecture (Bronze → Silver → Gold) is the most common data modeling pattern in lakehouse environments. Each layer is a set of SQL views or managed tables:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bronze (Preparation):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maps raw source data to typed columns&lt;/li&gt;
&lt;li&gt;Renames ambiguous column names&lt;/li&gt;
&lt;li&gt;Applies basic data type casting&lt;/li&gt;
&lt;li&gt;One view per source table&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Silver (Business Logic):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Joins related entities (orders + customers + products)&lt;/li&gt;
&lt;li&gt;Applies business rules (revenue = quantity × price WHERE status = &apos;completed&apos;)&lt;/li&gt;
&lt;li&gt;Filters invalid or duplicate records&lt;/li&gt;
&lt;li&gt;Implements the logical data model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Gold (Application):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tailored views for specific use cases&lt;/li&gt;
&lt;li&gt;Executive dashboards, Sales reports, AI agent context&lt;/li&gt;
&lt;li&gt;Minimal transformation — mostly selecting from Silver views&lt;/li&gt;
&lt;li&gt;Documented with business-friendly names and descriptions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt;, these layers are implemented as virtual datasets (SQL views) organized in Spaces. Each view is documented with Wikis, tagged with Labels, and governed with Fine-Grained Access Control. The logical model lives in the platform, not in scattered dbt files or tribal knowledge.&lt;/p&gt;
&lt;h2&gt;Physical Modeling for Iceberg Tables&lt;/h2&gt;
&lt;p&gt;When you do create physical Iceberg tables (as opposed to views), the modeling considerations differ from traditional RDBMS:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partitioning matters more than indexing.&lt;/strong&gt; Iceberg uses partition pruning instead of traditional B-tree indexes. Choose partition columns based on your most common query filters — typically date columns. Iceberg&apos;s hidden partitioning means users don&apos;t need to know the partition scheme to write efficient queries.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sort order affects scan performance.&lt;/strong&gt; Within each partition, Iceberg can sort data by specified columns. Sorting by a frequently filtered column (like &lt;code&gt;customer_id&lt;/code&gt; or &lt;code&gt;region&lt;/code&gt;) enables min/max pruning that skips irrelevant files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compaction replaces vacuum.&lt;/strong&gt; Small files accumulate from streaming inserts. Regular compaction rewrites many small files into fewer large files, improving scan performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema evolution is non-destructive.&lt;/strong&gt; Adding a column to an Iceberg table doesn&apos;t rewrite existing files. Old files return &lt;code&gt;null&lt;/code&gt; for the new column. This makes the physical model more adaptable than traditional databases.&lt;/p&gt;
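&lt;p&gt;These physical choices show up directly in the DDL. A sketch using Spark SQL with the Iceberg extensions, with illustrative names:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hidden partitioning: users filter on event_ts, not a derived partition column
CREATE TABLE lakehouse.events (
  event_id  BIGINT,
  event_ts  TIMESTAMP,
  region    STRING,
  payload   STRING
) USING iceberg
PARTITIONED BY (day(event_ts));

-- Sort within partitions so min/max stats can prune files on region
ALTER TABLE lakehouse.events WRITE ORDERED BY region;
&lt;/code&gt;&lt;/pre&gt;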
&lt;h2&gt;Challenges to Watch For&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;No referential integrity enforcement.&lt;/strong&gt; The lakehouse won&apos;t stop you from inserting an order with a &lt;code&gt;customer_id&lt;/code&gt; that doesn&apos;t exist in the customers table. Build data quality checks in your pipelines.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema drift across sources.&lt;/strong&gt; When sources change their schemas unexpectedly, your Bronze layer must handle it. Design Bronze views to be tolerant of new or missing columns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-reliance on views.&lt;/strong&gt; Views are powerful, but deeply nested views (View D reads from View C reads from View B reads from View A) create performance and debugging challenges. Keep the chain to three levels when possible.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/04/medallion-architecture.png&quot; alt=&quot;Layered view architecture from raw data through business logic to consumption-ready outputs&quot;&gt;&lt;/p&gt;
&lt;p&gt;If you&apos;re moving from a traditional warehouse to a lakehouse, start by recreating your most-used tables as Iceberg tables and your most-used transformations as SQL views. Organize those views into Bronze, Silver, and Gold layers. Measure whether query performance meets your SLAs — and if it doesn&apos;t, add Reflections to optimize the heavy queries without changing the logical model.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Semantic Layer vs. Data Catalog: Complementary, Not Competing</title><link>https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-vs-data-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-vs-data-catalog/</guid><description>
![Data catalog and semantic layer — complementary systems bridged together](/assets/images/semantic_layer/04/catalog-vs-semantic.png)

&quot;We already ha...</description><pubDate>Wed, 18 Feb 2026 12:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/04/catalog-vs-semantic.png&quot; alt=&quot;Data catalog and semantic layer — complementary systems bridged together&quot;&gt;&lt;/p&gt;
&lt;p&gt;&amp;quot;We already have a data catalog, so we don&apos;t need a semantic layer.&amp;quot; This is one of the most common misconceptions in modern data architecture. Catalogs and semantic layers both deal with metadata. They both improve data accessibility. But they solve fundamentally different problems.&lt;/p&gt;
&lt;p&gt;Swapping one for the other leaves a critical gap in your stack.&lt;/p&gt;
&lt;h2&gt;What a Data Catalog Does&lt;/h2&gt;
&lt;p&gt;A data catalog is a searchable inventory of your organization&apos;s data assets. Think of it as a library card system for data. It tells you what data exists, where it lives, who owns it, and how it flows through your systems.&lt;/p&gt;
&lt;p&gt;Key functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Discovery&lt;/strong&gt;: Find tables, views, files, and dashboards by searching keywords, tags, or owners&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lineage&lt;/strong&gt;: Trace how data moves from source to destination, including every transformation along the way&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Governance metadata&lt;/strong&gt;: Track data quality scores, classification (PII, confidential), and compliance status&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: Store descriptions of assets, often crowd-sourced from data producers and consumers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A data catalog is fundamentally a &lt;strong&gt;passive system&lt;/strong&gt;. You search it, browse it, and read from it. It doesn&apos;t change how queries execute or how metrics are calculated. It organizes information &lt;em&gt;about&lt;/em&gt; data.&lt;/p&gt;
&lt;h2&gt;What a Semantic Layer Does&lt;/h2&gt;
&lt;p&gt;A semantic layer defines what data &lt;strong&gt;means&lt;/strong&gt; and how to &lt;strong&gt;use it correctly&lt;/strong&gt;. It&apos;s an active system that sits between your raw data and the tools querying it.&lt;/p&gt;
&lt;p&gt;Key functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Metric definitions&lt;/strong&gt;: Revenue, Churn Rate, Active Users — calculated one way, everywhere&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query translation&lt;/strong&gt;: Converts business questions into optimized SQL&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access enforcement&lt;/strong&gt;: Row-level security and column masking applied at query time&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: Wikis and labels attached to views and columns&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A semantic layer &lt;strong&gt;actively participates&lt;/strong&gt; in every query. When a user asks &amp;quot;What was revenue by region?&amp;quot;, the semantic layer translates &amp;quot;revenue&amp;quot; into the correct SQL formula, joins the right tables, applies security filters, and returns the result.&lt;/p&gt;
&lt;h2&gt;Side-by-Side Comparison&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/04/catalog-vs-semantic-action.png&quot; alt=&quot;Data catalog vs. semantic layer in action — search vs. query&quot;&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Data Catalog&lt;/th&gt;
&lt;th&gt;Semantic Layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary question answered&lt;/td&gt;
&lt;td&gt;&amp;quot;What data do we have?&amp;quot;&lt;/td&gt;
&lt;td&gt;&amp;quot;What does this data mean?&amp;quot;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System behavior&lt;/td&gt;
&lt;td&gt;Passive (search &amp;amp; browse)&lt;/td&gt;
&lt;td&gt;Active (query translation)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;All metadata across assets&lt;/td&gt;
&lt;td&gt;Business definitions, metrics, security&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lineage&lt;/td&gt;
&lt;td&gt;Tracks data flow&lt;/td&gt;
&lt;td&gt;Defines calculation logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query execution&lt;/td&gt;
&lt;td&gt;Does not execute queries&lt;/td&gt;
&lt;td&gt;Translates and optimizes queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access control&lt;/td&gt;
&lt;td&gt;Documents policies&lt;/td&gt;
&lt;td&gt;Enforces policies at query time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The catalog tells you a table called &lt;code&gt;orders&lt;/code&gt; exists in the &lt;code&gt;production&lt;/code&gt; schema. The semantic layer tells you that &amp;quot;Revenue&amp;quot; means &lt;code&gt;SUM(orders.total) WHERE status = &apos;completed&apos;&lt;/code&gt;, joins it to &lt;code&gt;customers&lt;/code&gt; on &lt;code&gt;customer_id&lt;/code&gt;, and filters results based on the querying user&apos;s role.&lt;/p&gt;
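&lt;p&gt;That contrast can be made concrete. The catalog entry is a row of metadata; the semantic layer definition is executable SQL — sketched here with illustrative names:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- What the semantic layer adds: an executable, governed definition of Revenue
CREATE VIEW sales.revenue_by_region AS
SELECT c.region, SUM(o.total) AS revenue
FROM production.orders o
JOIN production.customers c ON o.customer_id = c.customer_id
WHERE o.status = &apos;completed&apos;
GROUP BY c.region;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Row-level security would attach to this view at query time, filtering results by the querying user&apos;s role — something a catalog can document but never enforce.&lt;/p&gt;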
&lt;h2&gt;Why You Need Both&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;A catalog without a semantic layer&lt;/strong&gt;: Users find data but don&apos;t know how to use it correctly. They discover the &lt;code&gt;orders&lt;/code&gt; table but write their own revenue formula, which differs from the formula Finance uses. Data is discoverable but inconsistently interpreted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A semantic layer without a catalog&lt;/strong&gt;: Users get accurate, governed queries for the datasets the semantic layer covers. But they can&apos;t discover datasets outside the layer. New data sources, experimental tables, and raw files remain invisible until someone manually adds views.&lt;/p&gt;
&lt;p&gt;The best architectures integrate both. The catalog handles discovery and lineage across &lt;em&gt;everything&lt;/em&gt;. The semantic layer handles meaning, calculation, and governance for the business-critical datasets that drive decisions.&lt;/p&gt;
&lt;h2&gt;What Integration Looks Like&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/04/catalog-architecture.png&quot; alt=&quot;Catalog and semantic layer combined in an integrated architecture&quot;&gt;&lt;/p&gt;
&lt;p&gt;An integrated system gives you a single interface where data discovery and business context exist side by side. You search the catalog to find a dataset. You see its semantic layer definition — the metric formulas, documentation, labels, and access policies — alongside the catalog metadata (lineage, quality, ownership).&lt;/p&gt;
&lt;p&gt;Dremio achieves this with its &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-delivers-an-apache-iceberg-lakehouse-without-the-headaches/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Open Catalog&lt;/a&gt; (built on Apache Polaris, the open-source Iceberg REST catalog standard) combined with its semantic layer features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open Catalog&lt;/strong&gt; provides the inventory: tables, views, sources, and their lineage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Virtual datasets&lt;/strong&gt; (SQL views) define business logic and metric calculations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wikis&lt;/strong&gt; document what each dataset and column means&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Labels&lt;/strong&gt; tag data for governance and discoverability (PII, Finance, Certified)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FGAC&lt;/strong&gt; enforces row/column security at query time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AI agents benefit from this integration directly. They use the catalog to navigate available datasets (what tables exist in the &amp;quot;Sales&amp;quot; space?) and the semantic layer to generate accurate queries (what does &amp;quot;Revenue&amp;quot; mean, and who can see which rows?). Remove either piece, and the AI is either blind to available data or confidently generating wrong SQL.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Open your current data catalog and pick a business-critical table. Can you see how its key metric is calculated? Who can access which rows? What the column names mean in business terms? If the catalog only shows you &lt;em&gt;that the table exists&lt;/em&gt;, you&apos;ve identified the gap a semantic layer fills.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Quality Is a Pipeline Problem, Not a Dashboard Problem</title><link>https://iceberglakehouse.com/posts/2026-02-debp-data-quality-first/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-data-quality-first/</guid><description>
![Data quality checks enforced at the pipeline validation stage before data reaches consumers](/assets/images/debp/03/data-quality-pipeline.png)

Whe...</description><pubDate>Wed, 18 Feb 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/03/data-quality-pipeline.png&quot; alt=&quot;Data quality checks enforced at the pipeline validation stage before data reaches consumers&quot;&gt;&lt;/p&gt;
&lt;p&gt;When an analyst finds null values in a revenue column, the typical response is to add a calculated field in the BI tool: &lt;code&gt;IF revenue IS NULL THEN 0&lt;/code&gt;. That &amp;quot;fix&amp;quot; doesn&apos;t fix anything. It masks a problem at the source — and every downstream consumer has to independently discover and patch the same issue.&lt;/p&gt;
&lt;p&gt;Data quality is a pipeline problem. It should be enforced where data enters your system, not where it exits as a chart.&lt;/p&gt;
&lt;h2&gt;The Dashboard Isn&apos;t Where Quality Gets Fixed&lt;/h2&gt;
&lt;p&gt;Quality problems that surface in dashboards have already propagated through every layer of your stack: raw tables, transformed models, aggregations, caches, and API endpoints. By the time an analyst spots a zero-revenue row, the bad record has been used to train ML models, trigger automated alerts, and populate executive reports.&lt;/p&gt;
&lt;p&gt;Fixing quality at the point of consumption is reactive, fragmented, and unrepeatable. Every team applies different patches. Every new consumer rediscovers the same problems.&lt;/p&gt;
&lt;p&gt;Fixing quality at the point of ingestion is proactive, centralized, and consistent. Every downstream consumer benefits from the same validated data.&lt;/p&gt;
&lt;h2&gt;Six Dimensions of Data Quality&lt;/h2&gt;
&lt;p&gt;Not all quality problems are the same. Categorizing them helps you build targeted checks:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Completeness.&lt;/strong&gt; Are required fields populated? A customer record missing an email address might be acceptable. A transaction record missing an amount is not.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Accuracy.&lt;/strong&gt; Do values reflect reality? An age of 250 is syntactically valid but factually wrong. Accuracy checks require domain knowledge and range validation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Consistency.&lt;/strong&gt; Do the same facts agree across sources? If your CRM says a customer is in Texas and your billing system says California, you have a consistency problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Timeliness.&lt;/strong&gt; Did the data arrive when expected? A daily feed that arrives 6 hours late might still be correct — but any dashboards refreshed before it arrived showed stale numbers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Uniqueness.&lt;/strong&gt; Are there duplicate records? Double-counted revenue is worse than no revenue. Deduplication on business keys (order ID, event ID) is essential.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Validity.&lt;/strong&gt; Do values conform to expected formats and ranges? Dates in the future, negative quantities, email addresses without @ signs — structural validation catches these before they corrupt downstream logic.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/03/quality-dimensions.png&quot; alt=&quot;Six dimensions: completeness, accuracy, consistency, timeliness, uniqueness, validity&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Enforce Quality Inside the Pipeline&lt;/h2&gt;
&lt;p&gt;Add a validation stage between ingestion and transformation. This stage checks every record against defined quality rules and routes failures to a quarantine table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema validation.&lt;/strong&gt; Check column names, data types, and required vs. optional fields. If the source adds or removes a column, catch it here — not when a transformation SQL query fails.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Range and format checks.&lt;/strong&gt; Ensure numeric values fall within expected ranges (0 ≤ price ≤ 1,000,000). Validate date formats, email patterns, and enum values against allowed lists.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Referential checks.&lt;/strong&gt; Verify that foreign key values exist in their reference tables. An order referencing a non-existent customer ID means either the order is invalid or the customer pipeline is behind.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Volume checks.&lt;/strong&gt; Compare the row count of the incoming batch against historical baselines. A daily feed that usually delivers 50,000 rows but arrives with 500 rows should trigger an alert, not proceed silently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Freshness checks.&lt;/strong&gt; Validate that event timestamps fall within the expected window. A batch of events all timestamped from three days ago may indicate a delayed replay, not current data.&lt;/p&gt;
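&lt;p&gt;A validation stage built from these checks fits in a few dozen lines. The schema contract, ranges, and baseline below are invented placeholders, not a real source definition; referential checks are omitted because they need the reference table at hand:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema contract: column name to required Python type.
REQUIRED = {"order_id": int, "price": float, "event_time": datetime}

def check_schema(rec):
    for col, typ in REQUIRED.items():
        if col not in rec or not isinstance(rec[col], typ):
            return f"schema: missing or mistyped column {col}"
    return None

def check_range(rec):
    if rec["price"] >= 0 and 1_000_000 >= rec["price"]:
        return None
    return "range: price outside 0..1000000"

def check_freshness(rec, now, window_hours=48):
    if timedelta(hours=window_hours) >= now - rec["event_time"]:
        return None
    return "freshness: event_time outside expected window"

def validate_batch(batch, baseline_rows, now):
    # Volume check runs once per batch; the other checks run per record.
    if baseline_rows * 0.1 > len(batch):
        raise RuntimeError("volume: batch far below historical baseline")
    passed, quarantined = [], []
    for rec in batch:
        err = check_schema(rec)
        # If the schema check fails, skip checks that assume the columns exist.
        errors = [err] if err else [e for e in (check_range(rec),
                                                check_freshness(rec, now)) if e]
        (quarantined if errors else passed).append((rec, errors))
    return passed, quarantined
```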
&lt;h2&gt;Quarantine, Don&apos;t Drop&lt;/h2&gt;
&lt;p&gt;When a record fails validation, don&apos;t drop it. Route it to a quarantine table with metadata: which check failed, when, and the original record content.&lt;/p&gt;
&lt;p&gt;Dropping bad records silently creates invisible data loss. Your row counts won&apos;t match, your aggregations will undercount, and no one will know why.&lt;/p&gt;
&lt;p&gt;Quarantined records give you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Visibility.&lt;/strong&gt; You know how many records failed and why.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recovery.&lt;/strong&gt; When the quality rule was too strict (false positive), you can reprocess quarantined records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root cause analysis.&lt;/strong&gt; Patterns in quarantine (e.g., all failures from one source) help you fix the actual problem upstream.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accountability.&lt;/strong&gt; You can report quality rates per source, per pipeline, per day.&lt;/li&gt;
&lt;/ul&gt;
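&lt;p&gt;A quarantine write is just the failed record plus that metadata. A toy Python sketch, with a list standing in for the quarantine table and invented check names:&lt;/p&gt;

```python
import json
from collections import Counter
from datetime import datetime, timezone

# The quarantine "table" is a plain list here; in practice it would be a
# real table (Iceberg, warehouse, etc.). Sources and checks are invented.
quarantine = []

def quarantine_record(record, failed_check, source):
    quarantine.append({
        "failed_check": failed_check,
        "failed_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "payload": json.dumps(record),  # original content, kept replayable
    })

def failure_breakdown():
    # Clusters (e.g., every failure from one source) point at the upstream cause.
    return Counter((row["source"], row["failed_check"]) for row in quarantine)

quarantine_record({"order_id": 7}, "completeness: amount missing", "crm_feed")
quarantine_record({"order_id": 9}, "completeness: amount missing", "crm_feed")
```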
&lt;h2&gt;Track Quality Like You Track Uptime&lt;/h2&gt;
&lt;p&gt;Pipeline monitoring typically covers: did the job run? Did it succeed? How long did it take? Quality monitoring adds: how many records passed validation? What percentage failed? Which checks triggered the most failures?&lt;/p&gt;
&lt;p&gt;Build quality metrics into your monitoring dashboards:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pass/fail ratio&lt;/strong&gt; per pipeline, per day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failure breakdown&lt;/strong&gt; by quality dimension (completeness, accuracy, etc.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trend lines&lt;/strong&gt; to catch gradual degradation before it becomes critical&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SLA tracking&lt;/strong&gt; for freshness and completeness targets&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/03/quality-monitoring.png&quot; alt=&quot;Quality monitoring: pass/fail ratios, trend lines, and SLA tracking alongside pipeline metrics&quot;&gt;&lt;/p&gt;
&lt;p&gt;Alert on quality regressions the same way you alert on pipeline failures. A pipeline that runs successfully but produces 30% invalid records is worse than one that fails outright — because it&apos;s silently wrong.&lt;/p&gt;
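&lt;p&gt;A pass-rate metric with an SLA alert takes only a few lines, assuming you already log per-day pass/fail counts. The threshold and counts below are made up:&lt;/p&gt;

```python
# Daily pass-rate tracking with a simple SLA alert. The 95 percent threshold
# and the history below are illustrative, not a recommendation.
def pass_rate(passed, failed):
    total = passed + failed
    return passed / total if total else 1.0

def quality_alerts(daily_counts, min_rate=0.95):
    alerts = []
    for day, passed, failed in daily_counts:
        rate = pass_rate(passed, failed)
        if min_rate > rate:
            alerts.append(f"{day}: pass rate {rate:.1%} under SLA {min_rate:.0%}")
    return alerts

history = [("2026-02-16", 49_500, 500),     # healthy day
           ("2026-02-17", 34_000, 16_000)]  # ran "successfully", silently wrong
alerts = quality_alerts(history)
```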
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Audit your most important pipeline. Add a validation stage with checks for completeness, uniqueness, and volume. Route failures to a quarantine table. Within a week, you&apos;ll know more about your data quality than any dashboard could tell you.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Star Schema vs. Snowflake Schema: When to Use Each</title><link>https://iceberglakehouse.com/posts/2026-02-dm-star-schema-vs-snowflake/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-star-schema-vs-snowflake/</guid><description>
![Star schema with central fact table surrounded by denormalized dimension tables](/assets/images/data_modeling/03/star-vs-snowflake.png)

Both star ...</description><pubDate>Wed, 18 Feb 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/03/star-vs-snowflake.png&quot; alt=&quot;Star schema with central fact table surrounded by denormalized dimension tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;Both star schemas and snowflake schemas are dimensional models. They both organize data into fact tables (measurable events) and dimension tables (context about those events). The difference is how they structure the dimensions.&lt;/p&gt;
&lt;p&gt;That structural difference affects query performance, storage efficiency, SQL complexity, and how easily BI tools and AI agents can interpret your data. Here&apos;s how to choose.&lt;/p&gt;
&lt;h2&gt;The Two Patterns of Dimensional Modeling&lt;/h2&gt;
&lt;p&gt;Dimensional modeling separates data into two types:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fact tables&lt;/strong&gt; store measurable events — a sale, a page view, a shipment, a login. Each row represents one event. Columns include numeric measures (revenue, quantity, duration) and foreign keys pointing to dimension tables.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dimension tables&lt;/strong&gt; provide context for facts — who (customer), what (product), when (date), where (location), how (channel). Dimensions describe the &amp;quot;business words&amp;quot; people use to filter, group, and label their analysis.&lt;/p&gt;
&lt;p&gt;Star and snowflake schemas differ in how they organize those dimension tables.&lt;/p&gt;
&lt;h2&gt;Star Schema: Denormalized Dimensions&lt;/h2&gt;
&lt;p&gt;In a star schema, each dimension is a single, denormalized table. All attributes for a dimension live in one place.&lt;/p&gt;
&lt;p&gt;A product dimension contains the product name, category, subcategory, department, and brand — all in one table. This means some values repeat. Every product in the &amp;quot;Electronics&amp;quot; category stores the string &amp;quot;Electronics&amp;quot; in its row.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fewer joins per query. A typical star schema query joins the fact table to 3-5 dimension tables. That&apos;s it.&lt;/li&gt;
&lt;li&gt;Simpler SQL. Analysts write shorter, more readable queries.&lt;/li&gt;
&lt;li&gt;Faster query performance. Fewer joins means less work for the query engine.&lt;/li&gt;
&lt;li&gt;Better BI tool compatibility. Most BI tools expect star schemas and generate optimal SQL against them.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; Data redundancy in dimensions. If the &amp;quot;Electronics&amp;quot; department changes its name, you update it in every row that references it.&lt;/p&gt;
&lt;h2&gt;Snowflake Schema: Normalized Dimensions&lt;/h2&gt;
&lt;p&gt;In a snowflake schema, dimensions are normalized into sub-tables. Instead of one product dimension, you have separate tables for Product, Category, Subcategory, and Department, linked by foreign keys.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/03/snowflake-schema-detail.png&quot; alt=&quot;Snowflake schema with fact table and normalized, branching dimension tables&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less storage redundancy. Each value stored once. &amp;quot;Electronics&amp;quot; appears in one row of the Department table.&lt;/li&gt;
&lt;li&gt;Single source of truth per attribute. Rename a department in one row instead of thousands.&lt;/li&gt;
&lt;li&gt;Aligns with OLTP normalization practices. Familiar to engineers coming from transactional database backgrounds.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Tradeoff:&lt;/strong&gt; More joins per query. A query that would join 4 tables in a star schema might join 8-12 tables in a snowflake schema. SQL gets longer, more complex, and harder for analysts to write without help.&lt;/p&gt;
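&lt;p&gt;The join-count difference is easy to demonstrate with Python&apos;s built-in sqlite3 module. This toy example (all table and column names invented) answers &quot;revenue by department&quot; with one dimension join in the star layout and three in the snowflake layout:&lt;/p&gt;

```python
import sqlite3

# Toy schemas: the same question needs 1 join (star) vs 3 joins (snowflake).
db = sqlite3.connect(":memory:")
db.executescript("""
-- star: one flat product dimension
CREATE TABLE dim_product_flat (product_id INTEGER PRIMARY KEY, name TEXT,
                               subcategory TEXT, category TEXT, department TEXT);
-- snowflake: the same dimension normalized into a hierarchy
CREATE TABLE dim_department (department_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT,
                           department_id INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT,
                          category_id INTEGER);
CREATE TABLE fact_sales (product_id INTEGER, revenue REAL);

INSERT INTO dim_product_flat VALUES
    (1, 'Headphones', 'Audio', 'Accessories', 'Electronics');
INSERT INTO dim_department VALUES (10, 'Electronics');
INSERT INTO dim_category VALUES (20, 'Accessories', 10);
INSERT INTO dim_product VALUES (1, 'Headphones', 20);
INSERT INTO fact_sales VALUES (1, 99.0), (1, 99.0);
""")

star = db.execute("""
    SELECT p.department, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product_flat p USING (product_id)
    GROUP BY p.department""").fetchall()

snowflake = db.execute("""
    SELECT d.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_category c ON c.category_id = p.category_id
    JOIN dim_department d ON d.department_id = c.department_id
    GROUP BY d.name""").fetchall()
```

Both queries return the same answer; the snowflake version simply pays for two extra joins to reach the department name.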
&lt;h2&gt;Side-by-Side Comparison&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Star Schema&lt;/th&gt;
&lt;th&gt;Snowflake Schema&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dimension structure&lt;/td&gt;
&lt;td&gt;Denormalized (flat)&lt;/td&gt;
&lt;td&gt;Normalized (branching)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tables per query&lt;/td&gt;
&lt;td&gt;Fewer (4-6 typical)&lt;/td&gt;
&lt;td&gt;More (8-12 typical)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query performance&lt;/td&gt;
&lt;td&gt;Faster&lt;/td&gt;
&lt;td&gt;Slower (more joins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL complexity&lt;/td&gt;
&lt;td&gt;Simpler&lt;/td&gt;
&lt;td&gt;More complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage efficiency&lt;/td&gt;
&lt;td&gt;Lower (some redundancy)&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BI tool compatibility&lt;/td&gt;
&lt;td&gt;Better&lt;/td&gt;
&lt;td&gt;Worse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL/pipeline complexity&lt;/td&gt;
&lt;td&gt;Simpler loads&lt;/td&gt;
&lt;td&gt;More complex loads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-service friendliness&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Update granularity&lt;/td&gt;
&lt;td&gt;Update many rows&lt;/td&gt;
&lt;td&gt;Update one row&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;When to Choose Which&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choose a star schema when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your primary workload is analytics and reporting&lt;/li&gt;
&lt;li&gt;Business users run ad-hoc queries or use BI tools&lt;/li&gt;
&lt;li&gt;Query performance matters more than storage costs&lt;/li&gt;
&lt;li&gt;You want AI agents to generate accurate SQL (fewer joins = fewer mistakes)&lt;/li&gt;
&lt;li&gt;Your dimensions are small enough that redundancy is negligible&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Choose a snowflake schema when:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Dimensions are very large and redundancy has real storage costs&lt;/li&gt;
&lt;li&gt;Regulatory requirements demand a single canonical source per attribute&lt;/li&gt;
&lt;li&gt;Only ETL engineers (not analysts) write queries against the model&lt;/li&gt;
&lt;li&gt;You need strict referential integrity across dimension hierarchies&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why Star Schema Usually Wins&lt;/h2&gt;
&lt;p&gt;Three changes in modern data platforms have tilted the balance toward star schemas:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Storage is cheap.&lt;/strong&gt; Object storage costs a few cents per gigabyte per month, and archival tiers cost a fraction of a cent. The storage savings from normalizing dimensions rarely justify the query complexity cost.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Columnar formats compress redundancy well.&lt;/strong&gt; Parquet and ORC store data in columns. Repeated values like &amp;quot;Electronics&amp;quot; compress to nearly nothing. The physical storage overhead of a denormalized dimension is much smaller than it appears in row-oriented thinking.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI and self-service need simplicity.&lt;/strong&gt; When an AI agent generates SQL against your data model, fewer tables and fewer joins reduce the chance of hallucinated join paths. When a business analyst builds a report, fewer joins reduce the chance of wrong results.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; make this choice even easier. Virtual datasets let you model star schemas as SQL views without physically copying or denormalizing data. Reflections automatically optimize query performance in the background. You get the simplicity of a star schema with optimized physical performance, regardless of how the underlying data is stored.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/03/star-schema-optimization.png&quot; alt=&quot;Star schema query execution flowing through a query engine with automatic optimization&quot;&gt;&lt;/p&gt;
&lt;p&gt;Take your most-used fact table. Count the joins required to build a complete report. If you&apos;re joining more than five dimension tables, or if dimension tables themselves require sub-joins, consider flattening your dimensions into a star schema. Measure the query performance difference. In most cases, the improvement is significant and the storage increase is negligible.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Semantic Layer vs. Metrics Layer: What&apos;s the Difference?</title><link>https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-vs-metrics-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-semantic-layer-vs-metrics-layer/</guid><description>
![Semantic layer vs metrics layer — the metrics layer is a subset](/assets/images/semantic_layer/03/semantic-vs-metrics.png)

Both terms appear in ev...</description><pubDate>Wed, 18 Feb 2026 11:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/03/semantic-vs-metrics.png&quot; alt=&quot;Semantic layer vs metrics layer — the metrics layer is a subset&quot;&gt;&lt;/p&gt;
&lt;p&gt;Both terms appear in every modern data architecture diagram. They&apos;re used interchangeably in conference talks, Slack threads, and vendor marketing. And almost nobody defines them precisely.&lt;/p&gt;
&lt;p&gt;Here&apos;s the difference, why it matters, and what it means for how you build your data platform.&lt;/p&gt;
&lt;h2&gt;What a Metrics Layer Does&lt;/h2&gt;
&lt;p&gt;A metrics layer has one job: define how business metrics are calculated and make those definitions available to every tool in your stack.&lt;/p&gt;
&lt;p&gt;Take Revenue. Without a metrics layer, the formula lives in a dashboard filter, a dbt model, a Python notebook, and three different analysts&apos; heads. With a metrics layer, the formula is defined once:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Revenue = SUM(order_total) WHERE status = &apos;completed&apos; AND refunded = FALSE
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Every dashboard, API endpoint, and AI agent that needs &amp;quot;Revenue&amp;quot; pulls from this single definition. Change the formula in one place, and it updates everywhere.&lt;/p&gt;
&lt;p&gt;Metrics layers are typically code-defined. &lt;a href=&quot;https://docs.getdbt.com/docs/build/about-metricflow&quot;&gt;dbt&apos;s semantic layer&lt;/a&gt; uses YAML specifications. Cube.js uses JavaScript schemas. The metric definition includes the calculation, the time dimension, the allowed filters, and the grain.&lt;/p&gt;
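&lt;p&gt;In spirit, a metric definition is data, not logic scattered across dashboards. A rough Python sketch of the idea; the shape loosely mirrors what YAML-based metric specs declare, but nothing here is a real dbt or Cube API:&lt;/p&gt;

```python
# One metric, defined once, applied to any row source. Field names and the
# definition shape are invented for illustration.
REVENUE = {
    "name": "revenue",
    "agg": "sum",
    "measure": "order_total",
    "filters": [lambda r: r["status"] == "completed",
                lambda r: r["refunded"] is False],
    "grain": "day",
}

def compute_metric(metric, rows):
    kept = [r for r in rows if all(f(r) for f in metric["filters"])]
    if metric["agg"] != "sum":
        raise ValueError("this sketch only supports sum")
    return sum(r[metric["measure"]] for r in kept)

orders = [
    {"order_total": 100.0, "status": "completed", "refunded": False},
    {"order_total": 40.0, "status": "completed", "refunded": True},
    {"order_total": 25.0, "status": "pending", "refunded": False},
]
```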
&lt;p&gt;This is valuable. But it&apos;s incomplete.&lt;/p&gt;
&lt;h2&gt;What a Semantic Layer Does&lt;/h2&gt;
&lt;p&gt;A semantic layer does everything a metrics layer does, plus more. It covers the full abstraction between raw data and the people (and machines) querying it.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Metrics Layer&lt;/th&gt;
&lt;th&gt;Semantic Layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metric definitions (KPI calculations)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Documentation (table/column descriptions)&lt;/td&gt;
&lt;td&gt;Sometimes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Labels and tags (governance, discoverability)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Join relationships (pre-defined paths)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access policies (row/column security)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query optimization (caching, pre-aggregation)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Often&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A metrics layer tells you &lt;em&gt;how to calculate&lt;/em&gt; a number. A semantic layer tells you &lt;em&gt;what the data means&lt;/em&gt;, &lt;em&gt;how to calculate it&lt;/em&gt;, &lt;em&gt;who can see it&lt;/em&gt;, &lt;em&gt;how to join it&lt;/em&gt;, and &lt;em&gt;where it came from&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;The Relationship: Subset, Not Alternative&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/03/metrics-subset.png&quot; alt=&quot;The metrics layer as a subset within the broader semantic layer&quot;&gt;&lt;/p&gt;
&lt;p&gt;A metrics layer is a component of a semantic layer. Not a replacement.&lt;/p&gt;
&lt;p&gt;Think of it like a spreadsheet. The metrics layer is the formulas: revenue calculations, growth rates, ratios. The semantic layer is the entire workbook: formulas, column headers, sheet labels, formatting, and sharing permissions. You can&apos;t have a useful workbook with just formulas. And you can&apos;t have a complete semantic layer without metric definitions.&lt;/p&gt;
&lt;p&gt;The confusion arose because different vendors built different pieces first. dbt built the metrics layer and called it a &amp;quot;semantic layer.&amp;quot; BI tools like Looker built semantic models (LookML) focused on relationships and query patterns. Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; built a full semantic layer that includes views, documentation, governance, and AI context in one integrated system.&lt;/p&gt;
&lt;h2&gt;Why the Distinction Matters&lt;/h2&gt;
&lt;p&gt;If you build a metrics layer but skip the rest of the semantic layer, you leave three gaps:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No documentation means no AI accuracy.&lt;/strong&gt; When an AI agent generates SQL, it needs more than metric formulas. It needs to know what each column represents, which tables to join, and what filters are valid. Metric definitions alone don&apos;t provide that. Wikis, labels, and column descriptions do. Without them, AI agents hallucinate joins and misinterpret fields.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No security means enforcement happens ad hoc.&lt;/strong&gt; A metrics layer doesn&apos;t include row-level security or column masking. Those policies get applied separately in each BI tool, each notebook, each API. One missed policy, and sensitive data leaks to the wrong role.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No join paths means redundant work.&lt;/strong&gt; If the metrics layer defines &amp;quot;Revenue&amp;quot; but doesn&apos;t define how to connect the Orders table to the Customers table, every consumer figures out the join independently. Some get it right. Some don&apos;t. You get conflicting results from a formula that was supposed to be centralized.&lt;/p&gt;
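&lt;p&gt;To make the gap concrete, here&apos;s a sketch of what a single semantic-layer entry carries beyond the formula. Every name, label, and policy below is hypothetical:&lt;/p&gt;

```python
# A semantic-layer entry: formula plus documentation, labels, join paths,
# and an access policy. All names here are invented for illustration.
REVENUE_ENTRY = {
    "metric": "SUM(order_total) WHERE status = 'completed' AND refunded = FALSE",
    "description": "Recognized revenue from completed, non-refunded orders.",
    "labels": ["Finance", "Certified"],
    "join_paths": {("orders", "customers"):
                   "orders.customer_id = customers.customer_id"},
    "access": {"finance_analyst": "all_rows", "support": "deny"},
}

def can_query(entry, role):
    # Unknown roles are denied by default; policy lives with the definition.
    return entry["access"].get(role, "deny") != "deny"

def join_condition(entry, left, right):
    # Pre-defined join paths stop each consumer from guessing the join.
    return entry["join_paths"].get((left, right))
```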
&lt;h2&gt;What This Looks Like in Practice&lt;/h2&gt;
&lt;p&gt;A platform with a full semantic layer, like Dremio, provides:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Virtual datasets (SQL views)&lt;/strong&gt; that define business logic across federated sources&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wikis&lt;/strong&gt; that document tables and columns in human- and AI-readable format&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Labels&lt;/strong&gt; that tag data for governance (PII, Finance, Certified)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-Grained Access Control&lt;/strong&gt; that enforces row/column security at the view level&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflections&lt;/strong&gt; that automatically optimize performance for the most-queried views&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI-generated metadata&lt;/strong&gt; that auto-populates descriptions and label suggestions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Compare that to a standalone metrics layer, which gives you metric definitions and (sometimes) basic documentation. The metrics layer is the engine. The semantic layer is the complete vehicle.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/03/when-to-choose.png&quot; alt=&quot;Choosing between a metrics layer and a full semantic layer&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;If you already have a metrics layer, audit what&apos;s missing. Do your metric definitions include documentation? Labels? Security policies? Join paths? If not, you have a piece of the semantic layer, not the whole thing.&lt;/p&gt;
&lt;p&gt;Completing the picture means either extending your metrics layer with those capabilities, or adopting a platform that provides them natively.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Design Reliable Data Pipelines</title><link>https://iceberglakehouse.com/posts/2026-02-debp-design-data-pipelines/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-design-data-pipelines/</guid><description>
![Data pipeline architecture with four layers flowing from ingestion through staging, transformation, and serving](/assets/images/debp/02/pipeline-ar...</description><pubDate>Wed, 18 Feb 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/02/pipeline-architecture.png&quot; alt=&quot;Data pipeline architecture with four layers flowing from ingestion through staging, transformation, and serving&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most pipeline failures aren&apos;t caused by bad code. They&apos;re caused by no architecture. A script that reads from an API, transforms JSON, and writes to a database works fine on day one. On day ninety it fails at 3 AM because the API changed its response format, and the only way to recover is to rerun the entire pipeline from scratch — hoping that reprocessing three months of data doesn&apos;t create duplicates.&lt;/p&gt;
&lt;p&gt;Reliable pipelines are designed, not debugged into existence.&lt;/p&gt;
&lt;h2&gt;Reliability Is a Design Property, Not a Bug-Fix&lt;/h2&gt;
&lt;p&gt;You don&apos;t make a pipeline reliable by adding try-catch blocks after it breaks. You make it reliable by building reliability into the architecture from the start. That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Resumability.&lt;/strong&gt; After a failure, you restart from where it stopped, not from the beginning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idempotency.&lt;/strong&gt; Running the same job twice produces the same result.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability.&lt;/strong&gt; You know what the pipeline processed, how long it took, and where it is right now.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Isolation.&lt;/strong&gt; One failing stage doesn&apos;t cascade into unrelated stages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These properties don&apos;t come from choosing the right framework. They come from how you structure the pipeline.&lt;/p&gt;
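&lt;p&gt;Idempotency, for example, falls out of structure rather than framework features: write each batch by overwriting its own partition instead of appending. A toy Python sketch, with a dict standing in for a partitioned table:&lt;/p&gt;

```python
# A dict keyed by batch_id stands in for a partitioned table; overwriting the
# key on rerun is what makes the load idempotent. Names are illustrative.
table = {}

def load_batch(batch_id, rows):
    table[batch_id] = list(rows)  # replace the partition, never append to it

def total_rows():
    return sum(len(rows) for rows in table.values())

load_batch("2026-02-18", [{"order_id": 1}, {"order_id": 2}])
load_batch("2026-02-18", [{"order_id": 1}, {"order_id": 2}])  # safe rerun
```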
&lt;h2&gt;The Four Architecture Layers&lt;/h2&gt;
&lt;p&gt;Every well-designed pipeline has four distinct layers, even if they run in the same job:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ingestion.&lt;/strong&gt; Pull raw data from sources and land it unchanged. Don&apos;t transform here. Don&apos;t filter. Don&apos;t join. Store the raw data exactly as it arrived, with metadata (timestamp, source, batch ID). This gives you a replayable audit trail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Staging.&lt;/strong&gt; Validate the raw data. Check for schema compliance, null values in required fields, duplicate records, and data type mismatches. Records that fail validation go to a quarantine table or dead-letter queue — they don&apos;t silently disappear.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Transformation.&lt;/strong&gt; Apply business logic: joins, aggregations, calculations, enrichments. This is where raw events become metrics, where customer records merge across sources, where timestamps convert to business periods. Keep business logic in one layer, not spread across ingestion and loading scripts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Serving.&lt;/strong&gt; Organize the transformed data for consumers. Analysts need star schemas. ML models need feature tables. APIs need denormalized lookups. The serving layer shapes data for its audience without changing the underlying transformation logic.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/02/four-layers.png&quot; alt=&quot;Stages: ingest raw data, validate in staging, apply business logic, serve to consumers&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Build a DAG, Not a Script&lt;/h2&gt;
&lt;p&gt;A script runs steps in order: step 1, step 2, step 3. If step 2 fails, you rerun from step 1. If step 3 needs a new input, you rewrite the script.&lt;/p&gt;
&lt;p&gt;A directed acyclic graph (DAG) models dependencies explicitly. Step 3 depends on step 2 and step 4. Step 2 and step 4 can run in parallel. If step 2 fails, you rerun step 2 — not steps 1, 4, or 3.&lt;/p&gt;
&lt;p&gt;DAG-based thinking gives you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Parallelism.&lt;/strong&gt; Independent stages run concurrently, cutting wall-clock time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Targeted retries.&lt;/strong&gt; Failed stages retry alone, not the entire workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clear dependencies.&lt;/strong&gt; You can see exactly what feeds into a given output.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Incremental development.&lt;/strong&gt; Add new stages without touching existing ones.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even if your orchestrator doesn&apos;t enforce DAGs, design your pipeline as one. Document which stages depend on which outputs. Make each stage read from a defined input location and write to a defined output location.&lt;/p&gt;
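&lt;p&gt;In Python, the standard library&apos;s graphlib can model exactly this. A sketch with invented stage names, grouping stages into waves that could run in parallel:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# The stage graph as explicit data: each stage maps to the stages it depends
# on. Stage names are illustrative; graphlib ships with Python 3.9+.
dag = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "validate": {"ingest_orders", "ingest_customers"},
    "transform": {"validate"},
    "serve": {"transform"},
}

def parallel_waves(graph):
    # Stages in the same wave share no dependency and can run concurrently;
    # a failed stage blocks only its own descendants.
    ts = TopologicalSorter(graph)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())
        waves.append(ready)
        ts.done(*ready)
    return waves
```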
&lt;h2&gt;Dependency Management&lt;/h2&gt;
&lt;p&gt;Implicit dependencies are the most common source of pipeline fragility. &amp;quot;This pipeline assumes table X exists because another pipeline created it&amp;quot; is an implicit dependency. When the other pipeline is delayed, skipped, or renamed, your pipeline breaks.&lt;/p&gt;
&lt;p&gt;Make dependencies explicit:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Declare data dependencies.&lt;/strong&gt; If stage B reads the output of stage A, model that relationship in your orchestration. Don&apos;t rely on timing (&amp;quot;A usually finishes by 6 AM&amp;quot;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use sensors or triggers.&lt;/strong&gt; Wait for data to arrive before starting a stage. Check for a file, a partition, or a row count — don&apos;t check the clock.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version your interfaces.&lt;/strong&gt; When a producer changes its output schema, consumers should detect the change before they process stale or malformed data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Document ownership.&lt;/strong&gt; Every dataset should have an owner. When you depend on someone else&apos;s table, you should know who to contact when it changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Failure Handling Patterns&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Retry with backoff.&lt;/strong&gt; Most transient failures (network timeouts, API throttling, lock contention) resolve themselves. Retry 3-5 times with exponential backoff (e.g., 1s, 5s, 25s) before marking a stage as failed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dead-letter queues.&lt;/strong&gt; Records that cannot be processed (corrupt payloads, unexpected schemas, values out of range) go to a quarantine area. Log why they failed. Review them periodically. Don&apos;t drop them silently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Circuit breakers.&lt;/strong&gt; If a downstream system returns errors consistently, stop sending requests after N failures. Resume with a health check. This prevents cascading failures and buffer exhaustion.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Checkpointing.&lt;/strong&gt; After processing each batch or partition, record what was completed. On failure, resume from the last checkpoint. This is the difference between a 5-minute recovery and a 5-hour reprocessing job.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/02/failure-patterns.png&quot; alt=&quot;Failure handling: retry, dead-letter queue, circuit breaker, checkpoint&quot;&gt;&lt;/p&gt;
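&lt;p&gt;Retry-with-backoff and checkpointing combine naturally in one loop. A minimal Python sketch; the in-memory set stands in for a durable checkpoint store, and the delays follow the 1s/5s/25s example above:&lt;/p&gt;

```python
import time

# Exponential backoff plus checkpointing in one stage runner. The sleep
# function is injectable so tests can skip real waiting. Names illustrative.
completed = set()

def run_stage(partitions, process, max_attempts=3, base_delay=1.0,
              sleep=time.sleep):
    for part in partitions:
        if part in completed:
            continue  # resume from the last checkpoint, skip finished work
        for attempt in range(max_attempts):
            try:
                process(part)
                completed.add(part)  # checkpoint recorded on success
                break
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # retries exhausted: surface the failure
                sleep(base_delay * 5 ** attempt)  # 1s, then 5s, then 25s
```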
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Map your current pipelines against the four architecture layers. Identify which layers are missing or mixed together. The most common gap: ingestion and transformation are in the same script, making it impossible to replay raw data or isolate failures. Separate them, and reliability follows.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Conceptual, Logical, and Physical Data Models Explained</title><link>https://iceberglakehouse.com/posts/2026-02-dm-types-of-data-models/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-types-of-data-models/</guid><description>
![Three layers of data modeling from business concepts to database implementation](/assets/images/data_modeling/02/types-of-data-models.png)

Most da...</description><pubDate>Wed, 18 Feb 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/02/types-of-data-models.png&quot; alt=&quot;Three layers of data modeling from business concepts to database implementation&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most data teams jump straight from a stakeholder request to creating database tables. They skip the planning steps that prevent misalignment, redundancy, and rework. The result: tables that make sense to the engineer who built them but confuse everyone else.&lt;/p&gt;
&lt;p&gt;Data modeling addresses this by working at three levels of abstraction. Each level answers a different question, for a different audience, at a different stage of the design process.&lt;/p&gt;
&lt;h2&gt;Why Three Levels Exist&lt;/h2&gt;
&lt;p&gt;A single data model can&apos;t serve every purpose. Business stakeholders need to see what data the system captures and how concepts relate. Data architects need to define precise structures, data types, and rules. Database engineers need to optimize storage and performance for a specific platform.&lt;/p&gt;
&lt;p&gt;Trying to capture all of this in one diagram creates a document that&apos;s too abstract for engineers and too technical for the business. Three levels solve this by separating concerns.&lt;/p&gt;
&lt;h2&gt;The Conceptual Data Model&lt;/h2&gt;
&lt;p&gt;A conceptual data model defines the big picture. It identifies the major entities your system needs to track and the relationships between them.&lt;/p&gt;
&lt;p&gt;For an e-commerce platform, a conceptual model might look like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Customer&lt;/strong&gt; places &lt;strong&gt;Order&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order&lt;/strong&gt; contains &lt;strong&gt;Line Item&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Line Item&lt;/strong&gt; references &lt;strong&gt;Product&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product&lt;/strong&gt; belongs to &lt;strong&gt;Category&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are no column names, no data types, no keys. The conceptual model exists to answer one question: &amp;quot;Do we agree on what data the system needs?&amp;quot;&lt;/p&gt;
&lt;p&gt;This model is created collaboratively with business stakeholders. Its value is alignment. When the finance team says &amp;quot;customer&amp;quot; and the marketing team says &amp;quot;customer,&amp;quot; the conceptual model ensures they mean the same thing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Skip this level&lt;/strong&gt;, and you build a database that captures the wrong entities or misses key relationships. Fixing structural errors after the database is in production can cost an order of magnitude more than catching them at the conceptual stage.&lt;/p&gt;
&lt;h2&gt;The Logical Data Model&lt;/h2&gt;
&lt;p&gt;The logical model adds precision to the conceptual model. It defines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Attributes&lt;/strong&gt; for each entity (customer_id, customer_name, email, signup_date)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data types&lt;/strong&gt; (INTEGER, VARCHAR(255), DATE)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Primary keys&lt;/strong&gt; (customer_id uniquely identifies each customer)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Foreign keys&lt;/strong&gt; (order.customer_id references customer.customer_id)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Normalization rules&lt;/strong&gt; (eliminate redundancy up to Third Normal Form)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The logical model is intentionally DBMS-independent. It works whether you implement it in PostgreSQL, MySQL, Snowflake, or Apache Iceberg tables. This separation matters because it lets you evaluate the design on its own merits before committing to a specific technology.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/02/logical-model-detail.png&quot; alt=&quot;Logical model showing entities with attributes, data types, and relationship keys&quot;&gt;&lt;/p&gt;
&lt;p&gt;Normalization is the primary discipline at this level. The logical model eliminates data redundancy by splitting entities into their most atomic forms. A customer&apos;s address doesn&apos;t live in the orders table — it lives in its own table, referenced by a foreign key.&lt;/p&gt;
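&lt;p&gt;To make that separation concrete, here is a minimal sketch (all table and column names are hypothetical) of generic, DBMS-independent DDL, executed against SQLite purely to verify that the normalized structure holds together:&lt;/p&gt;

```python
import sqlite3

# Logical-model DDL: customers and their addresses live in separate
# tables, linked by a foreign key (3NF), instead of duplicating the
# address on every customer or order row.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name VARCHAR(255) NOT NULL,
        email         VARCHAR(255) NOT NULL UNIQUE,
        signup_date   DATE
    )""")
conn.execute("""
    CREATE TABLE customer_address (
        address_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        city        VARCHAR(255),
        postal_code VARCHAR(20)
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada', 'ada@example.com', '2026-01-15')")
conn.execute("INSERT INTO customer_address VALUES (10, 1, 'London', 'EC1')")

# The address is reachable through the key, not duplicated on the row.
row = conn.execute("""
    SELECT c.customer_name, a.city
    FROM customer c JOIN customer_address a ON a.customer_id = c.customer_id
""").fetchone()
print(row)  # ('Ada', 'London')
```

&lt;p&gt;The same structure could be implemented unchanged in PostgreSQL, MySQL, or as Iceberg tables; that portability is the point of keeping the logical model engine-neutral.&lt;/p&gt;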
&lt;h2&gt;The Physical Data Model&lt;/h2&gt;
&lt;p&gt;The physical model translates the logical model into the exact implementation for a specific database engine. This is where theoretical design meets operational reality.&lt;/p&gt;
&lt;p&gt;A physical model specifies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Table names and column definitions (&lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;line_items&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Data types specific to the DBMS (&lt;code&gt;BIGINT&lt;/code&gt; vs. &lt;code&gt;INTEGER&lt;/code&gt;, &lt;code&gt;TIMESTAMP_TZ&lt;/code&gt; vs. &lt;code&gt;TIMESTAMP&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Indexes for query performance (B-tree on &lt;code&gt;customer_id&lt;/code&gt;, hash on &lt;code&gt;email&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Partitioning strategies (partition &lt;code&gt;orders&lt;/code&gt; by &lt;code&gt;order_date&lt;/code&gt; using monthly ranges)&lt;/li&gt;
&lt;li&gt;Compression and file format choices (Parquet with Snappy compression for Iceberg)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The physical model is where performance tuning happens. You might denormalize at this level — joining the customer name into the orders table to avoid an expensive join at query time — even though the logical model keeps them separate.&lt;/p&gt;
&lt;p&gt;In a lakehouse architecture, the physical model also includes Iceberg table properties: partition specs (time-based or value-based), sort orders for query optimization, and file format settings.&lt;/p&gt;
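&lt;p&gt;As a sketch of what a physical model for the orders table might look like as an Apache Iceberg table (Spark SQL flavor; the catalog, schema, and column names are hypothetical), built here as a plain string since no particular engine is assumed:&lt;/p&gt;

```python
# Hypothetical physical-model DDL for an Iceberg orders table.
# Partitioning, sort order, and file format are physical-level choices
# that the logical model deliberately leaves open.
ddl = """
CREATE TABLE lakehouse.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  TIMESTAMP,
    order_total DECIMAL(12, 2)
)
USING iceberg
PARTITIONED BY (months(order_date))
TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'snappy'
)
"""

print("PARTITIONED BY (months(order_date))" in ddl)  # True
```

&lt;p&gt;Note how the &lt;code&gt;months(order_date)&lt;/code&gt; partition transform and the Parquet/Snappy properties belong only here; nothing at the conceptual or logical level mentioned them.&lt;/p&gt;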
&lt;h2&gt;How the Three Levels Connect&lt;/h2&gt;
&lt;p&gt;Each level feeds the next:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Conceptual&lt;/th&gt;
&lt;th&gt;Logical&lt;/th&gt;
&lt;th&gt;Physical&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Abstraction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audience&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business stakeholders&lt;/td&gt;
&lt;td&gt;Data architects&lt;/td&gt;
&lt;td&gt;Database engineers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Entities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Named&lt;/td&gt;
&lt;td&gt;Defined with attributes&lt;/td&gt;
&lt;td&gt;Tables with typed columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Relationships&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Named&lt;/td&gt;
&lt;td&gt;With cardinality and keys&lt;/td&gt;
&lt;td&gt;Foreign key constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data types&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Generic (INTEGER, VARCHAR)&lt;/td&gt;
&lt;td&gt;DBMS-specific (BIGINT, TEXT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Normalization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not applicable&lt;/td&gt;
&lt;td&gt;Applied (3NF)&lt;/td&gt;
&lt;td&gt;May denormalize for performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not considered&lt;/td&gt;
&lt;td&gt;Not considered&lt;/td&gt;
&lt;td&gt;Indexes, partitions, caching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt;, you can implement all three levels using virtual datasets organized in a Medallion Architecture. Bronze views represent the physical layer (raw data mapped to typed columns). Silver views represent the logical layer (joins, business keys, normalized relationships). Gold views represent the conceptual layer (business entities ready for consumption, documented with Wikis and tagged with Labels).&lt;/p&gt;
&lt;h2&gt;Common Mistakes&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Skipping the conceptual model.&lt;/strong&gt; Engineers jump to table creation and miss requirement gaps that surface months later when a stakeholder asks &amp;quot;Why don&apos;t we track X?&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Building logical models tied to a DBMS.&lt;/strong&gt; If your logical model includes PostgreSQL-specific syntax, it&apos;s a physical model disguised as a logical one. This makes migration and evaluation harder.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-normalizing for analytics.&lt;/strong&gt; Third Normal Form is correct for transactional systems. But analytics workloads benefit from wider, flatter tables that reduce join counts. Know when to denormalize.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Under-documenting all levels.&lt;/strong&gt; A model without documentation is a puzzle. Column names like &lt;code&gt;c_id&lt;/code&gt;, &lt;code&gt;dt&lt;/code&gt;, and &lt;code&gt;amt&lt;/code&gt; save keystrokes and cost hours of confusion.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/02/data-model-stakeholders.png&quot; alt=&quot;Data models feeding into AI, dashboards, and governance systems&quot;&gt;&lt;/p&gt;
&lt;p&gt;Audit your current data platform against all three levels. Can you show a business stakeholder what entities your system tracks (conceptual)? Can you show an architect the precise attributes and relationships (logical)? Can you explain why the tables are partitioned and indexed the way they are (physical)?&lt;/p&gt;
&lt;p&gt;If any of those questions draws a blank, you have a gap worth filling.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Build a Semantic Layer: A Step-by-Step Guide</title><link>https://iceberglakehouse.com/posts/2026-02-sl-how-to-build-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-how-to-build-semantic-layer/</guid><description>
![Building a semantic layer — Bronze, Silver, and Gold tiers](/assets/images/semantic_layer/02/build-semantic-layer.png)

Most teams start building a...</description><pubDate>Wed, 18 Feb 2026 10:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/02/build-semantic-layer.png&quot; alt=&quot;Building a semantic layer — Bronze, Silver, and Gold tiers&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most teams start building a semantic layer the wrong way: they open their BI tool, create a few calculated fields, and call it done. Six months later, three dashboards define &amp;quot;churn&amp;quot; differently, nobody trusts the numbers, and the data team is debugging metric discrepancies instead of building new features.&lt;/p&gt;
&lt;p&gt;A well-built semantic layer prevents all of that. Here&apos;s how to do it right.&lt;/p&gt;
&lt;h2&gt;Start With Metrics, Not Data Models&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/02/metric-alignment.png&quot; alt=&quot;Stakeholders aligning on unified metric definitions&quot;&gt;&lt;/p&gt;
&lt;p&gt;Before writing a single line of SQL, sit down with stakeholders from Sales, Finance, Marketing, and Product. Agree on the top 5-10 business metrics your organization uses to make decisions.&lt;/p&gt;
&lt;p&gt;For each metric, document:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The calculation&lt;/strong&gt;: Revenue = SUM(order_total) WHERE status = &apos;completed&apos; AND refunded = FALSE&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The owner&lt;/strong&gt;: Who is accountable for this definition?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The grain&lt;/strong&gt;: Daily? Monthly? Per customer?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The refresh cadence&lt;/strong&gt;: Real-time? Daily batch? Weekly?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This exercise is harder than it sounds. You will discover that &amp;quot;Monthly Active Users&amp;quot; has three competing definitions. That&apos;s the point. The semantic layer can&apos;t resolve disagreements that haven&apos;t been surfaced yet.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: A metric glossary. This becomes the source document for everything you build next.&lt;/p&gt;
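&lt;p&gt;The glossary doesn&apos;t need special tooling to start; a structured record per metric is enough. A minimal sketch (the field names are an illustrative convention, not a standard schema):&lt;/p&gt;

```python
# One glossary entry per metric, answering all four questions above.
metric_glossary = {
    "revenue": {
        "calculation": "SUM(order_total) WHERE status = 'completed' AND refunded = FALSE",
        "owner": "Finance",
        "grain": "daily",
        "refresh": "daily batch",
    },
    "monthly_active_users": {
        "calculation": "COUNT(DISTINCT user_id) over the trailing 30 days of events",
        "owner": "Product",
        "grain": "monthly",
        "refresh": "daily batch",
    },
}

# A metric enters the glossary only once every field is filled in.
required = {"calculation", "owner", "grain", "refresh"}
print(all(required.issubset(set(m)) for m in metric_glossary.values()))  # True
```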
&lt;h2&gt;Map Your Data Sources&lt;/h2&gt;
&lt;p&gt;Inventory every system that feeds into your analytics:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source Type&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Access Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transactional databases&lt;/td&gt;
&lt;td&gt;PostgreSQL, MySQL, SQL Server&lt;/td&gt;
&lt;td&gt;Federated query (read-only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud data lakes&lt;/td&gt;
&lt;td&gt;S3 (Parquet/Iceberg), Azure Data Lake&lt;/td&gt;
&lt;td&gt;Direct scan or catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SaaS platforms&lt;/td&gt;
&lt;td&gt;Salesforce, HubSpot, Stripe&lt;/td&gt;
&lt;td&gt;API extraction or replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spreadsheets&lt;/td&gt;
&lt;td&gt;Google Sheets, Excel&lt;/td&gt;
&lt;td&gt;One-time import or scheduled sync&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Not all sources need to be replicated into a central store. Federation lets you query data where it lives without the cost and complexity of ETL pipelines. Platforms like &lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; connect to dozens of sources and present them in a single namespace, so your semantic layer can span everything without data movement.&lt;/p&gt;
&lt;h2&gt;Design the Three-Layer View Structure&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/02/three-layer-arch.png&quot; alt=&quot;Bronze, Silver, and Gold data layers in the Medallion Architecture&quot;&gt;&lt;/p&gt;
&lt;p&gt;The most effective semantic layer architecture uses three layers of SQL views, commonly called the Medallion Architecture.&lt;/p&gt;
&lt;h3&gt;Bronze Layer (Preparation)&lt;/h3&gt;
&lt;p&gt;Create one view per raw source table. Apply no business logic. Just make the data human-readable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Rename cryptic columns: &lt;code&gt;col_7&lt;/code&gt; → &lt;code&gt;OrderDate&lt;/code&gt;, &lt;code&gt;cust_id&lt;/code&gt; → &lt;code&gt;CustomerID&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Cast types to standard formats: strings to dates, integers to decimals&lt;/li&gt;
&lt;li&gt;Normalize timestamps to UTC&lt;/li&gt;
&lt;li&gt;Avoid using SQL reserved words as column names (&lt;code&gt;Timestamp&lt;/code&gt;, &lt;code&gt;Date&lt;/code&gt;, &lt;code&gt;Role&lt;/code&gt; will force double-quoting in every downstream query. Use &lt;code&gt;EventTimestamp&lt;/code&gt;, &lt;code&gt;TransactionDate&lt;/code&gt;, &lt;code&gt;UserRole&lt;/code&gt; instead.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Bronze views should be boring. Their only job is to make raw data safe to work with.&lt;/p&gt;
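&lt;p&gt;A Bronze view in the spirit of the bullets above can be sketched like this (source and column names are hypothetical; SQLite stands in for the query engine):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A raw landing table with cryptic names, as it might arrive from a source.
conn.execute("CREATE TABLE raw_orders (col_7 TEXT, cust_id INTEGER, amt TEXT)")
conn.execute("INSERT INTO raw_orders VALUES ('2026-02-18', 42, '19.99')")

# Bronze view: rename and cast only. No joins, no filters, no business logic.
conn.execute("""
    CREATE VIEW bronze_orders AS
    SELECT
        DATE(col_7)       AS OrderDate,
        cust_id           AS CustomerID,
        CAST(amt AS REAL) AS OrderTotal
    FROM raw_orders
""")

print(conn.execute("SELECT * FROM bronze_orders").fetchone())
# ('2026-02-18', 42, 19.99)
```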
&lt;h3&gt;Silver Layer (Business Logic)&lt;/h3&gt;
&lt;p&gt;This is where your metric glossary becomes code. Silver views join Bronze views, deduplicate records, filter invalid data, and apply business rules.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE VIEW silver.orders_enriched AS
SELECT
    o.OrderID,
    o.OrderDate,
    o.Total AS OrderTotal,
    c.Region,
    c.Segment
FROM bronze.orders_raw o
JOIN bronze.customers_raw c ON o.CustomerID = c.CustomerID
WHERE o.Total &amp;gt; 0 AND o.Status = &apos;completed&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each Silver view encodes exactly one business concept. &amp;quot;Revenue&amp;quot; is defined in one place. Every dashboard, notebook, and AI agent that needs revenue queries this view. No exceptions.&lt;/p&gt;
&lt;h3&gt;Gold Layer (Application)&lt;/h3&gt;
&lt;p&gt;Gold views are pre-aggregated for specific consumers. A BI dashboard gets &lt;code&gt;monthly_revenue_by_region&lt;/code&gt;. An AI agent gets &lt;code&gt;customer_360_summary&lt;/code&gt;. A finance report gets &lt;code&gt;quarterly_financial_summary&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Gold views don&apos;t add new business logic. They aggregate and reshape Silver views for performance and usability.&lt;/p&gt;
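&lt;p&gt;For example, a Gold view might roll a Silver view up by month and region without restating any business rule (a sketch with hypothetical names; SQLite stands in for the engine):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for a Silver view: one row per completed order with its region.
conn.execute("CREATE TABLE silver_orders_enriched (OrderDate TEXT, OrderTotal REAL, Region TEXT)")
conn.executemany(
    "INSERT INTO silver_orders_enriched VALUES (?, ?, ?)",
    [("2026-01-05", 100.0, "EMEA"), ("2026-01-20", 50.0, "EMEA"), ("2026-01-11", 75.0, "AMER")],
)

# Gold view: aggregate and reshape only. The definition of revenue
# (which orders count) stays in the Silver layer.
conn.execute("""
    CREATE VIEW gold_monthly_revenue_by_region AS
    SELECT strftime('%Y-%m', OrderDate) AS Month, Region, SUM(OrderTotal) AS Revenue
    FROM silver_orders_enriched
    GROUP BY Month, Region
""")

print(conn.execute(
    "SELECT * FROM gold_monthly_revenue_by_region ORDER BY Region"
).fetchall())
# [('2026-01', 'AMER', 75.0), ('2026-01', 'EMEA', 150.0)]
```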
&lt;h2&gt;Document Everything — or Let AI Help&lt;/h2&gt;
&lt;p&gt;An undocumented semantic layer is a semantic layer nobody uses. Every table and every column should have a description that explains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What the data represents&lt;/li&gt;
&lt;li&gt;Where it comes from&lt;/li&gt;
&lt;li&gt;Any known limitations or caveats&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is tedious work. Modern platforms accelerate it with AI. Dremio&apos;s generative AI, for example, can auto-generate Wiki descriptions by sampling table data, and suggest Labels (tags like &amp;quot;PII,&amp;quot; &amp;quot;Finance,&amp;quot; &amp;quot;Certified&amp;quot;) for governance and discoverability. The AI provides roughly 70% of a first draft; your data team fills in the domain-specific context.&lt;/p&gt;
&lt;p&gt;This documentation serves two audiences: human analysts browsing the catalog, and AI agents that need context to generate accurate SQL. Both benefit from rich, accurate descriptions.&lt;/p&gt;
&lt;h2&gt;Enforce Access Policies at the Layer&lt;/h2&gt;
&lt;p&gt;Security should be embedded in the semantic layer, not applied after the fact in each tool. Two patterns:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Security&lt;/strong&gt;: Filter what data a user can see based on their role. A regional manager sees only their region&apos;s data. The SQL view applies the filter automatically.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Column Masking&lt;/strong&gt;: Mask sensitive columns (SSN, email, salary) for roles that don&apos;t need them. Analysts see &lt;code&gt;****@email.com&lt;/code&gt;. Data engineers see the full value.&lt;/p&gt;
&lt;p&gt;The advantage of enforcing policies at the semantic layer: every downstream query inherits the rules, whether the query comes from a dashboard, a Python notebook, or an AI agent. No gaps.&lt;/p&gt;
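&lt;p&gt;Column masking can be sketched as a CASE-style expression keyed to the caller&apos;s role. Real platforms express this with native masking policies tied to session context; this SQLite sketch (all names hypothetical) inlines the policy only to show the shape:&lt;/p&gt;

```python
import sqlite3

def masked_customer_view(role: str) -> str:
    # In a real platform the role comes from the session context and the
    # policy lives in the semantic layer; this sketch inlines both.
    if role == "data_engineer":
        email_expr = "email"
    else:
        # Keep the domain, mask the local part: '****@example.com'.
        email_expr = "'****' || substr(email, instr(email, '@'))"
    return f"SELECT customer_id, {email_expr} AS email FROM customers"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'ada@example.com')")

print(conn.execute(masked_customer_view("analyst")).fetchone())
# (1, '****@example.com')
print(conn.execute(masked_customer_view("data_engineer")).fetchone())
# (1, 'ada@example.com')
```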
&lt;h2&gt;Start Small, Then Expand&lt;/h2&gt;
&lt;p&gt;Don&apos;t try to model your entire data landscape on day one. Start with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;3-5 core metrics from your glossary&lt;/li&gt;
&lt;li&gt;The 2-3 source systems those metrics depend on&lt;/li&gt;
&lt;li&gt;One Bronze → Silver → Gold pipeline per metric&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Validate by running the same question across two different tools (a BI dashboard and a SQL notebook, for example). If both return the same number, the semantic layer is working. If they don&apos;t, fix the Silver view definition before adding more.&lt;/p&gt;
&lt;p&gt;Once the first metrics are stable, expand incrementally. Add new sources, new Silver views, new Gold views. Each addition is low-risk because the layered structure isolates changes.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick the metric your organization argues about the most. Define it explicitly in a Silver view. Test it against the current dashboards. If the numbers match, you&apos;ve validated the approach. If they don&apos;t, you&apos;ve just found the inconsistency that&apos;s been silently costing your organization trust.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Think Like a Data Engineer</title><link>https://iceberglakehouse.com/posts/2026-02-debp-think-like-data-engineer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-debp-think-like-data-engineer/</guid><description>
![Data flowing through a system of interconnected pipeline stages from sources to consumers](/assets/images/debp/01/data-engineer-mindset.png)

The m...</description><pubDate>Wed, 18 Feb 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/debp/01/data-engineer-mindset.png&quot; alt=&quot;Data flowing through a system of interconnected pipeline stages from sources to consumers&quot;&gt;&lt;/p&gt;
&lt;p&gt;The median lifespan of a popular data tool is about three years. The tool you master today may be deprecated or replaced by the time your next project ships. What doesn&apos;t change are the principles underneath: how data flows, how systems fail, how contracts between producers and consumers work, and how to decompose messy requirements into clean, maintainable pipelines.&lt;/p&gt;
&lt;p&gt;Thinking like a data engineer means solving problems at the systems level, not the tool level. It means asking &amp;quot;what could go wrong?&amp;quot; before asking &amp;quot;what framework should I use?&amp;quot;&lt;/p&gt;
&lt;h2&gt;Tools Change — Principles Don&apos;t&lt;/h2&gt;
&lt;p&gt;Every year brings a new orchestrator, a new streaming framework, a new columnar format. Teams that build their expertise around a specific tool struggle when the landscape shifts. Teams that build expertise around principles — idempotency, schema contracts, data quality at the source, composable stages — adopt new tools without starting over.&lt;/p&gt;
&lt;p&gt;The question is never &amp;quot;How do I do this in Tool X?&amp;quot; The question is &amp;quot;What problem am I solving, and what properties does the solution need to have?&amp;quot; Once you answer that, the tool choice becomes a constraint-matching exercise.&lt;/p&gt;
&lt;h2&gt;The Five Questions Framework&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/01/five-questions-framework.png&quot; alt=&quot;Five-question framework: Sources, Destinations, Transformations, Failure Modes, Monitoring&quot;&gt;&lt;/p&gt;
&lt;p&gt;Before designing any pipeline, answer five questions:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. What data exists?&lt;/strong&gt; Identify every source: databases, APIs, event streams, files. Note the format (JSON, CSV, Parquet, Avro), volume (rows per day), freshness (real-time, hourly, daily), and reliability (does this source go down?).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Where does it need to go?&lt;/strong&gt; Identify every consumer: dashboards, ML models, downstream systems, analysts. Note what format they need, how fresh the data must be, and what SLAs they expect.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;3. What transformations are needed?&lt;/strong&gt; Map the gap between source shape and consumer shape. This includes cleaning (nulls, duplicates, encoding), enriching (joining lookup data), and aggregating (daily summaries, running totals).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;4. What can go wrong?&lt;/strong&gt; List failure modes: late data, schema changes in the source, duplicate events, null values in required fields, API rate limits, network partitions, out-of-order events. For each failure mode, define the expected behavior — skip, retry, alert, or quarantine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;5. How will you know if it&apos;s working?&lt;/strong&gt; Define observability: row counts in vs. row counts out, freshness checks, schema validation, anomaly detection. If you can&apos;t answer this question before building the pipeline, you&apos;ll be debugging in production.&lt;/p&gt;
&lt;h2&gt;Think in Systems, Not Scripts&lt;/h2&gt;
&lt;p&gt;A script processes data from A to B. A system handles what happens when A is late, B is down, the data shape changes, the volume doubles, and the on-call engineer needs to understand what happened at 3 AM.&lt;/p&gt;
&lt;p&gt;Thinking in systems means:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Composability.&lt;/strong&gt; Break pipelines into discrete stages that can be developed, tested, and monitored independently. An ingestion stage should not also handle transformation and loading. When a stage fails, you restart that stage, not the entire pipeline.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Contracts.&lt;/strong&gt; Define what each stage produces: column names, data types, value ranges, freshness guarantees. When a producer changes its output, the contract violation is caught immediately — not three stages downstream when a dashboard shows wrong numbers.&lt;/p&gt;
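&lt;p&gt;A contract can start as nothing more than an explicit expected schema that each stage validates before handing data on. A minimal sketch (the contract fields and record shapes are hypothetical):&lt;/p&gt;

```python
# A minimal schema contract: column name mapped to expected Python type.
ORDERS_CONTRACT = {"order_id": int, "order_date": str, "order_total": float}

def validate(records, contract):
    """Raise immediately on a contract violation instead of letting bad
    rows flow three stages downstream."""
    for i, rec in enumerate(records):
        missing = contract.keys() - rec.keys()
        if missing:
            raise ValueError(f"record {i}: missing columns {sorted(missing)}")
        for col, typ in contract.items():
            if not isinstance(rec[col], typ):
                raise ValueError(
                    f"record {i}: {col} is {type(rec[col]).__name__}, expected {typ.__name__}"
                )
    return records

good = [{"order_id": 1, "order_date": "2026-02-18", "order_total": 19.99}]
validate(good, ORDERS_CONTRACT)  # passes silently

bad = [{"order_id": 1, "order_date": "2026-02-18"}]  # producer dropped a column
try:
    validate(bad, ORDERS_CONTRACT)
except ValueError as e:
    print(e)  # record 0: missing columns ['order_total']
```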
&lt;p&gt;&lt;strong&gt;State management.&lt;/strong&gt; Track what has been processed. Know where to resume after a failure. Avoid reprocessing data unnecessarily by maintaining checkpoints, watermarks, or change data capture (CDC) positions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Isolation.&lt;/strong&gt; One failing pipeline should not take down others. Shared resources (connection pools, compute clusters, storage) need per-pipeline limits to prevent noisy-neighbor problems.&lt;/p&gt;
&lt;h2&gt;Design for Failure First&lt;/h2&gt;
&lt;p&gt;The default assumption should be: every component will fail. Networks drop. APIs return errors. Source schemas change without warning. Storage fills up. The pipeline that handles none of these cases works in development and breaks in production.&lt;/p&gt;
&lt;p&gt;Practical failure-first design:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Retry with backoff.&lt;/strong&gt; Transient errors (network timeouts, API rate limits) often resolve themselves. Retry with exponential backoff before alerting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dead-letter queues.&lt;/strong&gt; Records that can&apos;t be processed (malformed, unexpected schema) go to a separate queue for inspection — not dropped silently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idempotent writes.&lt;/strong&gt; Running a pipeline job twice should produce the same end-state. Use upserts, deduplication, or transaction-based writes instead of blind appends.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Circuit breakers.&lt;/strong&gt; If a downstream system is unresponsive, stop sending data after N failures instead of filling up buffers and crashing.&lt;/li&gt;
&lt;/ul&gt;
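&lt;p&gt;The first two patterns can be sketched together: retry transient errors with exponential backoff, then quarantine whatever still fails. The &lt;code&gt;flaky&lt;/code&gt; handler and &lt;code&gt;TransientError&lt;/code&gt; type below are stand-ins for a real source call and its error class:&lt;/p&gt;

```python
import time

class TransientError(Exception):
    """Stand-in for a timeout, rate limit, or other retryable failure."""

def process_with_retries(records, handler, max_attempts=3, base_delay=0.01):
    """Retry transient failures with exponential backoff; records that
    still fail go to a dead-letter list for inspection, never dropped."""
    dead_letter = []
    for rec in records:
        for attempt in range(max_attempts):
            try:
                handler(rec)
                break
            except TransientError:
                time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s...
        else:
            dead_letter.append(rec)  # retries exhausted: quarantine
    return dead_letter

calls = {"n": 0}
def flaky(rec):
    calls["n"] += 1
    if rec == "bad" or calls["n"] == 1:  # first call fails transiently
        raise TransientError(rec)

print(process_with_retries(["a", "bad"], flaky))  # ['bad']
```

&lt;p&gt;&lt;code&gt;&amp;quot;a&amp;quot;&lt;/code&gt; succeeds on its second attempt; &lt;code&gt;&amp;quot;bad&amp;quot;&lt;/code&gt; lands in the dead-letter list after three attempts instead of disappearing silently.&lt;/p&gt;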
&lt;h2&gt;Anti-Patterns That Signal Inexperience&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Choosing the tool before understanding the problem.&lt;/strong&gt; &amp;quot;We should use Kafka&amp;quot; is not a good starting point. &amp;quot;We need sub-second event delivery with at-least-once guarantees&amp;quot; is. The tool choice follows from the requirements, not the other way around.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Monolithic pipelines.&lt;/strong&gt; One script that reads from a database, cleans data, joins three tables, aggregates, and writes to a warehouse. When any step fails, the entire pipeline fails. When any step needs a change, the entire pipeline needs retesting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No error handling.&lt;/strong&gt; &lt;code&gt;try: process() except: pass&lt;/code&gt; is not error handling. Every expected failure mode should have an explicit response: retry, skip and log, alert, or halt.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No monitoring.&lt;/strong&gt; If the only way you learn about a pipeline failure is when an analyst asks &amp;quot;why is the dashboard empty?&amp;quot;, your observability is broken.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/debp/01/anti-patterns.png&quot; alt=&quot;Anti-patterns: monolithic pipeline, no monitoring, tool-first thinking&quot;&gt;&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick your most critical pipeline. Walk through the Five Questions Framework. Can you answer all five clearly and completely? If not, the gaps are your immediate priorities. Write down the answers, share them with your team, and use them as the specification for your next refactor.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Is Data Modeling? A Complete Guide</title><link>https://iceberglakehouse.com/posts/2026-02-dm-what-is-data-modeling/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-dm-what-is-data-modeling/</guid><description>
![Data entities connected by relationship lines forming a structured data model](/assets/images/data_modeling/01/data-modeling-overview.png)

Every d...</description><pubDate>Wed, 18 Feb 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/01/data-modeling-overview.png&quot; alt=&quot;Data entities connected by relationship lines forming a structured data model&quot;&gt;&lt;/p&gt;
&lt;p&gt;Every database, data warehouse, and data lakehouse starts with the same question: how should this data be organized? Data modeling answers that question by creating a structured blueprint of your data — what it contains, how it relates, and what it means.&lt;/p&gt;
&lt;p&gt;A data model is not a diagram you draw once and forget. It&apos;s a living definition of your business logic, encoded in the structure of your tables, columns, and relationships. Get it right, and every downstream consumer — dashboards, reports, AI agents, applications — works from the same shared understanding. Get it wrong, and you spend months untangling conflicting definitions of &amp;quot;customer,&amp;quot; &amp;quot;revenue,&amp;quot; and &amp;quot;active user.&amp;quot;&lt;/p&gt;
&lt;h2&gt;What Data Modeling Actually Means&lt;/h2&gt;
&lt;p&gt;Data modeling is the process of defining entities, attributes, and relationships for a dataset. Entities represent real-world objects or concepts (Customers, Orders, Products). Attributes describe those entities (customer name, order date, product price). Relationships define how entities connect (a customer &lt;em&gt;places&lt;/em&gt; an order, an order &lt;em&gt;contains&lt;/em&gt; products).&lt;/p&gt;
&lt;p&gt;The goal is to create a representation precise enough that a database can store the data reliably, and clear enough that a human — or an AI agent — can understand what the data means.&lt;/p&gt;
&lt;p&gt;Think of it as an architectural blueprint. You wouldn&apos;t build a house without one, and you shouldn&apos;t build a data platform without a data model.&lt;/p&gt;
&lt;h2&gt;The Three Levels of Data Modeling&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/01/three-levels-data-model.png&quot; alt=&quot;Conceptual, logical, and physical data models as three layers of increasing detail&quot;&gt;&lt;/p&gt;
&lt;p&gt;Data models operate at three levels of abstraction, each serving a different audience:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Audience&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Contains&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Conceptual&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business stakeholders&lt;/td&gt;
&lt;td&gt;Define &lt;em&gt;what&lt;/em&gt; data is needed&lt;/td&gt;
&lt;td&gt;Entities, relationships, business rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data architects&lt;/td&gt;
&lt;td&gt;Define &lt;em&gt;how&lt;/em&gt; data is structured&lt;/td&gt;
&lt;td&gt;Attributes, data types, normalization rules, keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Physical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database engineers&lt;/td&gt;
&lt;td&gt;Define &lt;em&gt;where and how&lt;/em&gt; data is stored&lt;/td&gt;
&lt;td&gt;Tables, columns, indexes, partitions, constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Conceptual models&lt;/strong&gt; capture business requirements without technical details. A conceptual model might say &amp;quot;Customers place Orders, and Orders contain Products.&amp;quot; It doesn&apos;t specify column types or index strategies. Its job is to align business stakeholders and technical teams on what data the system needs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Logical models&lt;/strong&gt; add precision. They define attributes (customer_id, customer_name, email), assign data types (INTEGER, VARCHAR, TIMESTAMP), and specify normalization rules. A logical model is independent of any specific database engine — it works whether you implement it in PostgreSQL, Snowflake, or Apache Iceberg.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Physical models&lt;/strong&gt; are implementation-specific. They define table names, column types for a specific DBMS, primary and foreign keys, indexes for query performance, and partitioning strategies. This is where theoretical design meets operational reality — storage formats, compression codecs, and file organization all matter here.&lt;/p&gt;
&lt;h2&gt;Common Data Modeling Techniques&lt;/h2&gt;
&lt;p&gt;Several techniques exist for organizing data. Each fits different use cases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Entity-Relationship (ER) Modeling&lt;/strong&gt; is the most widely used technique for transactional systems. It maps entities, attributes, and their relationships using formal diagrams. Most OLTP databases — the systems that power applications — start with an ER model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dimensional Modeling&lt;/strong&gt; organizes data into facts (measurable events like sales transactions) and dimensions (context like date, product, and customer). Star schemas and snowflake schemas are the two primary patterns. This technique dominates data warehousing and analytics.&lt;/p&gt;
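&lt;p&gt;A minimal star schema can be sketched as one fact table surrounded by dimension tables. The table and column names here are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Illustrative star schema: one fact table, two dimensions
CREATE TABLE dim_date (
  date_key  INTEGER PRIMARY KEY,
  full_date DATE,
  year      INTEGER,
  month     INTEGER
);

CREATE TABLE dim_product (
  product_key  INTEGER PRIMARY KEY,
  product_name VARCHAR(200),
  category     VARCHAR(100)
);

CREATE TABLE fact_sales (
  date_key    INTEGER REFERENCES dim_date (date_key),
  product_key INTEGER REFERENCES dim_product (product_key),
  quantity    INTEGER,
  amount      DECIMAL(10,2)
);

-- Analytics queries join the fact to its dimensions:
SELECT d.year, p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.year, p.category;
&lt;/code&gt;&lt;/pre&gt;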
&lt;p&gt;&lt;strong&gt;Data Vault Modeling&lt;/strong&gt; separates data into Hubs (business keys), Links (relationships), and Satellites (descriptive attributes with history). It&apos;s designed for environments where sources change frequently and full audit history matters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Graph Modeling&lt;/strong&gt; represents data as nodes (entities) and edges (relationships). It&apos;s useful when the relationships between data points are as important as the data itself — social networks, recommendation engines, fraud detection.&lt;/p&gt;
&lt;h2&gt;Why Data Modeling Matters More Than Ever&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/data_modeling/01/data-model-downstream.png&quot; alt=&quot;Data model feeding into dashboards, AI agents, and governance systems&quot;&gt;&lt;/p&gt;
&lt;p&gt;Three trends have made data modeling more critical, not less:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI needs structure to be accurate.&lt;/strong&gt; When an AI agent generates SQL, it relies on well-defined tables, clear column names, and documented relationships. A poorly modeled dataset forces the agent to guess which table contains &amp;quot;revenue&amp;quot; and which join path connects &amp;quot;customers&amp;quot; to &amp;quot;orders.&amp;quot; Those guesses create hallucinated queries that return wrong numbers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-service analytics depends on understandable data.&lt;/strong&gt; Business users exploring data in a BI tool can only self-serve if the data model is intuitive. When tables are named &lt;code&gt;stg_src_cust_v2_final&lt;/code&gt; with columns like &lt;code&gt;c1&lt;/code&gt;, &lt;code&gt;c2&lt;/code&gt;, &lt;code&gt;c3&lt;/code&gt;, even experienced analysts give up and file a ticket instead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compliance requires traceable definitions.&lt;/strong&gt; Regulations like GDPR and CCPA demand that organizations know what personal data they store, where it flows, and who can access it. A well-documented data model provides that traceability. Without one, compliance audits turn into archaeology projects.&lt;/p&gt;
&lt;p&gt;Platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio&lt;/a&gt; address this by letting you implement data models as virtual datasets (SQL views) organized in a Medallion Architecture — Bronze for raw data preparation, Silver for business logic and joins, Gold for application-specific outputs. The model exists as a logical layer without requiring physical data copies, and Wikis, Labels, and Fine-Grained Access Control add documentation and governance directly to the model.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick your five most-queried tables. For each one, answer three questions: What does each column mean? How does this table connect to other tables? Who is allowed to see which rows? If you can&apos;t answer all three confidently, your data model has gaps.&lt;/p&gt;
&lt;p&gt;Filling those gaps means defining clear entities, documenting attributes, and specifying relationships — the core of data modeling.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Is a Semantic Layer? A Complete Guide</title><link>https://iceberglakehouse.com/posts/2026-02-sl-what-is-a-semantic-layer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-sl-what-is-a-semantic-layer/</guid><description>
![Semantic layer concept — translating raw data into business terms](/assets/images/semantic_layer/01/semantic-layer-concept.png)

Ask three teams in...</description><pubDate>Wed, 18 Feb 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/01/semantic-layer-concept.png&quot; alt=&quot;Semantic layer concept — translating raw data into business terms&quot;&gt;&lt;/p&gt;
&lt;p&gt;Ask three teams in your company how they calculate &amp;quot;revenue&amp;quot; and you&apos;ll get three answers. Sales counts bookings. Finance counts recognized revenue. Marketing counts pipeline value. All three call it &amp;quot;revenue.&amp;quot; All three get different numbers. Nobody knows which one is right.&lt;/p&gt;
&lt;p&gt;This is the problem a semantic layer solves.&lt;/p&gt;
&lt;h2&gt;What a Semantic Layer Actually Is&lt;/h2&gt;
&lt;p&gt;A semantic layer is a logical abstraction between your raw data and the people (or AI agents) querying it. It maps technical database objects — tables, columns, join paths — to business-friendly terms like &amp;quot;Revenue,&amp;quot; &amp;quot;Active Customer,&amp;quot; or &amp;quot;Churn Rate.&amp;quot;&lt;/p&gt;
&lt;p&gt;It&apos;s not a database. It doesn&apos;t store data. It&apos;s a layer of definitions, calculations, and context that ensures every query against your data produces consistent results, regardless of which tool or person runs it.&lt;/p&gt;
&lt;p&gt;The concept isn&apos;t new. Business Objects introduced &amp;quot;universes&amp;quot; in the 1990s — metadata models that let users drag and drop business concepts instead of writing SQL. What&apos;s changed is scope. Modern semantic layers are universal (not tied to one BI tool), AI-aware (they provide context to language models), and governance-integrated (they enforce access policies alongside definitions).&lt;/p&gt;
&lt;h2&gt;What a Semantic Layer Contains&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/01/sl-components.png&quot; alt=&quot;Six key components of a semantic layer connected to a central hub&quot;&gt;&lt;/p&gt;
&lt;p&gt;A complete semantic layer includes six components:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Virtual datasets (Views)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL-defined business logic applied once and reused everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metric definitions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Canonical calculations for KPIs (e.g., MRR = SUM of active subscription revenue)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Documentation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human- and machine-readable descriptions of tables, columns, and relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Labels and tags&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Categorization for governance (PII, Finance) and discovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Join relationships&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-defined join paths so users don&apos;t need to know foreign keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Access policies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Row-level security and column masking enforced at the layer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The key insight: these components serve both human analysts and AI agents. When an AI generates SQL from a natural language question, it consults this same layer to understand what &amp;quot;revenue&amp;quot; means, which tables to join, and which columns to filter.&lt;/p&gt;
&lt;h2&gt;How It Works in Practice&lt;/h2&gt;
&lt;p&gt;Here&apos;s what happens when someone queries data through a semantic layer:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A user (or AI agent) asks: &amp;quot;What was revenue by region last quarter?&amp;quot;&lt;/li&gt;
&lt;li&gt;The semantic layer translates:
&lt;ul&gt;
&lt;li&gt;&amp;quot;Revenue&amp;quot; → &lt;code&gt;SUM(orders.total) WHERE orders.status = &apos;completed&apos;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;Region&amp;quot; → &lt;code&gt;customers.region&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&amp;quot;Last quarter&amp;quot; → &lt;code&gt;WHERE order_date BETWEEN &apos;2025-10-01&apos; AND &apos;2025-12-31&apos;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The query engine generates optimized SQL against the underlying data sources&lt;/li&gt;
&lt;li&gt;Results are returned using business terms, not raw column names&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The user never writes SQL. The AI never guesses at column names. The metric definition is applied identically whether the query runs in a dashboard, a Python notebook, or a chat interface.&lt;/p&gt;
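&lt;p&gt;One way to make the translation in step 2 concrete is a view defined once in the semantic layer. The sketch below uses illustrative table and column names rather than any specific product&apos;s syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Illustrative semantic-layer view: the canonical definition of
-- revenue by region, reused by every dashboard, notebook, and AI agent
CREATE VIEW revenue_by_region AS
SELECT
  c.region,
  SUM(o.total) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id   -- pre-defined join path
WHERE o.status = &apos;completed&apos;                  -- what revenue means here
GROUP BY c.region;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the filter and the join path live in the view, no consumer can accidentally count incomplete orders as revenue.&lt;/p&gt;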
&lt;h2&gt;Why It Matters Now More Than Ever&lt;/h2&gt;
&lt;p&gt;Three trends are making semantic layers essential, not optional.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AI agents need business context.&lt;/strong&gt; Large language models generating SQL will hallucinate column names, use incorrect aggregation logic, and join tables wrong unless they have explicit definitions to work from. A semantic layer provides that grounding. This is why platforms like &lt;a href=&quot;https://www.dremio.com/blog/agentic-analytics-semantic-layer/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Dremio embed a semantic layer directly into the query engine&lt;/a&gt; — it&apos;s the context that makes the AI accurate instead of confidently wrong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Self-service analytics demands accessibility.&lt;/strong&gt; Business users want to query data without filing a ticket. Exposing raw database schemas to non-technical users creates more problems than it solves. A semantic layer presents data in terms people already understand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Governance requires centralized definitions.&lt;/strong&gt; GDPR, CCPA, and industry regulations require organizations to know what data they have, who can access it, and how it&apos;s used. A semantic layer centralizes these definitions and enforces access policies in one place instead of across dozens of tools.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/assets/images/semantic_layer/01/without-vs-with.png&quot; alt=&quot;Without vs. with a semantic layer — from metric chaos to alignment&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Common Misconceptions&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;It&apos;s just a data catalog.&amp;quot;&lt;/strong&gt; A data catalog is an inventory — it tells you what data exists. A semantic layer defines what data &lt;em&gt;means&lt;/em&gt; and how to calculate it. You need both. They&apos;re complementary, not interchangeable. (See: Semantic Layer vs. Data Catalog)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;It&apos;s just a BI tool feature.&amp;quot;&lt;/strong&gt; Some BI tools include semantic models (Looker&apos;s LookML, Power BI&apos;s datasets). But these are tool-specific. If your organization uses three BI tools, you maintain three separate semantic models. A universal semantic layer defines metrics once and serves them to every tool.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&amp;quot;It adds a performance penalty.&amp;quot;&lt;/strong&gt; Modern semantic layers don&apos;t just translate queries — they optimize them. Dremio, for example, uses &lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-reflections-outsmart-traditional-materialized-views/?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Reflections&lt;/a&gt; (pre-computed, physically optimized data copies) to accelerate queries that pass through its semantic layer. The result is often faster than querying raw tables directly.&lt;/p&gt;
&lt;h2&gt;What to Do Next&lt;/h2&gt;
&lt;p&gt;Pick your organization&apos;s five most important metrics. Ask two different teams how each one is calculated. If the answers don&apos;t match, that&apos;s your signal. You don&apos;t have a semantic layer problem — you have a trust problem, and a semantic layer is how you fix it.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/get-started?utm_source=ev_buffer&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=next-gen-dremio&amp;amp;utm_term=blog-021826-02-18-2026&amp;amp;utm_content=alexmerced&quot;&gt;Try Dremio Cloud free for 30 days&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A 2026 Introduction to Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2026-02-intro-to-Apache-Iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-02-intro-to-Apache-Iceberg/</guid><description>
Apache Iceberg is an open-source table format for large analytic datasets. It defines how data files stored on object storage (S3, ADLS, GCS) are org...</description><pubDate>Fri, 13 Feb 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Apache Iceberg is an open-source table format for large analytic datasets. It defines how data files stored on object storage (S3, ADLS, GCS) are organized into a logical table with a schema, partition layout, and consistent point-in-time snapshots. If you&apos;ve heard the term &amp;quot;data lakehouse,&amp;quot; Iceberg is the layer that makes it possible by bringing warehouse-grade reliability to data lake storage.&lt;/p&gt;
&lt;p&gt;This post covers what Iceberg is, how its metadata works under the hood, what changed across specification versions 1 through 3, what&apos;s being proposed for v4, and how to get started using Iceberg tables with &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio&lt;/a&gt; in about ten minutes.&lt;/p&gt;
&lt;h2&gt;Where Iceberg Came From&lt;/h2&gt;
&lt;p&gt;Before Iceberg, most data lake tables used the Hive table format. Hive tracks data by directory paths: one directory per partition, with files inside. That works fine for small tables, but it breaks down at scale. Listing files across thousands of partition directories takes minutes. Schema changes require careful coordination. There&apos;s no isolation between readers and writers, so concurrent queries can return inconsistent results.&lt;/p&gt;
&lt;p&gt;Netflix hit all of these problems in production around 2017. Ryan Blue and Dan Weeks designed Iceberg to solve them by tracking individual files instead of directories, using file-level metadata instead of a central metastore, and requiring atomic commits for every change. Netflix open-sourced the project, and it entered the Apache Incubator in 2018. By May 2020, Iceberg graduated to an Apache Top-Level Project. Today it&apos;s the de facto open table format, adopted by AWS, Google, Snowflake, Databricks, Dremio, Cloudera, and dozens of other vendors.&lt;/p&gt;
&lt;h2&gt;How Iceberg&apos;s Metadata Works&lt;/h2&gt;
&lt;p&gt;Iceberg replaces directory listings with a tree of metadata files. Each layer in the tree stores progressively finer details about the table&apos;s contents.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw96gvuaqp4f6gi5leks.png&quot; alt=&quot;Apache Iceberg metadata tree: catalog pointer, metadata file, manifest list, manifest files, and data files&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Catalog Pointer:&lt;/strong&gt; The catalog (Polaris, Glue, Nessie, or any REST catalog implementation) stores a single pointer to the current metadata file. This is the entry point.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata File (JSON):&lt;/strong&gt; Contains the current schema, partition specs, sort orders, snapshot list, and table properties. Every write creates a new metadata file and atomically swaps the catalog pointer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manifest List (Avro):&lt;/strong&gt; One per snapshot. Lists all manifest files belonging to that snapshot, along with partition-level summary stats. Query engines use these stats to skip entire manifests that can&apos;t match a query&apos;s filter.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Manifest Files (Avro):&lt;/strong&gt; Each manifest tracks a set of data files and stores per-file statistics: file path, partition tuple, record count, and column-level min, max, and null counts. These stats enable file-level pruning during scan planning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Files (Parquet/ORC/Avro):&lt;/strong&gt; The actual rows, stored in columnar format. Iceberg itself is format-agnostic, though Parquet is the most common choice.&lt;/p&gt;
&lt;p&gt;This structure means scan planning reads a fixed, shallow tree of metadata files rather than listing every partition directory, so planning cost stays flat as a table grows to thousands of partitions. That&apos;s the core architectural advantage.&lt;/p&gt;
&lt;h2&gt;Spec Versions: V1 Through V3 (and V4 Proposals)&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/drqplx888rtwbfyo9czk.png&quot; alt=&quot;Apache Iceberg specification versions 1 through 4&quot;&gt;&lt;/p&gt;
&lt;h3&gt;Version 1: Analytic Tables (2017–2020)&lt;/h3&gt;
&lt;p&gt;V1 established the fundamentals: immutable data files, snapshot-based tracking, manifest-level file stats, hidden partitioning, and schema evolution via unique column IDs. Operations were limited to appends and full-partition overwrites.&lt;/p&gt;
&lt;h3&gt;Version 2: Row-Level Deletes (~2022)&lt;/h3&gt;
&lt;p&gt;V2 added delete files that encode which rows to remove from existing data files. Position delete files list specific (file, row-number) pairs. Equality delete files specify column values that identify deleted rows. This made UPDATE, DELETE, and MERGE possible without rewriting entire data files. V2 also introduced sequence numbers for ordering concurrent writes and resolving commit conflicts through optimistic concurrency.&lt;/p&gt;
&lt;h3&gt;Version 3: Extended Capabilities (May 2025)&lt;/h3&gt;
&lt;p&gt;V3 brought several major additions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Deletion Vectors:&lt;/strong&gt; Binary bitmaps stored in Puffin files that replace position deletes. More compact in storage and faster to apply during reads. At most one deletion vector per data file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row Lineage:&lt;/strong&gt; Per-snapshot tracking of row-level identity (&lt;code&gt;first-row-id&lt;/code&gt;, &lt;code&gt;added-rows&lt;/code&gt;). This enables efficient change data capture (CDC) pipelines directly on Iceberg tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New Data Types:&lt;/strong&gt; &lt;code&gt;variant&lt;/code&gt; for semi-structured data, &lt;code&gt;geometry&lt;/code&gt; and &lt;code&gt;geography&lt;/code&gt; for geospatial workloads, and nanosecond-precision timestamps (&lt;code&gt;timestamp_ns&lt;/code&gt;, &lt;code&gt;timestamptz_ns&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Default Values:&lt;/strong&gt; Columns can specify &lt;code&gt;write-default&lt;/code&gt; and &lt;code&gt;initial-default&lt;/code&gt; values, making schema evolution smoother.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Argument Transforms:&lt;/strong&gt; Partition and sort transforms can accept multiple input columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table Encryption Keys:&lt;/strong&gt; Built-in support for encrypting data at rest.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Version 4: Active Proposals (2025–2026)&lt;/h3&gt;
&lt;p&gt;The community is actively discussing several changes for a future v4 spec:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Single-file commits&lt;/strong&gt; would consolidate all metadata changes into one file per commit, reducing I/O overhead for high-write workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parquet for metadata&lt;/strong&gt; would replace the Avro-encoded manifest files with Parquet, enabling columnar reads of metadata (only load the fields you need).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Relative path support&lt;/strong&gt; would store file references relative to the table root, simplifying table migration and replication without metadata rewrites.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved column statistics&lt;/strong&gt; would add more granular stats for better query planning and change detection.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Key Features Worth Knowing&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions:&lt;/strong&gt; Every commit is atomic with serializable isolation. Readers never see partial writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution:&lt;/strong&gt; Add, drop, rename, or reorder columns safely. Iceberg uses unique field IDs, so renaming a column doesn&apos;t break older data files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Evolution:&lt;/strong&gt; Change your partitioning strategy without rewriting existing data. Old and new partition layouts coexist. Queries filter on data values, not partition columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hidden Partitioning:&lt;/strong&gt; Users query raw values (&lt;code&gt;WHERE order_date = &apos;2025-06-15&apos;&lt;/code&gt;). Iceberg applies transforms (&lt;code&gt;month&lt;/code&gt;, &lt;code&gt;day&lt;/code&gt;, &lt;code&gt;bucket&lt;/code&gt;, &lt;code&gt;truncate&lt;/code&gt;) automatically. No synthetic partition columns in the schema.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Travel:&lt;/strong&gt; Query any previous snapshot by ID or timestamp. Roll back to a known-good state in one command.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Branching and Tagging:&lt;/strong&gt; Named references to specific snapshots, useful for write-audit-publish workflows and staging environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Engine Access:&lt;/strong&gt; The same Iceberg table is readable and writable from Spark, Flink, Trino, Dremio, DuckDB, Snowflake, BigQuery, Presto, and others.&lt;/li&gt;
&lt;/ul&gt;
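&lt;p&gt;Time travel and rollback are one-liners in most engines. The exact syntax varies by engine; the sketch below uses Dremio-style SQL, and the snapshot ID is made up for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query the table as of a past snapshot or point in time
-- (the snapshot ID below is illustrative)
SELECT * FROM db.schema.sales AT SNAPSHOT &apos;4132119532727284872&apos;;
SELECT * FROM db.schema.sales AT TIMESTAMP &apos;2025-06-15 00:00:00&apos;;

-- Roll back to a known-good state
ROLLBACK TABLE db.schema.sales TO SNAPSHOT &apos;4132119532727284872&apos;;
&lt;/code&gt;&lt;/pre&gt;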
&lt;h2&gt;The Value of the REST Catalog Spec&lt;/h2&gt;
&lt;p&gt;Iceberg&apos;s REST Catalog Specification defines an HTTP API for table management. Any engine that speaks HTTP can create, list, read, and commit to Iceberg tables without importing a Java SDK. That&apos;s significant because it makes catalog access language-agnostic (Python, Rust, Go, JavaScript) and cloud-agnostic (AWS, GCP, Azure). It also enables server-side features like credential vending (short-lived storage tokens per request), commit deconfliction, and multi-table transactions.&lt;/p&gt;
&lt;p&gt;Several projects implement the REST Catalog spec: &lt;a href=&quot;https://polaris.apache.org/&quot;&gt;Apache Polaris&lt;/a&gt;, Project Nessie, Unity Catalog, AWS Glue (via adapter), and Snowflake Open Catalog. This means you can pick a catalog implementation without locking in your query engines. Every engine points at the same REST endpoint.&lt;/p&gt;
&lt;h2&gt;Getting Started: Apache Iceberg on Dremio&lt;/h2&gt;
&lt;p&gt;You can get hands-on with Iceberg tables right now using Dremio Cloud. Here&apos;s the quick path:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Sign up at &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;dremio.com/get-started&lt;/a&gt;.&lt;/strong&gt; You&apos;ll get a free 30-day trial. Dremio creates a lakehouse project and an Open Catalog (powered by Apache Polaris) automatically at signup.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2. Create an Iceberg table and insert data:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE FOLDER IF NOT EXISTS db;
CREATE FOLDER IF NOT EXISTS db.schema;

CREATE TABLE db.schema.sales (
  order_id INT,
  customer_name VARCHAR,
  product VARCHAR,
  quantity INT,
  order_date DATE,
  total_amount DECIMAL(10,2)
) PARTITION BY (MONTH(order_date));

INSERT INTO db.schema.sales VALUES
  (1, &apos;Alice Chen&apos;, &apos;Widget A&apos;, 10, DATE &apos;2025-01-15&apos;, 150.00),
  (2, &apos;Bob Smith&apos;, &apos;Widget B&apos;, 5, DATE &apos;2025-01-20&apos;, 75.00),
  (3, &apos;Carol Davis&apos;, &apos;Widget A&apos;, 8, DATE &apos;2025-02-10&apos;, 120.00);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice the &lt;code&gt;PARTITION BY (MONTH(order_date))&lt;/code&gt;. That&apos;s hidden partitioning in action. You query &lt;code&gt;order_date&lt;/code&gt; directly; Iceberg handles the partitioning.&lt;/p&gt;
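&lt;p&gt;To see the benefit, filter on the raw date column. A query like this hypothetical follow-up to the inserts above lets Iceberg prune down to the January partition without any synthetic partition column in the predicate:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Filter on order_date directly; Iceberg maps the predicate
-- to the MONTH(order_date) partition transform for pruning
SELECT product, SUM(total_amount) AS revenue
FROM db.schema.sales
WHERE order_date BETWEEN DATE &apos;2025-01-01&apos; AND DATE &apos;2025-01-31&apos;
GROUP BY product;
&lt;/code&gt;&lt;/pre&gt;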
&lt;p&gt;&lt;strong&gt;3. Query the metadata tables.&lt;/strong&gt; Dremio exposes Iceberg&apos;s metadata through &lt;code&gt;TABLE()&lt;/code&gt; functions. These let you inspect the internal state of your table without touching the raw metadata files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- View all snapshots (who committed what and when)
SELECT * FROM TABLE(table_snapshot(&apos;db.schema.sales&apos;));

-- View commit history
SELECT * FROM TABLE(table_history(&apos;db.schema.sales&apos;));

-- View manifest file details
SELECT * FROM TABLE(table_manifests(&apos;db.schema.sales&apos;));

-- View partition statistics
SELECT * FROM TABLE(table_partitions(&apos;db.schema.sales&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;table_snapshot&lt;/code&gt; query shows each snapshot ID, timestamp, and the operation that created it (append, overwrite, delete). The &lt;code&gt;table_manifests&lt;/code&gt; query reveals how many data files and delete files exist in each manifest. Run these after each INSERT or DELETE to see how Iceberg tracks changes internally.&lt;/p&gt;
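&lt;p&gt;For example, a row-level delete produces a new snapshot you can observe immediately (the exact columns returned by the metadata functions may vary by Dremio version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- A row-level delete creates a new snapshot with delete metadata
-- rather than rewriting whole data files
DELETE FROM db.schema.sales WHERE order_id = 2;

-- The newest snapshot should now show a delete operation
SELECT * FROM TABLE(table_snapshot(&apos;db.schema.sales&apos;));
&lt;/code&gt;&lt;/pre&gt;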
&lt;h2&gt;Go Deeper&lt;/h2&gt;
&lt;p&gt;This post covers the essentials, but Iceberg&apos;s spec and ecosystem run deep. If you want the full picture, three books cover the subject end to end:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html&quot;&gt;&lt;strong&gt;Apache Iceberg: The Definitive Guide&lt;/strong&gt;&lt;/a&gt; (O&apos;Reilly) by Tomer Shiran, Jason Hughes, and Alex Merced. Free download from Dremio.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-the-definitive-guide-reg.html&quot;&gt;&lt;strong&gt;Apache Polaris: The Definitive Guide&lt;/strong&gt;&lt;/a&gt; (O&apos;Reilly) by Alex Merced, Andrew Madson, and Tomer Shiran. Free download from Dremio.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.manning.com/books/architecting-an-apache-iceberg-lakehouse&quot;&gt;&lt;strong&gt;Architecting an Apache Iceberg Lakehouse&lt;/strong&gt;&lt;/a&gt; (Manning) by Alex Merced. A hands-on guide to designing modular lakehouse architectures with Spark, Flink, Dremio, and Polaris.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Between these three resources and a free Dremio Cloud trial, you&apos;ll have everything you need to build on Apache Iceberg in production.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Developer Community&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Join the Dremio Developer Community Slack to learn more about Apache Iceberg, Data Lakehouses, and Agentic Analytics.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>RAG Isn’t a Modeling Problem. It’s a Data Engineering Problem.</title><link>https://iceberglakehouse.com/posts/2026-01-rag-isnt-the-problem/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-01-rag-isnt-the-problem/</guid><description>**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](htt...</description><pubDate>Tue, 20 Jan 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Retrieval-augmented generation looks deceptively simple.&lt;br&gt;
Embed documents.&lt;br&gt;
Store vectors.&lt;br&gt;
Retrieve context.&lt;br&gt;
Ask an LLM to answer questions.&lt;/p&gt;
&lt;p&gt;Early demos reinforce this illusion. A small corpus. Clean documents. Few users. Results look impressive. Many teams conclude that success depends on choosing the right model or the best vector database.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/bchA2fV.png&quot; alt=&quot;Rag is not so easy&quot;&gt;&lt;/p&gt;
&lt;p&gt;That assumption breaks down fast.&lt;/p&gt;
&lt;p&gt;Once RAG systems move into real enterprise environments, progress stalls. Accuracy plateaus. Latency spikes. Answers lose trust. Security teams raise alarms. Engineering teams realize the bottleneck is not the model.&lt;/p&gt;
&lt;p&gt;It is the data.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/5QA6zcM.png&quot; alt=&quot;Bottlenecks&quot;&gt;&lt;/p&gt;
&lt;p&gt;Most organizations do not suffer from a lack of embeddings. They suffer from fragmented data, unclear definitions, inconsistent permissions, and legacy systems never designed for AI access. RAG exposes these weaknesses immediately. It does not hide them.&lt;/p&gt;
&lt;p&gt;This is why RAG is turning into a data engineering problem first, and a modeling problem second.&lt;/p&gt;
&lt;h2&gt;Where RAG Systems Actually Break Down&lt;/h2&gt;
&lt;p&gt;Enterprise data is messy by default. It lives across warehouses, lakes, SaaS tools, document systems, and operational databases. Each source uses different schemas, naming conventions, and access rules. RAG systems must unify all of it before retrieval even begins.&lt;/p&gt;
&lt;p&gt;Data quality issues amplify the problem. Duplicate documents inflate embeddings. Stale records surface outdated answers. Inconsistent metadata makes relevance scoring unreliable. The model retrieves content correctly, but the content itself is wrong.&lt;/p&gt;
&lt;p&gt;Governance is the most underestimated failure point. Many RAG pipelines ignore permissions or apply them too late. This creates two bad outcomes. Either the system leaks sensitive data, or engineers restrict access so aggressively that answers become incomplete. Both outcomes erode trust.&lt;/p&gt;
&lt;p&gt;Semantic ambiguity adds another layer of friction. Business terms rarely mean one thing. “Revenue,” “active customer,” or “churn” vary by team and context. Vector similarity cannot resolve these differences. Without shared definitions, RAG systems retrieve text, not meaning.&lt;/p&gt;
&lt;p&gt;These failures have nothing to do with LLM quality. They stem from weak data foundations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/QCug4pf.png&quot; alt=&quot;The Real Data Problem&quot;&gt;&lt;/p&gt;
&lt;p&gt;As a result, teams over-engineer retrieval layers while under-investing in context. They tune indexes. Swap vector databases. Adjust chunk sizes. The core issues remain.&lt;/p&gt;
&lt;p&gt;RAG systems succeed when they start with governed, well-defined, and accessible data. When they do not, no amount of modeling innovation compensates for the gap.&lt;/p&gt;
&lt;h2&gt;Are Vector Databases Over-Engineered for Most Teams?&lt;/h2&gt;
&lt;p&gt;Vector databases became the default RAG component for a simple reason. They solved a real problem early. Fast similarity search over high-dimensional embeddings was hard to do well. Purpose-built systems filled that gap.&lt;/p&gt;
&lt;p&gt;The problem is that the industry quickly treated them as mandatory infrastructure.&lt;/p&gt;
&lt;p&gt;For many enterprise use cases, that assumption does not hold. Most RAG workloads do not start at billion-scale embeddings. They start with thousands or tens of thousands of documents. At that scale, established systems like Postgres with pgvector or search engines with vector support perform well enough.&lt;/p&gt;
&lt;p&gt;These platforms already exist in most organizations. They are governed. They are monitored. They are understood by operations teams. Adding vector search to them is often cheaper and faster than introducing a new system.&lt;/p&gt;
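&lt;p&gt;To make the scale argument concrete, here is a minimal sketch of what “well enough” means at this size: an exact, brute-force similarity scan over a few thousand synthetic vectors in plain Python. The corpus size and dimensions are invented for illustration; a real deployment would use pgvector or a search engine, but the point is that nothing exotic is required at this scale.&lt;/p&gt;

```python
# Sketch: exact nearest-neighbor search over a modest corpus.
# The vectors are random stand-ins for real embeddings.
import math
import random

random.seed(0)
DIM, N_DOCS = 64, 5_000
corpus = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_DOCS)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, k=5):
    # One exact pass over the whole corpus -- no ANN index needed here.
    scores = [(cosine(query, corpus[i]), i) for i in range(N_DOCS)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Querying with a document already in the corpus returns that document first.
hits = top_k(corpus[42])
```

&lt;p&gt;Only when corpora grow by orders of magnitude does an approximate index, and the infrastructure around it, start to pay for itself.&lt;/p&gt;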
&lt;p&gt;Specialized vector databases still have a role. At large scale, with strict latency requirements and high concurrency, optimized ANN indexes and distributed architectures matter. The tipping point is real. It just arrives later than vendors suggest.&lt;/p&gt;
&lt;p&gt;The mistake is not using vector databases. The mistake is leading with them.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/7y5H7hZ.png&quot; alt=&quot;The mistake is not using vector databases. The mistake is leading with them.&quot;&gt;&lt;/p&gt;
&lt;p&gt;When teams optimize the vector layer first, they ignore higher-impact problems. Data duplication. Permission enforcement. Metadata consistency. Hybrid retrieval logic. These issues dominate cost and complexity long before vector search performance does.&lt;/p&gt;
&lt;h2&gt;Hybrid Search Is the Norm, Not the Exception&lt;/h2&gt;
&lt;p&gt;Vector search alone is rarely sufficient. Keyword search alone is rarely sufficient. Production RAG systems need both.&lt;/p&gt;
&lt;p&gt;Keywords provide precision. Vectors provide semantic recall. Together, they outperform either approach in isolation. This pattern shows up consistently across enterprise deployments.&lt;/p&gt;
&lt;p&gt;Despite advances in embedding models, keyword search is not becoming obsolete. Embeddings still struggle with exact matches, rare identifiers, and domain-specific language. They also struggle when the query intent is narrow and literal.&lt;/p&gt;
&lt;p&gt;As a result, teams maintain two indexes. One lexical. One vector. They fuse results during retrieval or re-ranking. This adds operational cost, but it improves answer quality.&lt;/p&gt;
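&lt;p&gt;One common way to fuse the two rankings is Reciprocal Rank Fusion (RRF). The sketch below uses hypothetical document IDs; a production system would feed in real result lists from a lexical index and a vector index.&lt;/p&gt;

```python
# Sketch: Reciprocal Rank Fusion over a lexical and a vector ranking.
# Document IDs are illustrative.
def rrf(rankings, k=60):
    """Each ranking is a list of doc IDs, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_7", "doc_2", "doc_9"]   # exact-match precision
vector_hits  = ["doc_2", "doc_5", "doc_7"]   # semantic recall
fused = rrf([keyword_hits, vector_hits])
# Documents appearing in both lists accumulate score from each side,
# so agreement between the two indexes pushes a result upward.
```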
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/KpDU6Wu.png&quot; alt=&quot;Hybrid retrieval should be an assumption, not an optimization.&quot;&gt;&lt;/p&gt;
&lt;p&gt;Some hope that better models will eliminate this complexity. That is unlikely in the near term. Language is both semantic and symbolic. Search systems must reflect that reality.&lt;/p&gt;
&lt;p&gt;The practical takeaway is simple. Hybrid retrieval should be an assumption, not an optimization. Architectures that treat vector search as a drop-in replacement for text search fail under real workloads.&lt;/p&gt;
&lt;h2&gt;Latency Changes Every Design Decision&lt;/h2&gt;
&lt;p&gt;Real-time RAG systems operate under tight latency budgets. Users expect responses in seconds, not tens of seconds. Retrieval time competes directly with model inference time.&lt;/p&gt;
&lt;p&gt;To stay within budget, teams make trade-offs. They cache results. Use approximate search. Reduce embedding size. Retrieve fewer documents. Choose smaller or faster models.&lt;/p&gt;
&lt;p&gt;Each choice sacrifices something. Recall. Freshness. Completeness.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jw9h3okkfm764ipq4kwy.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;The best systems compensate by pushing intelligence closer to the data. Precomputed results. Materialized views. Semantic caching. These techniques reduce work at query time and stabilize performance.&lt;/p&gt;
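&lt;p&gt;Semantic caching is the least familiar of these techniques, so here is a minimal sketch. The bag-of-words “embedding” and the 0.5 threshold are toy stand-ins; a real cache would use a proper embedding model and a tuned similarity cutoff.&lt;/p&gt;

```python
# Sketch: a semantic cache -- reuse a stored answer when a new query is
# close enough to one already answered, skipping retrieval and inference.
import math

def embed(text):
    # Toy stand-in for an embedding model: word-count vectors.
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def similarity(a, b):
    dot = sum(a[w] * b[w] for w in set(a).intersection(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if dot else 0.0

class SemanticCache:
    def __init__(self, threshold=0.5):  # toy threshold, not a recommendation
        self.threshold = threshold
        self.entries = []  # (embedding, answer) pairs

    def get(self, query):
        q = embed(query)
        for emb, answer in self.entries:
            if similarity(q, emb) >= self.threshold:
                return answer
        return None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("what is our q3 revenue", "Q3 revenue was $12M.")
hit = cache.get("what was revenue in q3")      # paraphrase reuses the answer
miss = cache.get("how do I reset my password")  # unrelated query falls through
```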
&lt;p&gt;Once again, the bottleneck is not the model. It is the architecture around the data.&lt;/p&gt;
&lt;h2&gt;The Missing Layer: Semantic Context&lt;/h2&gt;
&lt;p&gt;Most RAG architectures treat embeddings as context. That is a mistake.&lt;/p&gt;
&lt;p&gt;Embeddings capture similarity, not meaning. They do not encode business logic, metric definitions, or governance rules. They do not understand which tables represent the same concept, or which fields are authoritative.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ax0vnqagp3ys56umjkj.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is where many systems quietly fail. AI agents retrieve text fragments without understanding how those fragments relate. Answers may be syntactically correct but semantically wrong.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7g3ewdoxn51kz0hcadce.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;A semantic layer changes this dynamic. It provides shared definitions, governed access, and a consistent abstraction over raw data. Instead of retrieving arbitrary documents, AI agents retrieve &lt;em&gt;meaningful concepts&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dlu6sjseno7sy7hxpqhi.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This reduces ambiguity. It improves trust. It lowers the cognitive load on both users and models.&lt;/p&gt;
&lt;p&gt;More importantly, it shifts RAG from document search to reasoning over data.&lt;/p&gt;
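&lt;p&gt;A semantic layer can be illustrated in miniature. The metric names, SQL expressions, and roles below are all hypothetical, but the shape is the point: one shared definition per business term, with access enforced before anything is retrieved.&lt;/p&gt;

```python
# Sketch: a miniature semantic layer -- governed, shared definitions
# instead of arbitrary text fragments. All names here are invented.
SEMANTIC_LAYER = {
    "active_customer": {
        "definition": "Customer with at least one order in the last 90 days.",
        "expression": "COUNT(DISTINCT customer_id) WHERE last_order_date >= CURRENT_DATE - 90",
        "allowed_roles": {"analyst", "finance"},
    },
    "revenue": {
        "definition": "Recognized revenue, net of refunds, in USD.",
        "expression": "SUM(amount_usd) - SUM(refund_usd)",
        "allowed_roles": {"finance"},
    },
}

def resolve(term, role):
    """Return the governed definition of a business term, or raise."""
    metric = SEMANTIC_LAYER.get(term)
    if metric is None:
        raise KeyError(f"no shared definition for {term!r}")
    if role not in metric["allowed_roles"]:
        raise PermissionError(f"role {role!r} may not use {term!r}")
    return metric["definition"], metric["expression"]

# An agent asking about revenue gets the authoritative definition,
# and permissions are checked before the answer is assembled.
definition, sql = resolve("revenue", "finance")
```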
&lt;h2&gt;From RAG Pipelines to Agentic Architectures&lt;/h2&gt;
&lt;p&gt;As systems evolve, retrieval alone is not enough. AI agents need to ask follow-up questions, call tools, execute queries, and reason across steps.&lt;/p&gt;
&lt;p&gt;This requires structured access to data, not just text chunks. It also requires standard interfaces so agents can operate across clients and environments.&lt;/p&gt;
&lt;p&gt;Open protocols like MCP reflect this shift. They decouple AI agents from specific tools and allow shared context to be reused across applications. This moves RAG closer to a platform capability than a one-off pipeline.&lt;/p&gt;
&lt;p&gt;In this world, the value is not in where vectors live. The value is in how context is defined, governed, and exposed.&lt;/p&gt;
&lt;h2&gt;Conclusion: Stop Optimizing the Wrong Layer&lt;/h2&gt;
&lt;p&gt;RAG failures rarely come from weak models. They come from weak data foundations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hz8ll1791x9z8q6eguw5.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Enterprises over-invest in vector infrastructure while under-investing in semantics, governance, and architectural coherence. The result is expensive systems that scale poorly and fail to earn trust.&lt;/p&gt;
&lt;p&gt;The most resilient approaches treat RAG as a data platform problem. They start with open storage, shared definitions, hybrid retrieval, and performance optimizations that benefit every workload.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xjcm3d7ov49ltujhaj9o.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is where lakehouse-native architectures stand out. Platforms like Dremio focus on unifying data access, enforcing semantics, and accelerating queries across sources without duplication. When AI agents are layered on top of that foundation, retrieval becomes simpler, safer, and faster by default.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ojzh2yl2i5ttb4rgdanp.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;As models continue to improve, data problems will remain. Teams that solve for context, not just embeddings, will be the ones that scale AI beyond demos and into durable systems.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Practical Guide to AI-Assisted Coding Tools</title><link>https://iceberglakehouse.com/posts/2026-01-a-practical-guide-to-ai-assisted-coding-tools/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-01-a-practical-guide-to-ai-assisted-coding-tools/</guid><description>
**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](h...</description><pubDate>Thu, 15 Jan 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;AI-assisted coding is no longer a novelty. It is becoming a core part of how software gets built.&lt;/p&gt;
&lt;p&gt;For years, these tools were easy to describe. They were autocomplete engines. They helped you write boilerplate faster and saved a few keystrokes. Useful, but limited.&lt;/p&gt;
&lt;p&gt;That changed quickly.&lt;/p&gt;
&lt;p&gt;Over the last two years, large language models gained larger context windows, stronger reasoning, and the ability to use tools. At the same time, AI assistants moved closer to the developer workflow. They gained access to repositories, terminals, build systems, tests, and browsers. What emerged was not just better autocomplete, but something closer to a collaborator.&lt;/p&gt;
&lt;p&gt;Today, “AI coding tools” covers a wide range of products. Some live in the terminal and act as autonomous agents. Others are AI-native editors built around chat and planning. Many integrate directly into existing IDEs and quietly assist as you type. Each category solves different problems and comes with different tradeoffs.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/phtd3xiud0aue8np37bz.png&quot; alt=&quot;This creates confusion for developers trying to make sense of the space.&quot;&gt;&lt;/p&gt;
&lt;p&gt;This creates confusion for developers trying to make sense of the space. Should you use a CLI agent or an IDE plugin? When does an AI-first editor make sense? How much autonomy is helpful before it becomes risky? And how do pricing, privacy, and workflow fit into the decision?&lt;/p&gt;
&lt;p&gt;This blog is a practical guide to that landscape. We will categorize the major types of AI-assisted coding tools, compare how they work, and explain when each approach makes sense. The goal is not to crown a single “best” tool, but to give you a clear mental model for choosing the right one for your work.&lt;/p&gt;
&lt;h2&gt;The Core Taxonomy of AI Coding Tools&lt;/h2&gt;
&lt;p&gt;Before comparing individual products, it helps to understand how these tools differ at a structural level. Most confusion in this space comes from treating all AI coding tools as the same thing. They are not.&lt;/p&gt;
&lt;p&gt;There are three dimensions that matter most: how you interact with the tool, where it runs, and how much autonomy it has.&lt;/p&gt;
&lt;h3&gt;Interaction Model&lt;/h3&gt;
&lt;p&gt;Some tools are designed to assist while you type. These focus on inline suggestions and small edits. You stay in control at all times, and the AI reacts to your actions.&lt;/p&gt;
&lt;p&gt;Others are chat-driven. You describe what you want in natural language, and the tool responds with explanations, code snippets, or suggested changes. These are useful for learning, debugging, and reasoning about unfamiliar code.&lt;/p&gt;
&lt;p&gt;The newest category is agent-based. These tools accept a goal, break it into steps, and execute those steps across files and tools. They plan, act, and revise, often with minimal input once started.&lt;/p&gt;
&lt;h3&gt;Execution Surface&lt;/h3&gt;
&lt;p&gt;Where a tool lives shapes how powerful it can be.&lt;/p&gt;
&lt;p&gt;Terminal-based tools operate directly on your filesystem and development tools. They can run tests, modify many files, and integrate naturally with scripting and automation workflows.&lt;/p&gt;
&lt;p&gt;IDE-native editors are built around AI as a first-class concept. They blend editing, chat, execution, and preview into a single environment designed for iterative work with an assistant.&lt;/p&gt;
&lt;p&gt;IDE plugins integrate into existing editors. They trade raw power for familiarity and low friction. You get help without changing how you work.&lt;/p&gt;
&lt;p&gt;Browser-based tools prioritize accessibility and collaboration but are usually more constrained in what they can access or modify.&lt;/p&gt;
&lt;h3&gt;Autonomy Spectrum&lt;/h3&gt;
&lt;p&gt;Not all AI tools act independently.&lt;/p&gt;
&lt;p&gt;Some only suggest. You decide what to accept.&lt;/p&gt;
&lt;p&gt;Some perform tasks but wait for confirmation before each step.&lt;/p&gt;
&lt;p&gt;Others operate with high autonomy. They plan multi-step changes, run commands, and verify results before handing control back to you.&lt;/p&gt;
&lt;p&gt;More autonomy can mean more leverage. It also means more responsibility. Understanding where a tool sits on this spectrum is critical for using it safely and effectively.&lt;/p&gt;
&lt;p&gt;With these dimensions in mind, the rest of the landscape becomes much easier to navigate. Each tool is a different point in this design space, optimized for different types of work and different levels of trust.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/03qpl1zwmcggbe62ikac.png&quot; alt=&quot;Each tool is a different point in this design space, optimized for different types of work and different levels of trust.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Terminal-Based AI Coding Agents&lt;/h2&gt;
&lt;p&gt;Terminal-based AI coding agents are the most powerful and, at times, the most intimidating tools in this space. They live where your code actually runs. That gives them capabilities that IDE plugins cannot match.&lt;/p&gt;
&lt;p&gt;Instead of suggesting code, these tools operate directly on your project. They can read files, modify directories, run tests, execute build commands, and interact with version control. In practice, this means they behave less like autocomplete and more like junior engineers following instructions.&lt;/p&gt;
&lt;h3&gt;Why Terminal Agents Exist&lt;/h3&gt;
&lt;p&gt;The terminal is already the control plane for software development. It is where builds run, tests fail, migrations execute, and deployments start. By placing AI here, these tools gain first-class access to the real workflow rather than a simulated one.&lt;/p&gt;
&lt;p&gt;This makes them well-suited for tasks that span many files or steps. Examples include refactoring large codebases, fixing failing test suites, scaffolding new services, or migrating configurations. These are jobs that are slow and error-prone when done manually.&lt;/p&gt;
&lt;h3&gt;Representative Tools&lt;/h3&gt;
&lt;p&gt;Tools in this category include Claude Code, Gemini CLI, OpenCode, and Qodo CLI. While they differ in implementation, they share common traits.&lt;/p&gt;
&lt;p&gt;They accept high-level goals instead of line-level instructions. They reason about the repository as a whole. They can chain actions together without repeated prompting. Many of them support approval checkpoints so you can review actions before execution.&lt;/p&gt;
&lt;p&gt;Some focus on being general-purpose agents. Others emphasize customization, allowing teams to define their own agents for reviews, testing, or compliance checks.&lt;/p&gt;
&lt;h3&gt;Strengths and Tradeoffs&lt;/h3&gt;
&lt;p&gt;The strength of terminal agents is leverage. A single prompt can replace dozens of manual steps. They are especially effective for backend, infrastructure, and data engineering work, where tasks are procedural and tool-driven.&lt;/p&gt;
&lt;p&gt;The tradeoff is risk. These tools can change many files quickly. They can run commands that alter state. Used carelessly, they can introduce subtle bugs or destructive changes.&lt;/p&gt;
&lt;p&gt;Best practice is to treat terminal agents as powerful automation tools. Keep them scoped. Review diffs. Use version control aggressively. Start with low autonomy and increase it only when trust is earned.&lt;/p&gt;
&lt;p&gt;Terminal-based agents are not for every developer or every task. But when used well, they represent one of the biggest productivity jumps in modern software development.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y678648xd3lxhzb5kq0v.png&quot; alt=&quot;Terminal-based agents are not for every developer or every task.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;AI-Native IDEs and Editors&lt;/h2&gt;
&lt;p&gt;AI-native IDEs are built around the assumption that an assistant is always present. Instead of adding AI as a feature, these tools redesign the editor itself to make planning, execution, and iteration flow through the model.&lt;/p&gt;
&lt;p&gt;This changes how development feels. You do not switch between typing code and asking for help. The conversation and the code evolve together.&lt;/p&gt;
&lt;h3&gt;What Makes an IDE AI-Native&lt;/h3&gt;
&lt;p&gt;In an AI-native IDE, the assistant has persistent awareness of the project. It understands file structure, dependencies, and recent changes without being reminded each time.&lt;/p&gt;
&lt;p&gt;These editors usually combine several capabilities in one place. You can ask the assistant to design a feature, generate code across files, run the application, and inspect the results. Some can open a browser, preview a UI, or analyze logs as part of the same workflow.&lt;/p&gt;
&lt;p&gt;Another defining trait is planning. The assistant often explains what it is going to do before doing it. This makes complex changes easier to reason about and review.&lt;/p&gt;
&lt;h3&gt;Representative Tools&lt;/h3&gt;
&lt;p&gt;Examples in this category include Cursor, Windsurf, Antigravity, and Zed.&lt;/p&gt;
&lt;p&gt;Cursor extends the familiar VS Code experience with deep repository understanding and large-scale refactoring capabilities. Windsurf emphasizes agent-driven workflows that keep developers in flow. Antigravity pushes further into full agent autonomy, allowing models to plan, build, and verify changes using integrated tools. Zed focuses on speed, collaboration, and predictive editing, blending performance with AI assistance.&lt;/p&gt;
&lt;p&gt;While their design philosophies differ, all of them treat AI as a core part of the editing experience rather than an add-on.&lt;/p&gt;
&lt;h3&gt;When an AI-Native IDE Makes Sense&lt;/h3&gt;
&lt;p&gt;These tools shine when you are building features end to end. They work well for rapid prototyping, greenfield projects, and iterative product development.&lt;/p&gt;
&lt;p&gt;They are also a good fit for solo developers or small teams, where context switching is expensive and speed matters more than strict process. For some developers, they can replace multiple tools with a single environment.&lt;/p&gt;
&lt;p&gt;The downside is commitment. Adopting an AI-native IDE often means changing editors or workflows. For teams with established tooling or strict policies, that may be a barrier.&lt;/p&gt;
&lt;p&gt;When the fit is right, though, AI-native IDEs offer a glimpse of what development looks like when the assistant is not a helper, but a constant collaborator.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w9d3kizpil6o489pm73q.png&quot; alt=&quot;When the fit is right, though, AI-native IDEs offer a glimpse of what development looks like when the assistant is not a helper, but a constant collaborator.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;AI Assistants Embedded in Traditional IDEs&lt;/h2&gt;
&lt;p&gt;Not every developer wants to change editors or rethink their workflow. For many teams, the most practical entry point into AI-assisted coding is through tools that integrate directly into existing IDEs.&lt;/p&gt;
&lt;p&gt;These assistants focus on augmentation rather than replacement. They enhance familiar environments with AI capabilities while preserving established habits, shortcuts, and extensions.&lt;/p&gt;
&lt;h3&gt;The Copilot Model&lt;/h3&gt;
&lt;p&gt;This category is defined by inline assistance. The AI observes the code you are writing and offers suggestions in real time. You remain in control, accepting or rejecting changes as you go.&lt;/p&gt;
&lt;p&gt;Most tools in this group also include a chat interface. This allows you to ask questions about your code, request explanations, generate tests, or debug errors without leaving the editor. The interaction is conversational, but the execution remains manual.&lt;/p&gt;
&lt;p&gt;The emphasis is on incremental gains. These tools aim to make each coding session smoother rather than automate entire tasks.&lt;/p&gt;
&lt;h3&gt;Representative Tools&lt;/h3&gt;
&lt;p&gt;GitHub Copilot is the most well-known example. Others include Amazon Q Developer (which absorbed CodeWhisperer), JetBrains AI Assistant, Tabnine, and Replit Ghostwriter.&lt;/p&gt;
&lt;p&gt;These tools support a wide range of IDEs such as VS Code, JetBrains products, and browser-based environments. They tend to work across many programming languages and frameworks, making them broadly applicable.&lt;/p&gt;
&lt;p&gt;Some lean toward individual productivity. Others emphasize enterprise features like policy enforcement, auditability, and security scanning.&lt;/p&gt;
&lt;h3&gt;Strengths and Limitations&lt;/h3&gt;
&lt;p&gt;The biggest strength of IDE-embedded assistants is low friction. Developers can adopt them with minimal change and see immediate benefits. They are well suited for day-to-day coding, learning new APIs, and reducing repetitive work.&lt;/p&gt;
&lt;p&gt;Their limitation is scope. They usually do not plan or execute multi-step changes on their own, and most lack direct access to the terminal and external tools, which limits their autonomy.&lt;/p&gt;
&lt;p&gt;For many teams, this is a feature, not a flaw. Embedded assistants provide a safe, predictable way to bring AI into the development process without surrendering control.&lt;/p&gt;
&lt;p&gt;They are often the right choice when consistency, governance, and gradual adoption matter more than maximum automation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzndqmqrsziyr76gbn78.png&quot; alt=&quot;Embedded assistants provide a safe, predictable way to bring AI into the development process without surrendering control.&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a0176p2xrqupus1utebh.png&quot; alt=&quot;Comparison of Approaches&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Pricing Models and Economic Tradeoffs&lt;/h2&gt;
&lt;p&gt;AI-assisted coding tools vary widely in their pricing. Understanding these models is important, because cost often scales with autonomy, context size, and usage intensity.&lt;/p&gt;
&lt;p&gt;What looks inexpensive at first can become costly at scale. What looks expensive may replace significant engineering time.&lt;/p&gt;
&lt;h3&gt;Common Pricing Patterns&lt;/h3&gt;
&lt;p&gt;One common approach is free or freemium access for individuals. These tiers usually offer limited usage, smaller context windows, or restricted agent capabilities. They are designed to encourage experimentation and personal use.&lt;/p&gt;
&lt;p&gt;Another model is flat monthly subscriptions per developer. This is common for IDE plugins and AI-native editors. In exchange for a predictable cost, you get higher usage limits, access to stronger models, and better performance.&lt;/p&gt;
&lt;p&gt;Agentic tools often introduce credit-based pricing. Each task or action consumes credits based on model usage, context size, and tool execution. This aligns cost with work performed but requires more monitoring.&lt;/p&gt;
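&lt;p&gt;A rough sketch of the arithmetic makes the model concrete. The rates below are invented for illustration; real vendors price tokens and tool calls differently.&lt;/p&gt;

```python
# Sketch: how credit-based pricing ties cost to work performed.
# Both rates are made up for illustration.
CREDITS_PER_1K_TOKENS = 1.0   # hypothetical model-usage rate
CREDITS_PER_TOOL_CALL = 0.5   # hypothetical tool-execution rate

def task_credits(prompt_tokens, completion_tokens, tool_calls):
    token_cost = (prompt_tokens + completion_tokens) / 1000 * CREDITS_PER_1K_TOKENS
    return token_cost + tool_calls * CREDITS_PER_TOOL_CALL

# A refactor that reads 12k tokens of context, writes 3k, and runs 4 tools:
cost = task_credits(12_000, 3_000, 4)   # 15.0 + 2.0 = 17.0 credits
```

&lt;p&gt;Under this kind of scheme, context size and autonomy drive cost directly, which is why heavy agentic use needs monitoring.&lt;/p&gt;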
&lt;p&gt;Enterprise plans layer on governance features. These include audit logs, centralized billing, access controls, and private deployments. Pricing here reflects not just usage, but risk reduction and compliance.&lt;/p&gt;
&lt;h3&gt;Cost vs Capability Tradeoffs&lt;/h3&gt;
&lt;p&gt;More powerful tools cost more because they do more. Large context windows, multi-file reasoning, and autonomous execution all increase compute usage.&lt;/p&gt;
&lt;p&gt;Autocomplete-focused tools are usually the cheapest. Agent-based systems are the most expensive, especially when used heavily.&lt;/p&gt;
&lt;p&gt;Another factor is model flexibility. Tools that allow you to bring your own API keys shift costs directly to the underlying model provider. This can be cheaper or more expensive depending on how you use them.&lt;/p&gt;
&lt;p&gt;The right question is not “which tool is cheapest,” but “which tool replaces the most manual effort for my work.”&lt;/p&gt;
&lt;h3&gt;Individual vs Team Economics&lt;/h3&gt;
&lt;p&gt;For individuals, free tiers and modest subscriptions often deliver outsized value. Even small time savings justify the cost.&lt;/p&gt;
&lt;p&gt;For teams, the equation changes. A tool that saves minutes per developer per day may justify its cost. One that automates entire workflows may justify much more, but only if guardrails are in place.&lt;/p&gt;
&lt;p&gt;Understanding pricing early helps avoid mismatches between expectations, usage, and budget. AI tools are productivity multipliers, but only when their costs align with how they are used.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g3yd90xw964xkly6ybd1.png&quot; alt=&quot;Understanding pricing early helps avoid mismatches between expectations, usage, and budget.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Workflow Patterns Enabled by AI Coding Tools&lt;/h2&gt;
&lt;p&gt;The real impact of AI-assisted coding is not in individual features, but in how workflows change. Once these tools are part of daily work, the structure of development itself begins to shift.&lt;/p&gt;
&lt;p&gt;Instead of writing everything by hand, developers increasingly describe intent, review outcomes, and refine results.&lt;/p&gt;
&lt;h3&gt;Common AI-Driven Workflows&lt;/h3&gt;
&lt;p&gt;One of the most common patterns is assisted implementation. Developers sketch function signatures or write descriptive comments, then let the AI fill in the logic. This is especially effective for boilerplate, data transformations, and repetitive patterns.&lt;/p&gt;
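&lt;p&gt;The pattern looks like this in practice. The developer writes only the signature and docstring; the body below is the kind of completion an assistant produces from that specification. The function is an invented example, not output from any particular tool.&lt;/p&gt;

```python
# Sketch: the "assisted implementation" pattern.
# Developer-written part: the signature and the docstring.
def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace to single spaces and strip the ends."""
    # Assistant-completed part: the body, filled in from the docstring alone.
    return " ".join(text.split())
```

&lt;p&gt;The clearer the docstring, the less correction the result needs, which is why this pattern rewards writing intent down before writing code.&lt;/p&gt;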
&lt;p&gt;Debugging is another strong use case. AI tools can explain error messages, trace logic across files, and suggest fixes based on context. This reduces time spent searching documentation or past issues.&lt;/p&gt;
&lt;p&gt;Test and documentation generation have also become routine. Many teams now generate unit tests, integration tests, and API docs as part of normal development, not as an afterthought.&lt;/p&gt;
&lt;h3&gt;Agentic Workflows&lt;/h3&gt;
&lt;p&gt;Agentic tools enable workflows that were previously impractical.&lt;/p&gt;
&lt;p&gt;A single prompt can scaffold a new service, refactor an entire module, or migrate configurations across environments. The agent plans the steps, applies changes, and verifies results before returning control.&lt;/p&gt;
&lt;p&gt;These workflows work best when tasks are well-scoped and repeatable. Infrastructure changes, dependency upgrades, and large-scale refactors are strong candidates.&lt;/p&gt;
&lt;p&gt;The key is oversight. Developers define the goal and constraints, then review the agent’s output carefully. Agentic workflows reward clarity and discipline.&lt;/p&gt;
&lt;h3&gt;Shifting the Role of the Developer&lt;/h3&gt;
&lt;p&gt;As AI takes on more mechanical work, the developer’s role shifts toward design, review, and decision-making.&lt;/p&gt;
&lt;p&gt;Time moves away from syntax and toward intent. Understanding systems and tradeoffs becomes more valuable than memorizing APIs.&lt;/p&gt;
&lt;p&gt;Teams that adapt their workflows intentionally see the biggest gains. Those that treat AI as a novelty often see uneven results.&lt;/p&gt;
&lt;p&gt;AI does not remove the need for good engineering practices. It amplifies them.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7an09rcgy0wv0fjt6n9q.png&quot; alt=&quot;AI does not remove the need for good engineering practices. It amplifies them.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Skills Developers Need in the AI Coding Era&lt;/h2&gt;
&lt;p&gt;AI-assisted coding changes what it means to be effective as a developer. The most valuable skills are shifting away from speed of typing and toward clarity of thinking.&lt;/p&gt;
&lt;p&gt;Using these tools well is not about tricks. It is about communication, judgment, and system-level understanding.&lt;/p&gt;
&lt;h3&gt;Prompting as Specification&lt;/h3&gt;
&lt;p&gt;Prompting is best understood as writing specifications in natural language.&lt;/p&gt;
&lt;p&gt;Clear prompts describe intent, constraints, and context. Vague prompts produce vague results. The best outcomes come from treating the AI like a teammate who needs good requirements.&lt;/p&gt;
&lt;p&gt;Effective developers iterate. They refine prompts based on output, correct assumptions, and narrow scope. This feedback loop is fast, but it still requires attention.&lt;/p&gt;
&lt;h3&gt;Review and Verification&lt;/h3&gt;
&lt;p&gt;AI-generated code must be reviewed like any other contribution.&lt;/p&gt;
&lt;p&gt;Developers need to read diffs carefully, understand the logic, and verify behavior with tests. Blind trust leads to subtle bugs and security issues.&lt;/p&gt;
&lt;p&gt;Knowing how to ask the AI to explain its choices is a useful verification technique. If the explanation does not make sense, the code likely does not either.&lt;/p&gt;
&lt;h3&gt;System Thinking and Constraints&lt;/h3&gt;
&lt;p&gt;AI tools are strongest when they understand the system they are working in.&lt;/p&gt;
&lt;p&gt;Developers who can explain architecture, performance constraints, and operational requirements get better results. This includes knowing what not to automate.&lt;/p&gt;
&lt;p&gt;The more autonomy a tool has, the more important boundaries become. Skilled developers define those boundaries clearly.&lt;/p&gt;
&lt;p&gt;In the AI coding era, judgment matters more than ever. The tools move fast. It is the developer’s responsibility to steer them well.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7njfidnjgz1mht7dym4q.png&quot; alt=&quot;The more autonomy a tool has, the more important boundaries become. Skilled developers define those boundaries clearly.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Security, Privacy, and Governance Considerations&lt;/h2&gt;
&lt;p&gt;As AI coding tools gain access to repositories, terminals, and infrastructure, security and governance move from secondary concerns to first-order design questions.&lt;/p&gt;
&lt;p&gt;The risks are not hypothetical. These tools can read proprietary code, modify critical systems, and generate output that looks correct but is not.&lt;/p&gt;
&lt;h3&gt;Code and Data Exposure&lt;/h3&gt;
&lt;p&gt;Most AI tools rely on remote models. This means code or prompts may leave your local environment.&lt;/p&gt;
&lt;p&gt;Developers and teams must understand what data is sent, how long it is retained, and whether it is used for training. Some tools explicitly guarantee no training on customer code. Others allow opt-outs or require enterprise agreements.&lt;/p&gt;
&lt;p&gt;For sensitive environments, tools that support local models or on-prem deployment reduce exposure. This often comes at the cost of convenience or model quality.&lt;/p&gt;
&lt;h3&gt;Autonomy and Guardrails&lt;/h3&gt;
&lt;p&gt;Agentic tools increase risk by design. They can execute commands, modify configurations, and affect production systems.&lt;/p&gt;
&lt;p&gt;Guardrails are essential. These include confirmation prompts, restricted permissions, read-only modes, and sandboxed environments. Version control is a non-negotiable safety net.&lt;/p&gt;
&lt;p&gt;The goal is not to eliminate autonomy, but to scope it carefully.&lt;/p&gt;
&lt;h3&gt;Organizational Governance&lt;/h3&gt;
&lt;p&gt;For teams, governance features matter as much as raw capability.&lt;/p&gt;
&lt;p&gt;Audit logs, access controls, usage monitoring, and policy enforcement help organizations understand how AI tools are being used. They also help prevent accidental misuse.&lt;/p&gt;
&lt;p&gt;Clear guidelines reduce risk. Teams should define which tools are allowed, what data they can access, and what level of autonomy is acceptable.&lt;/p&gt;
&lt;p&gt;AI-assisted coding can be safe and effective. It requires intentional design, not blind adoption.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5n7d6ntsqu6ybadaqjda.png&quot; alt=&quot;AI-assisted coding can be safe and effective. It requires intentional design, not blind adoption.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;How to Choose the Right Tool for You&lt;/h2&gt;
&lt;p&gt;With so many options, choosing an AI coding tool can feel overwhelming. The key is to match the tool to your role, environment, and tolerance for change.&lt;/p&gt;
&lt;p&gt;There is no universal best choice. There is only what fits your work.&lt;/p&gt;
&lt;h3&gt;By Role&lt;/h3&gt;
&lt;p&gt;Solo developers often benefit from AI-native IDEs or terminal agents. These tools reduce context switching and accelerate end-to-end work. They are well suited for prototyping, side projects, and greenfield development.&lt;/p&gt;
&lt;p&gt;Backend and platform engineers often gain the most from terminal-based agents. These tools align naturally with scripting, automation, and infrastructure tasks.&lt;/p&gt;
&lt;p&gt;Frontend and product-focused developers may prefer AI-native editors or IDE plugins that emphasize iteration, previews, and refactoring.&lt;/p&gt;
&lt;p&gt;Teams working in large codebases often start with IDE-embedded assistants. These tools improve productivity without disrupting existing processes.&lt;/p&gt;
&lt;h3&gt;By Environment&lt;/h3&gt;
&lt;p&gt;Startups and small teams can afford to experiment. Speed and leverage matter more than strict controls, making agentic tools attractive.&lt;/p&gt;
&lt;p&gt;Enterprises prioritize predictability and governance. Tools with clear data policies, audit logs, and controlled autonomy are easier to adopt.&lt;/p&gt;
&lt;p&gt;Highly regulated environments may require on-prem models or strict data isolation. This narrows the field but reduces risk.&lt;/p&gt;
&lt;h3&gt;By Autonomy and Trust&lt;/h3&gt;
&lt;p&gt;If you are new to AI-assisted coding, start with tools that suggest rather than act. Build intuition and confidence before increasing autonomy.&lt;/p&gt;
&lt;p&gt;As trust grows, introduce agents for well-scoped tasks. Avoid full autonomy in critical systems until guardrails are proven.&lt;/p&gt;
&lt;p&gt;The best choice is one that fits your current needs and can evolve with your workflow. AI tools are not static. Your adoption strategy should not be either.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e1ys1lpbs4chio7eyscv.png&quot; alt=&quot;The best choice is one that fits your current needs and can evolve with your workflow.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;The Future of AI-Assisted Coding&lt;/h2&gt;
&lt;p&gt;AI-assisted coding is still early, but the direction is clear. These tools are moving from helpers to participants in the development process.&lt;/p&gt;
&lt;p&gt;The distinction between editor, assistant, and agent is already starting to blur.&lt;/p&gt;
&lt;h3&gt;Convergence of Tools&lt;/h3&gt;
&lt;p&gt;IDE plugins are gaining agentic capabilities. Terminal agents are adding richer interfaces. AI-native IDEs are absorbing features from both.&lt;/p&gt;
&lt;p&gt;Over time, the market will likely converge around flexible systems that can operate at different levels of autonomy depending on context. One tool may act as an autocomplete engine in one moment and an autonomous agent in the next.&lt;/p&gt;
&lt;h3&gt;Interoperability and Protocols&lt;/h3&gt;
&lt;p&gt;As AI tools grow more capable, interoperability becomes essential.&lt;/p&gt;
&lt;p&gt;Standards for tool access, context sharing, and action execution are emerging. These allow models to interact with editors, terminals, and external systems in consistent ways.&lt;/p&gt;
&lt;p&gt;This reduces lock-in and makes it easier to mix tools, models, and workflows.&lt;/p&gt;
&lt;h3&gt;AI as a First-Class Team Member&lt;/h3&gt;
&lt;p&gt;The long-term shift is conceptual.&lt;/p&gt;
&lt;p&gt;AI tools are evolving from passive assistants into collaborators that can plan work, execute tasks, and verify results. This does not remove the need for human developers. It changes where their effort is spent.&lt;/p&gt;
&lt;p&gt;Design, judgment, and accountability remain human responsibilities. Execution increasingly becomes shared.&lt;/p&gt;
&lt;p&gt;The future of software development is not fully automated. It is more leveraged, more intentional, and more collaborative.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3rbb00ghgypc0swt90o4.png&quot; alt=&quot;The future of software development is not fully automated. It is more leveraged, more intentional, and more collaborative.&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Sample Prompts to Get Started&lt;/h2&gt;
&lt;p&gt;One of the hardest parts of using AI coding tools for the first time is knowing what to ask. The prompts below are designed to be simple, low-risk, and useful across most tools, whether you are using a terminal agent, an AI-native IDE, or an IDE plugin.&lt;/p&gt;
&lt;p&gt;Each prompt focuses on building or modifying something small while helping you learn how the tool behaves.&lt;/p&gt;
&lt;h3&gt;Prompt 1: Create a Simple Project Skeleton&lt;/h3&gt;
&lt;p&gt;Use this to test repo awareness and file creation.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Create a simple Python project for a command-line tool.&lt;/p&gt;
&lt;p&gt;It should include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A README&lt;/li&gt;
&lt;li&gt;A main entry file&lt;/li&gt;
&lt;li&gt;A basic argument parser&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do not add extra features.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This prompt helps you see how the tool structures files and how much initiative it takes.&lt;/p&gt;
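&lt;p&gt;For reference, a minimal entry file for this skeleton might look like the sketch below. The tool name and the single argument are illustrative assumptions, not the output of any particular AI tool:&lt;/p&gt;

```python
# main.py — a plausible, minimal entry file for the skeleton above.
# The program name ("mytool") and the "name" argument are illustrative
# assumptions, not what any specific AI tool would emit.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        prog="mytool",
        description="A simple command-line tool",
    )
    parser.add_argument("name", help="Name to greet")
    args = parser.parse_args()
    print(f"Hello, {args.name}!")


if __name__ == "__main__":
    main()
```

&lt;p&gt;Comparing the tool&apos;s output against a baseline like this makes it easy to spot extra features it added on its own initiative.&lt;/p&gt;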
&lt;h3&gt;Prompt 2: Implement a Small Feature From a Description&lt;/h3&gt;
&lt;p&gt;Use this to test code generation quality.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add a function that reads a CSV file and prints the top 5 rows.&lt;/p&gt;
&lt;p&gt;Assume the file path is passed as a command-line argument.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This works well in IDE plugins and editors. Review the code carefully and run it.&lt;/p&gt;
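&lt;p&gt;The generated code should land somewhere in the spirit of this sketch (the exact structure varies by tool and model; the function name here is an assumption):&lt;/p&gt;

```python
# A sketch of the kind of code this prompt might produce.
# The function name is an illustrative assumption.
import argparse
import csv


def print_top_rows(path: str, n: int = 5) -> None:
    """Print the first n rows of a CSV file, header included."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        for i, row in enumerate(reader):
            if i >= n:  # stop after the top n rows
                break
            print(", ".join(row))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Print the top rows of a CSV file")
    parser.add_argument("path", help="Path to the CSV file")
    args = parser.parse_args()
    print_top_rows(args.path)
```

&lt;p&gt;When reviewing the real output, check how it handles the details this prompt left ambiguous: missing files, files shorter than five rows, and whether the header counts as a row.&lt;/p&gt;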
&lt;h3&gt;Prompt 3: Explain Existing Code&lt;/h3&gt;
&lt;p&gt;Use this to test understanding and explanation.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Explain what this function does and identify any edge cases.&lt;/p&gt;
&lt;p&gt;Keep the explanation concise.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is useful for learning unfamiliar code and validating AI understanding.&lt;/p&gt;
&lt;h3&gt;Prompt 4: Generate Tests&lt;/h3&gt;
&lt;p&gt;Use this to test correctness and coverage.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Write unit tests for this function.&lt;/p&gt;
&lt;p&gt;Use the existing testing framework.&lt;/p&gt;
&lt;p&gt;Cover normal cases and one edge case.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This helps establish a review habit and reinforces test-driven thinking.&lt;/p&gt;
&lt;h3&gt;Prompt 5: Refactor for Clarity&lt;/h3&gt;
&lt;p&gt;Use this to test refactoring behavior.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Refactor this code to improve readability.&lt;/p&gt;
&lt;p&gt;Do not change behavior.&lt;/p&gt;
&lt;p&gt;Keep the logic explicit.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Compare the diff to ensure intent is preserved.&lt;/p&gt;
&lt;h3&gt;Prompt 6: Simple Agentic Task (Terminal or AI-Native IDE)&lt;/h3&gt;
&lt;p&gt;Use this to test safe autonomy.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Add basic logging to this application.&lt;/p&gt;
&lt;p&gt;Use the existing logging library.&lt;/p&gt;
&lt;p&gt;Show me the changes before committing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This prompt checks whether the agent plans steps and respects boundaries.&lt;/p&gt;
&lt;h3&gt;Prompt 7: Debug a Failure&lt;/h3&gt;
&lt;p&gt;Use this to test reasoning.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This test is failing.&lt;/p&gt;
&lt;p&gt;Explain why, then propose a fix.&lt;/p&gt;
&lt;p&gt;Do not apply the fix yet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Only apply changes after reviewing the explanation.&lt;/p&gt;
&lt;h3&gt;How to Use These Prompts Safely&lt;/h3&gt;
&lt;p&gt;Start small. Run tools in a clean project or branch. Review every change.&lt;/p&gt;
&lt;p&gt;Pay attention to how the tool interprets ambiguity. If results are surprising, refine the prompt rather than forcing acceptance.&lt;/p&gt;
&lt;p&gt;Good prompts are clear, scoped, and explicit about constraints. Treat them like lightweight specifications.&lt;/p&gt;
&lt;p&gt;These examples are not about speed. They are about learning how the tool thinks before trusting it with more responsibility.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;AI-assisted coding is no longer a single category of tools. It is an ecosystem with distinct approaches, tradeoffs, and philosophies.&lt;/p&gt;
&lt;p&gt;Terminal agents offer raw power and automation. AI-native IDEs rethink how development flows. IDE-embedded assistants provide steady gains with minimal disruption. Each has a place, and each serves different kinds of work.&lt;/p&gt;
&lt;p&gt;The most important takeaway is intentionality. The value of these tools depends less on which one you choose and more on how you use it. Clear goals, strong review practices, and appropriate guardrails matter more than novelty.&lt;/p&gt;
&lt;p&gt;AI does not replace good engineering. It rewards it.&lt;/p&gt;
&lt;p&gt;Developers who understand their systems, communicate intent clearly, and exercise judgment will see the greatest benefit. Those who treat AI as a shortcut risk confusion and fragility.&lt;/p&gt;
&lt;p&gt;The opportunity is significant. Used well, AI-assisted coding can reduce toil, accelerate learning, and free time for higher-level thinking. The tools are ready. The challenge now is to use them well.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building Pangolin - My Holiday Break, an AI IDE, and a Lakehouse Catalog for the Curious</title><link>https://iceberglakehouse.com/posts/2026-01-the-story-of-pangolin-catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-01-the-story-of-pangolin-catalog/</guid><description>
**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](ht...</description><pubDate>Thu, 15 Jan 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2&gt;1. Introduction: A Holiday, an Agent, and an Idea&lt;/h2&gt;
&lt;p&gt;In December 2025, Google released something that changed how I code—&lt;strong&gt;Antigravity IDE&lt;/strong&gt;. It wasn’t just another editor. It came packed with AI agents that could write code, test it, refactor it, and even debug alongside you. Naturally, I had to try it out.&lt;/p&gt;
&lt;p&gt;I didn’t jump right into building a big project. Instead, I used it to make some tooling for &lt;a href=&quot;https://www.dremio.com&quot;&gt;Dremio&lt;/a&gt; and &lt;a href=&quot;https://iceberg.apache.org&quot;&gt;Apache Iceberg&lt;/a&gt;, both technologies I work with frequently. That experience set the foundation for something bigger: &lt;a href=&quot;https://pangolincatalog.org&quot;&gt;&lt;strong&gt;Pangolin&lt;/strong&gt;&lt;/a&gt;, an open-source, feature-rich lakehouse catalog.&lt;/p&gt;
&lt;p&gt;This blog tells the story of how Pangolin came to be. It’s not a pitch for production use. It’s a working concept, a glimpse into what’s possible.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/ZnXcL8e.png&quot; alt=&quot;The Pangolin Journey Begins&quot;&gt;&lt;/p&gt;
&lt;h2&gt;2. First Steps: Learning to Trust the Agent&lt;/h2&gt;
&lt;p&gt;Before Pangolin, I started small. I needed to understand how to work with the Antigravity coding agent in a way that felt predictable and collaborative. So I created four tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-cloud-dremioframe&quot;&gt;&lt;strong&gt;dremioframe&lt;/strong&gt;&lt;/a&gt;: A DataFrame-style API for building Dremio SQL queries in Python.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/AlexMercedCoder/iceframe&quot;&gt;&lt;strong&gt;iceframe&lt;/strong&gt;&lt;/a&gt;: A similar API, but for building Iceberg-compatible queries using local compute.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-python-cli/blob/main/readme.md&quot;&gt;&lt;strong&gt;dremio-cli&lt;/strong&gt;&lt;/a&gt;: A command-line tool for interacting with Dremio.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/AlexMercedCoder/iceberg-cli&quot;&gt;&lt;strong&gt;iceberg-cli&lt;/strong&gt;&lt;/a&gt;: A CLI that filled in the gaps left by &lt;code&gt;pyiceberg&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tools weren’t just functional; they became learning tools. I practiced writing clear prompts, specifying inputs and outputs, and most importantly, asking the agent to generate and refine unit, live, and regression tests. I also got better at pushing back when something didn’t work.&lt;/p&gt;
&lt;p&gt;Once I felt confident in that workflow (writing specs, prompting the agent, challenging assumptions, and getting results), I was ready to build something bigger.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/CcYEQjR.png&quot; alt=&quot;Learning to Work with Google&apos;s Antigravity&quot;&gt;&lt;/p&gt;
&lt;h2&gt;3. Rethinking the Lakehouse Catalog&lt;/h2&gt;
&lt;p&gt;Catalogs are central to the Iceberg ecosystem. They’re how engines discover, manage, and track tables. But most catalogs out there focus on either infrastructure or metadata—not both.&lt;/p&gt;
&lt;p&gt;Some great projects inspired Pangolin:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://projectnessie.org&quot;&gt;&lt;strong&gt;Project Nessie&lt;/strong&gt;&lt;/a&gt;: Created at Dremio, Nessie brought Git-like versioning to data catalogs. It’s a brilliant idea that still powers tools like &lt;a href=&quot;https://www.bauplanlabs.com&quot;&gt;Bauplan&lt;/a&gt;. But Nessie doesn’t support features like multi-tenancy or catalog federation.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://polaris.apache.org&quot;&gt;&lt;strong&gt;Apache Polaris&lt;/strong&gt;&lt;/a&gt;: Polaris, co-created by Dremio and Snowflake and now an Apache Incubator project, is well on its way to becoming the open standard. It supports RBAC, catalog federation, generic assets, and upcoming table services that proxy metadata processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business metadata platforms (DataHub, Atlan, Collibra, etc.)&lt;/strong&gt;: These tools focus on discovery and access workflows, and some now support Iceberg. But they bolt onto a catalog—they don’t start as one.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That got me thinking: &lt;em&gt;What if a single open source catalog could do it all?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Pangolin became my experiment to find out.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/vAYnnV5.png&quot; alt=&quot;What if a data lakehouse catalog had it all?&quot;&gt;&lt;/p&gt;
&lt;h2&gt;4. Feature List: The Dream Catalog&lt;/h2&gt;
&lt;p&gt;Before writing a single line of code, I wrote down everything I wanted this catalog to do, the features I admired in other tools, the gaps I noticed, and a few experiments I just wanted to try.&lt;/p&gt;
&lt;p&gt;Here’s what ended up on the list:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Catalog versioning&lt;/strong&gt;, with support for branching and merging, where branches can be scoped so they don&apos;t have to affect every table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog federation&lt;/strong&gt;, so one catalog can reference others.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generic asset support&lt;/strong&gt;, to register Delta tables, CSV datasets, or even external databases alongside Iceberg tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business metadata&lt;/strong&gt;, including access requests and grant workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-tenancy&lt;/strong&gt;, so each team can work in its own isolated space.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RBAC and TBAC (tag-based access control)&lt;/strong&gt;, to control access based on roles and tags.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No-auth mode&lt;/strong&gt;, to make it easy to spin up and test locally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credential vending&lt;/strong&gt;, with built-in support for AWS, Azure, GCP, and S3-compatible systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pluggable backends&lt;/strong&gt;, starting with PostgreSQL and MongoDB for metadata persistence.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s a lot. But I didn’t set out to build a polished product—I just wanted to see if it was possible.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/4hxzR01.png&quot; alt=&quot;The features I wanted for Pangolin Catalog&quot;&gt;&lt;/p&gt;
&lt;h2&gt;5. Choosing the Stack: Why Rust, Python, and Svelte&lt;/h2&gt;
&lt;p&gt;With the feature list in hand, the next decision was the tech stack. I know Python and JavaScript like the back of my hand, which would’ve made it easy to move fast. But I wanted something that would scale better—and maybe be a little less error-prone.&lt;/p&gt;
&lt;p&gt;I considered three languages for the backend: &lt;strong&gt;Java&lt;/strong&gt;, &lt;strong&gt;Go&lt;/strong&gt;, and &lt;strong&gt;Rust&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Java is the standard in the data world. But writing clean, scalable Java means understanding the JVM inside and out. I know it—but not enough to move quickly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Go is simple and efficient. Rust is strict and safe. Between the two, I picked &lt;strong&gt;Rust&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Rust’s compiler errors are frustrating at first but turn into a superpower. The strong typing and detailed feedback also pair well with AI agents; errors are easier to reason about and fix through prompting.&lt;/p&gt;
&lt;p&gt;For the rest of the stack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rust&lt;/strong&gt; powers the backend and CLI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python&lt;/strong&gt; powers the SDK and scripting layer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Svelte&lt;/strong&gt; powers the UI—lightweight and reactive, but more complex than I expected once the feature count grew.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All in, I ended up with a full stack that balanced experimentation and real-world usability. The only problem was... building it all over a holiday break.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/oiv9XeG.png&quot; alt=&quot;The Tech Chosen to Build Pangolin Catalog&quot;&gt;&lt;/p&gt;
&lt;h2&gt;6. Building It: 100 Hours, Three Interfaces, and a Lot of Feedback Loops&lt;/h2&gt;
&lt;p&gt;Once I committed to the stack, the pace picked up fast. I spent roughly 100 hours on Pangolin, which ended up taking most of my holiday break. The backend came together first, followed by the Rust-based CLI and then the Python SDK.&lt;/p&gt;
&lt;p&gt;The backend covered all the core ideas: catalogs, tenants, assets, access rules, and credential vending. Rust helped here. The compiler forced clarity. Each time something felt vague, the type system pushed back until the design made sense.&lt;/p&gt;
&lt;p&gt;The Python SDK turned out better than I expected. It didn’t just wrap the API. It made some features practical. Generic assets are a good example. Through the SDK, those assets became usable for sharing database connections, Delta tables, CSV datasets, and other non-Iceberg data without much friction.&lt;/p&gt;
&lt;p&gt;The hardest part was the UI.&lt;/p&gt;
&lt;p&gt;With so many features, state management became tricky fast. I used Antigravity’s browser agent early on, and it helped catch basic issues. Once the UI grew more complex, manual testing worked better. I spent a lot of time clicking through edge cases, capturing network requests, reading console errors, and feeding that context back to the agent. It was slower, but it worked.&lt;/p&gt;
&lt;p&gt;By the end, Pangolin had three real interfaces: a Rust CLI, a Python SDK, and a Svelte UI. All of them worked against the same API and feature set.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/AwXP27Y.png&quot; alt=&quot;100 hours developing Pangolin Catalog&quot;&gt;&lt;/p&gt;
&lt;h2&gt;7. What Pangolin Is—and What It Isn’t&lt;/h2&gt;
&lt;p&gt;Pangolin exists. You can run it. You can click around, create catalogs, register assets, request access, and vend credentials across clouds.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/yWDNUcd.png&quot; alt=&quot;Pangolin Catalog Exists&quot;&gt;&lt;/p&gt;
&lt;p&gt;That said, I don’t see Pangolin as a production catalog. I don’t plan to invest heavily beyond bug fixes and minor improvements. For a truly open, production-ready lakehouse catalog, Apache Polaris is still the best option today. If you want a managed path, platforms like Dremio Catalog, which build on Polaris, handle the complex parts for you.&lt;/p&gt;
&lt;p&gt;Pangolin serves a different purpose. It’s a proof of concept. It shows what can happen when a community-oriented project tries to bring versioning, federation, governance, business metadata, and access workflows together in one place.&lt;/p&gt;
&lt;p&gt;If you’re a lakehouse nerd like me, Pangolin might be fun to explore. If it sparks ideas or nudges other projects to co-locate these features sooner, then it did its job.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.alexmerced.com/data&quot;&gt;Make sure to follow me on LinkedIn and Substack&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/81PKZJp.png&quot; alt=&quot;Pangolin Catalog is a Question Made Real&quot;&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Are Recursive Language Models?</title><link>https://iceberglakehouse.com/posts/2026-01-recursive-langauge-models/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2026-01-recursive-langauge-models/</guid><description>
**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](ht...</description><pubDate>Sat, 10 Jan 2026 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Recursive Language Models (RLMs) are language models that call themselves.&lt;/p&gt;
&lt;p&gt;That sounds strange at first—but the idea is simple. Instead of answering a question in one go, an RLM breaks the task into smaller parts, then asks itself those sub-questions. It builds the answer step by step, using structured function calls along the way.&lt;/p&gt;
&lt;p&gt;This is different from how standard LLMs work. A typical model tries to predict the full response directly from a prompt. If the task has multiple steps, it has to manage them all in a single stream of text. That can work for short tasks, but it often falls apart when the model needs to remember intermediate results or reuse the same logic multiple times.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xqzdf2stxn61a1jhjcd6.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;RLMs don’t try to do everything at once. They write and execute structured calls—like &lt;code&gt;CALL(&amp;quot;question&amp;quot;, args)&lt;/code&gt;—inside their own output. The system sees this call, pauses the main response, evaluates the subtask, then inserts the result and continues. It’s a recursive loop: the model is both the planner and the executor.&lt;/p&gt;
&lt;p&gt;This gives RLMs a kind of dynamic memory and control flow. They can stop, plan, re-enter themselves with new input, and combine results. That’s what makes them powerful—and fundamentally different from the static prompting methods most models use today.&lt;/p&gt;
&lt;h2&gt;What Problem Do RLMs Solve?&lt;/h2&gt;
&lt;p&gt;Language models are good at sounding smart. But when the task involves multiple steps, especially ones that depend on each other, standard models often fail.&lt;/p&gt;
&lt;p&gt;Why? Because they generate everything in a straight line.&lt;/p&gt;
&lt;p&gt;If you ask a regular LLM to solve a logic puzzle, it has to juggle the entire solution in one pass. There’s no mechanism to stop, break the task apart, and reuse parts of its own reasoning. It has no structure—just one long stream of text.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nqyodm8imk6zrniirz3h.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Prompt engineering helps, but only up to a point. You can ask the model to “think step by step” or “show your work,” and that can improve results. But these tricks don’t change how the model actually runs. It still generates everything in one session, with no built-in way to modularize or reuse logic.&lt;/p&gt;
&lt;p&gt;Recursive Language Models change this. They treat complex tasks as programs. The model doesn’t just answer—it writes code-like calls to itself. Those calls are evaluated in real time, and their results are folded back into the response.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mrs3yiylirta0d7mp4b.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This lets RLMs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reuse their own logic.&lt;/li&gt;
&lt;li&gt;Focus on one part of the task at a time.&lt;/li&gt;
&lt;li&gt;Scale to deeper or more recursive problems.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, RLMs solve the structure problem. They bring composability and control into language generation—two things that most LLMs still lack.&lt;/p&gt;
&lt;h2&gt;How Do RLMs Actually Work?&lt;/h2&gt;
&lt;p&gt;At the core of Recursive Language Models is a simple but powerful loop: generate, detect, call, repeat.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/is9rohgc2g1lbt2nrvvb.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here’s how it plays out:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The model receives a prompt.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;It starts generating a response.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When it hits a subtask, it emits a structured function call&lt;/strong&gt;—something like &lt;code&gt;CALL(&amp;quot;Summarize&amp;quot;, &amp;quot;text goes here&amp;quot;)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The system pauses&lt;/strong&gt;, evaluates that call by feeding it back into the same model, and gets a result.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The result is inserted&lt;/strong&gt;, and the original response resumes.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This process can happen once—or dozens of times inside a single response.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r8r9qwefzhnhr8bxxyyb.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Let’s take a concrete example. Suppose you ask an RLM to explain a complicated technical article. Instead of trying to summarize the whole thing at once, the model might first break the article into sections. Then it could issue recursive calls to summarize each section individually. After that, it could combine those pieces into a final answer.&lt;/p&gt;
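&lt;p&gt;The loop above can be sketched in a few lines of Python. This is a toy driver, not any real RLM implementation: &lt;code&gt;model&lt;/code&gt; is a stand-in for a text-generation call, and the &lt;code&gt;CALL(...)&lt;/code&gt; syntax mirrors this article&apos;s illustrative notation:&lt;/p&gt;

```python
# Toy driver loop for a recursive language model (RLM).
# "model" is a hard-coded stand-in for a real text-generation call;
# the CALL(...) syntax is the illustrative notation used in the article.
import re

CALL_PATTERN = re.compile(r'CALL\("([^"]+)",\s*"([^"]*)"\)')


def model(prompt: str) -> str:
    # Placeholder "model": answers section-summary sub-questions directly,
    # and decomposes the top-level task into two recursive calls.
    if prompt.startswith("summarize-section:"):
        section = prompt.split(":", 1)[1]
        return f"[summary of {section}]"
    return 'CALL("summarize-section", "intro") CALL("summarize-section", "methods")'


def run_rlm(prompt: str, depth: int = 0, max_depth: int = 4) -> str:
    if depth > max_depth:  # guard against unbounded recursion
        return "[depth limit reached]"
    output = model(prompt)
    # Detect structured calls, evaluate each by re-entering the model,
    # and splice the result back into the paused response.
    while (m := CALL_PATTERN.search(output)):
        name, arg = m.group(1), m.group(2)
        result = run_rlm(f"{name}:{arg}", depth + 1, max_depth)
        output = output[:m.start()] + result + output[m.end():]
    return output


print(run_rlm("summarize the article"))
# -> [summary of intro] [summary of methods]
```

&lt;p&gt;Even in this toy form, the two roles are visible: the model plans by emitting calls, and the driver executes them by feeding each one back into the same model.&lt;/p&gt;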
&lt;p&gt;So what’s actually new here?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model isn’t just generating text. It’s &lt;em&gt;controlling execution&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Each function call is explicit and machine-readable. It’s not hidden in plain text.&lt;/li&gt;
&lt;li&gt;The model learns not just &lt;em&gt;what&lt;/em&gt; to say, but &lt;em&gt;when&lt;/em&gt; to delegate subtasks to itself.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3oreu6lmnf9ui304nhs8.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This design introduces modular reasoning. It’s closer to programming than prompting. And it’s what makes RLMs capable of solving longer, deeper, and more compositional tasks than traditional LLMs.&lt;/p&gt;
&lt;h2&gt;How Are RLMs Different From Reasoning Models?&lt;/h2&gt;
&lt;p&gt;It’s easy to confuse Recursive Language Models with models designed for reasoning. After all, both aim to solve harder, multi-step problems. But they take very different paths.&lt;/p&gt;
&lt;p&gt;Reasoning models try to think better within a fixed response. They rely on prompting tricks (“Let’s think step by step”), fine-tuning, or architectural tweaks to encourage more logical answers. But they still generate their full output in one go. There’s no built-in structure or recursion—just better text generation.&lt;/p&gt;
&lt;p&gt;Recursive Language Models go further. They change how language models &lt;em&gt;run&lt;/em&gt;, not just how they &lt;em&gt;think&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9ki8oqyq0gpy55b4kykg.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here’s the key distinction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reasoning models&lt;/strong&gt; operate in a flat, linear space. They can simulate step-by-step thinking, but they don’t control execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;RLMs&lt;/strong&gt; introduce a real control flow. They can pause, emit a sub-call, re-enter themselves, and build results incrementally.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qys887jwi9f157l5rg3q.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Think of it this way: reasoning models try to write better essays. RLMs write and run programs.&lt;/p&gt;
&lt;p&gt;This also makes RLMs easier to inspect and debug. Each recursive call is explicit. You can see the full tree of operations the model performed—what it asked, what it answered, and how it combined the results. That transparency is rare in LLM workflows, and it opens the door to more robust systems.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0akxrwtfly8hqkn5opxt.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;So while reasoning models stretch the limits of static prompting, RLMs redefine what a model can do at runtime.&lt;/p&gt;
&lt;h2&gt;Why Recursion Changes What LLMs Can Do&lt;/h2&gt;
&lt;p&gt;Recursion isn’t just a technical upgrade—it’s a shift in what language models are capable of.&lt;/p&gt;
&lt;p&gt;With recursion, models don’t have to guess the whole answer in one pass. They can build it piece by piece, reusing their own capabilities as needed. This unlocks new behaviors that standard models struggle with.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/byceucjiz5m6nl13e57b.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here’s what that looks like in practice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Logic puzzles&lt;/strong&gt;: Instead of brute-forcing a full solution, an RLM can write out each rule, evaluate sub-cases, and combine the results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Math word problems&lt;/strong&gt;: The model can break a complex problem into steps, solve each one recursively, and verify intermediate answers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code generation&lt;/strong&gt;: RLMs can draft a function, then call themselves to write test cases, fix bugs, or generate helper functions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Proof generation&lt;/strong&gt;: For theorem proving, recursion lets the model build a proof tree, checking smaller lemmas along the way.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the paper’s experiments, RLMs outperformed non-recursive baselines on multi-step benchmarks. They were also &lt;em&gt;more efficient&lt;/em&gt;. Recursive calls reduced total token usage, because the model could reuse logic instead of repeating it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q6896v8g85mmclp8ipqd.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This is a key point: recursion isn’t just about accuracy. It’s also about &lt;em&gt;efficiency&lt;/em&gt; and &lt;em&gt;composability&lt;/em&gt;. Instead of scaling linearly with problem size, RLMs can scale logarithmically by solving smaller pieces and reusing solutions.&lt;/p&gt;
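&lt;p&gt;A quick back-of-the-envelope illustration (my own, not from the paper): if a task over &lt;em&gt;n&lt;/em&gt; chunks is split in half at each recursive level, the depth of the call tree grows with the logarithm of &lt;em&gt;n&lt;/em&gt; rather than linearly.&lt;/p&gt;

```python
def recursion_depth(num_chunks, fan_out=2):
    # Levels in a balanced divide-and-combine tree over num_chunks pieces.
    # Compare with a single linear pass, which must hold all num_chunks at once.
    depth = 0
    while num_chunks > 1:
        num_chunks = -(-num_chunks // fan_out)  # ceiling division
        depth += 1
    return depth

for n in (8, 64, 1024):
    print(f'{n} chunks -> {recursion_depth(n)} recursive levels')
# 8 -> 3, 64 -> 6, 1024 -> 10
```

&lt;p&gt;Each level works on pieces small enough to solve directly, which is why depth, not total input size, becomes the limiting factor.&lt;/p&gt;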
&lt;p&gt;That makes them a better fit for tasks where reasoning depth grows quickly—exactly the kind of problems LLMs are starting to face in real-world applications.&lt;/p&gt;
&lt;h2&gt;Why This Matters Now&lt;/h2&gt;
&lt;p&gt;Language models are everywhere—but most still follow a simple pattern: input goes in, output comes out. That’s fine for quick answers or lightweight tasks. But for anything complex, it’s not enough.&lt;/p&gt;
&lt;p&gt;Today, developers are building agents, chains, and tool-using systems on top of LLMs. These wrappers simulate structure, but they’re often fragile. They rely on prompt hacking, regex parsing, and external orchestration to manage what the model can’t do natively.&lt;/p&gt;
&lt;p&gt;Recursive Language Models offer a cleaner path. Instead of bolting on structure from the outside, they build it in.&lt;/p&gt;
&lt;p&gt;This matters for a few reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fewer moving parts&lt;/strong&gt;: RLMs remove the need for external chains or custom routing logic. The model decides when and how to branch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Greater transparency&lt;/strong&gt;: Each recursive call is visible and traceable. You can audit what the model did, step by step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better generalization&lt;/strong&gt;: Once trained to use recursion, the model can apply it flexibly across domains—math, code, reasoning, even planning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And we’re just getting started. RLMs are early, but they hint at a broader shift: treating models not just as generators, but as runtime environments. That opens the door to future systems where models can plan, act, and adapt on their own, with clear structure behind every step.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/u2g8vytnrdnb5czywiiv.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;If the last few years were about making LLMs sound smart, the next few might be about making them &lt;em&gt;think&lt;/em&gt; with structure. That’s where recursion fits in.&lt;/p&gt;
&lt;h2&gt;Conclusion: A New Way to Think with Language Models&lt;/h2&gt;
&lt;p&gt;Recursive Language Models aren’t just a tweak to existing LLMs. They represent a shift in how models operate.&lt;/p&gt;
&lt;p&gt;Instead of treating every task as a one-shot prediction, RLMs break problems into parts, solve them recursively, and combine the results. That gives them something most language models still lack: structure.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59pdnsjnqmulb1dlrq53.png&quot; alt=&quot;Image description&quot;&gt;&lt;/p&gt;
&lt;p&gt;This structure matters. It makes models more reliable on complex tasks. It makes their reasoning easier to follow. And it opens the door to new capabilities—like planning, verifying, or adapting—without needing complex external systems.&lt;/p&gt;
&lt;p&gt;We’re still early in this space. But the idea is simple and powerful: give models the tools to use themselves. From there, a new class of language systems can emerge—not just fluent, but recursive, modular, and built to handle depth.&lt;/p&gt;
&lt;p&gt;RLMs don’t just make better answers. They make better models.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>2025 Year in Review Apache Iceberg, Polaris, Parquet, and Arrow</title><link>https://iceberglakehouse.com/posts/2025-12-2025-year-in-review-iceberg-arrow-polaris-parquet/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-12-2025-year-in-review-iceberg-arrow-polaris-parquet/</guid><description>
**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](ht...</description><pubDate>Mon, 29 Dec 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;The open lakehouse is no longer a concept. In 2025, key Apache projects matured, making data warehouse performance on object storage a practical reality. This post walks through the most critical developments in four of those projects: Iceberg, Polaris, Parquet, and Arrow. Each is building a critical layer for an open, engine-agnostic analytics stack.&lt;/p&gt;
&lt;p&gt;We start with Iceberg.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg&lt;/h2&gt;
&lt;p&gt;Iceberg spent 2025 delivering the core elements of Format Version 3, while setting the stage for a more indexable and cache-friendly V4 format. Its release cadence remained steady and focused. The project shipped three main versions: 1.8.0 in February, 1.9.0 in April, and 1.10.0 in September.&lt;/p&gt;
&lt;h3&gt;What shipped in 2025&lt;/h3&gt;
&lt;p&gt;Iceberg 1.8.0 introduced deletion vectors, default column values, and row-level lineage metadata. These features help engines express updates more efficiently while tracking the origin of each record.&lt;/p&gt;
&lt;p&gt;Version 1.9.0 expanded type support. Iceberg now includes a &lt;code&gt;variant&lt;/code&gt; type for semi-structured data and geospatial types for geometry-based filtering. The release also added nanosecond timestamps and improved the semantics of equality deletes.&lt;/p&gt;
&lt;p&gt;By 1.10.0, the project had added encryption key metadata, cleanup logic for orphaned delete vectors, and full compatibility with Spark 4.0. Partition statistics were made incremental to reduce overhead in large-scale table planning.&lt;/p&gt;
&lt;p&gt;These changes matter. Deletion vectors reduce the cost of updates. Default column values simplify table evolution. Variant support opens the door to querying nested JSON and evolving schemas. Together, these features make Iceberg more expressive and more efficient.&lt;/p&gt;
&lt;h3&gt;What&apos;s coming next&lt;/h3&gt;
&lt;p&gt;The community has started preparing for Format V4. Key goals include native index support and a formal caching model. The Iceberg dev list also agreed to raise the Java baseline to JDK 17, clearing the way for future performance and security improvements.&lt;/p&gt;
&lt;p&gt;Work is also underway to extend the REST catalog spec. This will improve consistency across catalogs like Polaris and make multi-engine deployments behave more predictably.&lt;/p&gt;
&lt;p&gt;All of this reflects a clear direction. Iceberg is not only stable, but optimized. It is now equipped to support warehouse workloads with ACID guarantees, even on cloud object storage.&lt;/p&gt;
&lt;h2&gt;Apache Polaris&lt;/h2&gt;
&lt;p&gt;Polaris is a new incubating project, but in 2025 it made a fast entrance. Its purpose is simple: act as a shared catalog and governance layer for Iceberg tables across multiple query engines. This includes Spark, Flink, Dremio, Trino, StarRocks, and any system that supports Iceberg&apos;s REST catalog protocol.&lt;/p&gt;
&lt;h3&gt;Why Polaris matters&lt;/h3&gt;
&lt;p&gt;Today, companies often manage Iceberg tables across multiple engines. Each engine needs a way to authenticate, authorize, and operate on metadata safely. Polaris fills that gap. It provides a consistent API, stores policies centrally, and handles short-term credential vending through built-in integrations with cloud providers.&lt;/p&gt;
&lt;p&gt;This makes Polaris one of the first Iceberg-native catalogs to support full multi-engine access, with RBAC and table-level security as first-class features.&lt;/p&gt;
&lt;h3&gt;What shipped in 2025&lt;/h3&gt;
&lt;p&gt;Polaris released three versions in its first year: 1.0.0-incubating in July, 1.1.0 in September, and 1.2.0 in October.&lt;/p&gt;
&lt;p&gt;The first release included core catalog APIs, a PostgreSQL-backed persistence layer, a Quarkus runtime, and initial support for snapshot and compaction policies. It also supported external identity providers, ETag-based caching, and federated metadata views.&lt;/p&gt;
&lt;p&gt;Version 1.1.0 added Hive Metastore integration, support for S3-compatible stores like MinIO, and improvements to modularity and CLI tooling.&lt;/p&gt;
&lt;p&gt;Version 1.2.0 focused on governance. It expanded RBAC, introduced fine-grained update permissions, and added event logging. AWS Aurora IAM login support also shipped, helping teams standardize credentials across engines.&lt;/p&gt;
&lt;h3&gt;What&apos;s coming next&lt;/h3&gt;
&lt;p&gt;Polaris is not standing still. Active mailing list discussions show interest in idempotent commit operations, improved retries, and broader NoSQL compatibility. The project is also planning to support Delta Lake tables through its generic table APIs.&lt;/p&gt;
&lt;p&gt;Polaris is already production-ready for Iceberg. It supports time travel, commit retries, STS credential vending, and a policy-based governance model. These capabilities make it the metadata backbone of an open lakehouse.&lt;/p&gt;
&lt;h2&gt;Apache Parquet&lt;/h2&gt;
&lt;p&gt;Parquet is the disk format most Iceberg tables use. In 2025, the project focused on performance and long-term maintainability. While its interface has changed little, its internals received key upgrades.&lt;/p&gt;
&lt;h3&gt;What shipped in 2025&lt;/h3&gt;
&lt;p&gt;The biggest release was Parquet Java 1.16.0 in September. It removed legacy Hadoop 2 support, raised the Java baseline to 11, and enabled vectorized reads by default. These changes help projects like Iceberg, Trino, and Spark take advantage of faster scan paths with less configuration.&lt;/p&gt;
&lt;p&gt;The update also refreshed core dependencies like Protobuf and Jackson, fixed bugs in nested field casting, and added CLI support for printing size statistics. For teams managing data layout at scale, this makes table introspection simpler and safer.&lt;/p&gt;
&lt;p&gt;On the C++ side, version 12.0 of the Parquet format finalized support for Decimal32 and Decimal64 encodings. These types make aggregations and filters on fixed-point numbers more space-efficient.&lt;/p&gt;
&lt;h3&gt;What&apos;s coming next&lt;/h3&gt;
&lt;p&gt;The Parquet community has begun discussing what a V3 format might look like. Topics include FSST-based string encoding, cleaner metadata layouts, and faster bloom filter indexing. These ideas aim to reduce scan times and improve filter pushdown without breaking compatibility.&lt;/p&gt;
&lt;p&gt;The dev list also revisited lingering V2 features like optional checksums and page-level statistics. There is consensus that these will stabilize in 2026, completing the long tail of V2 work before any format transition.&lt;/p&gt;
&lt;p&gt;Parquet’s future is evolutionary, not disruptive. The team is focused on speed, compatibility, and precision. That’s exactly what Iceberg and other engines need from their storage format.&lt;/p&gt;
&lt;h2&gt;Apache Arrow&lt;/h2&gt;
&lt;p&gt;Arrow provides the in-memory columnar format that many engines use to exchange data without copying or re-encoding. In 2025, the project extended its feature set, added new bindings, and continued improving compute performance.&lt;/p&gt;
&lt;h3&gt;What shipped in 2025&lt;/h3&gt;
&lt;p&gt;Arrow released versions 20.0.0 in April, 21.0.0 in July, and 22.0.0 in October. Each brought changes across the stack, including C++, Python, Java, and R bindings.&lt;/p&gt;
&lt;p&gt;The October release expanded compute functions with new regex matchers, selection kernels, and logical operators. It also improved CSV read/write performance, added support for &lt;code&gt;attrs&lt;/code&gt; in Pandas DataFrames, and stabilized Decimal32 and Decimal64 support across languages.&lt;/p&gt;
&lt;p&gt;Arrow Flight, the RPC layer, shipped a working SQL client implementation. This lays the groundwork for distributed query pushdown using Arrow buffers. Timezone-aware types also advanced, with the community approving a new &lt;code&gt;TimestampWithOffset&lt;/code&gt; type to better handle UTC offsets in analytical workflows.&lt;/p&gt;
&lt;p&gt;Language support improved too. Arrow released official wheels for modern Linux platforms, added MATLAB bindings, and expanded test coverage for R and Julia. These improvements reduce friction when adopting Arrow across new platforms.&lt;/p&gt;
&lt;h3&gt;What&apos;s coming next&lt;/h3&gt;
&lt;p&gt;Arrow’s roadmap points toward broader Flight SQL adoption, faster filter and projection kernels, and more alignment between language libraries. Mailing list discussion shows active work on offset encoding, enum types, and compression improvements.&lt;/p&gt;
&lt;p&gt;More importantly, Arrow is no longer just a format. It’s becoming an interoperability layer for lakehouse engines. With zero-copy sharing across Spark, Dremio, DuckDB, and beyond, Arrow enables the low-latency experience users expect from a warehouse.&lt;/p&gt;
&lt;p&gt;Arrow’s 2025 work reinforced that direction: fast, portable, and deeply integrated with the tools that matter.&lt;/p&gt;
&lt;h2&gt;Wrapping up&lt;/h2&gt;
&lt;p&gt;Apache Iceberg, Polaris, Parquet, and Arrow all pushed forward in 2025. Each project focused on practical features that improve performance, governance, or compatibility. Together, they form a foundation for a warehouse experience on open data.&lt;/p&gt;
&lt;p&gt;This year’s progress wasn’t about experimentation. It was about consolidation. The features that shipped—from deletion vectors to vectorized reads to Flight SQL—are already in production. They make it easier to build, operate, and scale lakehouse systems.&lt;/p&gt;
&lt;p&gt;In 2026, expect the conversation to shift from format maturity to engine convergence. With multi-engine catalogs, index-aware tables, and in-memory interoperability in place, the future looks a lot more accessible. And a lot faster.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>dremioframe &amp; iceberg - Pythonic interfaces for Dremio and Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2025-12-dremioframe-and-iceframe/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-12-dremioframe-and-iceframe/</guid><description>
Modern data teams want simple tools to work with Iceberg tables and Dremio. Two new Python libraries now make that work easier. The first is DremioFr...</description><pubDate>Fri, 05 Dec 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Modern data teams want simple tools to work with Iceberg tables and Dremio. Two new Python libraries now make that work easier. The first is DremioFrame. It gives you a clear set of functions for managing your Dremio Cloud or Dremio Software project through code. The second is IceFrame. It gives you a direct way to create and maintain Iceberg tables using PyIceberg and Polars with native extensions. Both libraries are in alpha. This is the best time to try them, share your ideas, and report issues.&lt;/p&gt;
&lt;p&gt;You can test them with a free 30-day Dremio Cloud trial that includes $400 in credits. Sign up &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;here to get started&lt;/a&gt;. The trial includes a built-in Apache Polaris-based Iceberg catalog (in the UI you&apos;ll see a Namespaces section; that&apos;s the catalog), so you can create tables and explore them from both libraries. This lets you see how the tools fit into real workflows with no setup.&lt;/p&gt;
&lt;p&gt;The goal of both libraries is simple. They remove friction. They give you short, readable code. They help you move from idea to result with less effort. Both also include built-in AI agents that help you generate code with the library and more. Early feedback from real users will shape their future. Your tests and your questions will guide the next steps. This article introduces the two projects and shows how they work together.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/Sj9EKFC.png&quot; alt=&quot;dremioframe and iceframe&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Why These Libraries Exist&lt;/h2&gt;
&lt;p&gt;Python is the language many teams use for data work. People write scripts, build pipelines, and test ideas in notebooks. Yet working with Iceberg tables or the Dremio REST API often means long code and many repeated steps. These two libraries remove that weight.&lt;/p&gt;
&lt;p&gt;DremioFrame gives you a direct way to manage your Dremio catalog, users, views, and jobs. You write clear code that creates folders, defines views, and handles security rules. You no longer need to build each API request by hand.&lt;/p&gt;
&lt;p&gt;IceFrame gives you a focused set of tools for Iceberg tables. You can compact files, evolve partitions, and run maintenance tasks with short commands.&lt;/p&gt;
&lt;p&gt;Both libraries aim to shorten the path from idea to action. They help you test new patterns, share scripts with your team, and work with Iceberg and Dremio in a direct way.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/NgAbTNY.png&quot; alt=&quot;How dremioframe and iceframe work&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Meet DremioFrame&lt;/h2&gt;
&lt;p&gt;DremioFrame is a Python client for Dremio Cloud and Dremio Software. It wraps the REST API in a clean set of methods. You can manage sources, folders, views, tags, and security rules with short commands. You can also run SQL and work with query results as DataFrames.&lt;/p&gt;
&lt;p&gt;The library gives you a clear structure. You access the catalog through &lt;code&gt;client.catalog&lt;/code&gt;. You manage users and roles through &lt;code&gt;client.admin&lt;/code&gt;. You can also manage reflections that speed up queries. Each action is a direct Python call that maps to a known Dremio feature.&lt;/p&gt;
&lt;p&gt;The design is simple. You write code that creates a source, builds a view, assigns a policy, or deletes an item. You do not handle request URLs or version tags yourself. This helps teams move faster and keep their scripts readable.&lt;/p&gt;
&lt;p&gt;DremioFrame fits well in automation. You can create large batches of folders or datasets through parallel calls. You can also use it in small scripts that update a single view. The goal is to make Dremio easier to use in everyday work.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/1G2hEi2.png&quot; alt=&quot;dremioframe leverages the Dremio Engine&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Meet IceFrame&lt;/h2&gt;
&lt;p&gt;IceFrame is a Python library that gives you direct control over Iceberg tables. It focuses on clear commands that help you maintain data and keep tables fast. You can compact small files, sort data, evolve partitions, and clear old snapshots. Each task uses a short call that reflects the action you want to take.&lt;/p&gt;
&lt;p&gt;The library also supports Iceberg views when the catalog allows it. You can define a view with a simple SQL string and replace it when your logic changes. You can also call stored procedures that handle cleanup and maintenance. This includes rewriting files, removing orphan files, and keeping only recent snapshots.&lt;/p&gt;
&lt;p&gt;IceFrame includes an AI assistant for table exploration. You can ask questions in plain language. The tool can show schemas, write example code, and suggest filters or joins. This helps new users learn how the data is shaped and how to work with it.&lt;/p&gt;
&lt;p&gt;The goal is steady control with minimal code. You keep your tables healthy and easy to query. You also gain tools to understand your data without long setup or manual checks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/E9p232c.png&quot; alt=&quot;iceframe&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Why Dremio Cloud Is the Best Place to Try Them&lt;/h2&gt;
&lt;p&gt;Dremio Cloud gives you a smooth way to test both libraries. The trial includes a built-in Iceberg catalog with hosted storage (you can use your own storage with a non-trial account), so you can create tables right away. You do not need to run a separate service or set up extra storage. You write code, create a table, and see it in the Dremio catalog within seconds.&lt;/p&gt;
&lt;p&gt;The free 30-day trial includes $400 in credits. You can sign up at &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;the get started page&lt;/a&gt;. This gives you enough room to explore IceFrame operations, build views with DremioFrame, and test how the two tools work together.&lt;/p&gt;
&lt;p&gt;The setup is light. You create a personal access token, connect through Python, and begin writing code. You can also switch between the console and your scripts to see changes in real time. This makes the trial a strong place for experiments, quick tests, and early feedback.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/ipzlvnA.png&quot; alt=&quot;Dremioframe in action&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/1XHwCGQ.png&quot; alt=&quot;Iceframe in Action&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Using Both Libraries Together&lt;/h2&gt;
&lt;p&gt;You can use IceFrame and DremioFrame in the same workflow. IceFrame lets you create and shape Iceberg tables locally. DremioFrame lets you see those tables in the catalog, build views on top of them alongside other databases, lakes, and warehouses, and apply rules for access or masking. This gives you one flow from data creation to data use.&lt;/p&gt;
&lt;p&gt;A simple pattern looks like this: write to and manage lightweight Iceberg tables with iceframe for local processing, then use dremioframe to tap Dremio for heavy data processing and query federation across databases, lakes, and warehouses, and to curate a semantic and governance layer on top of your data.&lt;/p&gt;
&lt;p&gt;You do not move between many tools. You do not manage long request bodies. You write small blocks of code that express the action you need. This helps teams test new ideas and keep their work easy to read and share.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/BBJMzRh.png&quot; alt=&quot;Using dremioframe and iceframe together&quot;&gt;&lt;/p&gt;
&lt;h2&gt;How to Get Started&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-cloud-dremioframe&quot;&gt;Dremioframe Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pypi.org/project/dremioframe/&quot;&gt;Dremioframe on PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/AlexMercedCoder/iceframe&quot;&gt;Iceframe Repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pypi.org/project/iceframe/&quot;&gt;Iceframe on PyPI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can install both libraries with pip: run &lt;code&gt;pip install dremioframe&lt;/code&gt; and &lt;code&gt;pip install iceframe&lt;/code&gt;. You can then import them in any script or notebook. This gives you direct access to the Dremio catalog and your Iceberg tables.&lt;/p&gt;
&lt;p&gt;You do not need to clone the repos to use the tools. Cloning is only needed if you want to read the source code or contribute changes. Most users will install the packages from PyPI and begin writing code right away.&lt;/p&gt;
&lt;p&gt;After installation, you create a personal access token in Dremio Cloud. You pass that token to DremioFrame when you create the client. You also point IceFrame at your Iceberg catalog. Once this is done, you can create tables, define views, and run cleanup tasks with short commands.&lt;/p&gt;
&lt;h2&gt;Side-by-Side Examples&lt;/h2&gt;
&lt;p&gt;The two libraries serve different roles, but they work well together. The examples below show how to connect, run a simple query, and create a table in each library. The code stays short in both cases.&lt;/p&gt;
&lt;h3&gt;Connect&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;DremioFrame&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremioframe.client import DremioClient

client = DremioClient(
    token=&amp;quot;YOUR_DREMIO_CLOUD_PAT&amp;quot;,
    project_id=&amp;quot;YOUR_PROJECT_ID&amp;quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;IceFrame&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from iceframe import IceFrame

ice = IceFrame(
    {
        &amp;quot;uri&amp;quot;: &amp;quot;https://catalog.dremio.cloud/api/iceberg/v1&amp;quot;,
        &amp;quot;token&amp;quot;: &amp;quot;YOUR_DREMIO_CLOUD_PAT&amp;quot;,
        &amp;quot;project_id&amp;quot;: &amp;quot;YOUR_PROJECT_ID&amp;quot;
    }
)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Run a Query&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;DremioFrame&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.sql.run(&amp;quot;SELECT 1 AS value&amp;quot;)
print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;IceFrame&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;result = ice.query(&amp;quot;some_table&amp;quot;).limit(10).execute()
print(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Create a Table&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;DremioFrame&lt;/strong&gt;
You create a view or dataset through the catalog. Here is a simple view example.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;client.catalog.create_view(
    path=[&amp;quot;Samples&amp;quot;, &amp;quot;small_view&amp;quot;],
    sql=&amp;quot;SELECT * FROM Samples.samples.Employees&amp;quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;IceFrame&lt;/strong&gt;
You create an Iceberg table by writing data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from datetime import datetime, timezone

data = [
    {&amp;quot;id&amp;quot;: 1, &amp;quot;name&amp;quot;: &amp;quot;Ada&amp;quot;, &amp;quot;created_at&amp;quot;: datetime.now(timezone.utc)},
    {&amp;quot;id&amp;quot;: 2, &amp;quot;name&amp;quot;: &amp;quot;Max&amp;quot;, &amp;quot;created_at&amp;quot;: datetime.now(timezone.utc)}
]

ice.create_table(&amp;quot;my_table&amp;quot;, data=data)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These examples show the contrast. DremioFrame works with the Dremio catalog. IceFrame works with Iceberg storage. When used together, they give you a complete path from data creation to query.&lt;/p&gt;
&lt;h2&gt;Query Builder Examples&lt;/h2&gt;
&lt;p&gt;Both libraries include a query builder. Each builder keeps the code readable and avoids long SQL strings. The examples below show how each one works.&lt;/p&gt;
&lt;h3&gt;DremioFrame Query Builder&lt;/h3&gt;
&lt;p&gt;DremioFrame can build SQL through a fluent API. You call &lt;code&gt;client.table(...)&lt;/code&gt; to start. You then add filters, selects, joins, or limits. The builder compiles the final SQL when you run the query.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Start with a table in the Dremio catalog
df = (
    client.table(&amp;quot;Samples.samples.Employees&amp;quot;)
        .select(&amp;quot;employee_id&amp;quot;, &amp;quot;full_name&amp;quot;, &amp;quot;department&amp;quot;)
        .filter(&amp;quot;department = &apos;Engineering&apos;&amp;quot;)
        .limit(5)
        .run()
)

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern helps when you want to build queries from variables or reuse parts of the logic. The SQL stays clean, and the structure is easy to read.&lt;/p&gt;
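Because the builder takes plain strings, you can assemble those strings from variables before handing them to `.filter()`. Here is a stdlib-only sketch; the `build_filter` helper and the column names are illustrative, not part of DremioFrame:

```python
# Combine column/value pairs into a single SQL predicate that could
# then be passed to the builder's .filter() call.
# Note: for real user input, prefer proper escaping or parameterization.
def build_filter(conditions: dict) -> str:
    return " AND ".join(f"{col} = '{val}'" for col, val in conditions.items())

clause = build_filter({"department": "Engineering", "office": "NYC"})
print(clause)  # department = 'Engineering' AND office = 'NYC'
```

The resulting string drops straight into the chained query, so the same helper can serve many queries.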
&lt;h3&gt;IceFrame Query Builder&lt;/h3&gt;
&lt;p&gt;IceFrame includes a builder for Iceberg tables. You call &lt;code&gt;ice.query(&amp;quot;table_name&amp;quot;)&lt;/code&gt; to start. You can then filter rows, pick columns, join tables, or sort results. The builder runs the final plan with &lt;code&gt;execute()&lt;/code&gt;. It determines which parts of the query can be pushed down as Iceberg predicates and which should be applied after scanning the data, for better performance.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from iceframe.expressions import Column

result = (
    ice.query(&amp;quot;my_table&amp;quot;)
        .filter(Column(&amp;quot;id&amp;quot;) &amp;gt; 10)
        .select(&amp;quot;id&amp;quot;, &amp;quot;name&amp;quot;)
        .sort(&amp;quot;id&amp;quot;)
        .limit(5)
        .execute()
)

print(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern keeps the logic simple. You express intent with short steps. The code stays close to how you think about the data.&lt;/p&gt;
&lt;p&gt;Both builders help you avoid long SQL strings. They also make it easier to share examples with your team and adapt them to new cases.&lt;/p&gt;
&lt;h2&gt;Agents and Procedures&lt;/h2&gt;
&lt;p&gt;Both libraries include features that help you work faster with less manual code. Each tool offers an agent that can guide you through common tasks. IceFrame also includes direct access to Iceberg procedures that keep tables healthy.&lt;/p&gt;
&lt;h3&gt;Agents&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;DremioFrame Agent&lt;/strong&gt;&lt;br&gt;
DremioFrame includes an optional agent for working with DremioFrame and Dremio. It can help you write queries, author DremioFrame scripts, and more.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;IceFrame Agent&lt;/strong&gt;&lt;br&gt;
IceFrame includes a chat agent for Iceberg tables. You can ask about table schemas, filters, and joins. The agent can write Python code for common IceFrame tasks. It can also explain how to compact files or clean snapshots. This helps new users understand how each feature works. It also helps teams share patterns in a simple way.&lt;/p&gt;
&lt;h3&gt;IceFrame Procedures&lt;/h3&gt;
&lt;p&gt;IceFrame gives you access to Iceberg maintenance procedures. These keep data clean and reduce the cost of reading tables. You call each procedure with a short command.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Rewrite data files
ice.call_procedure(&amp;quot;my_table&amp;quot;, &amp;quot;rewrite_data_files&amp;quot;, target_file_size_mb=256)

# Remove old snapshots
ice.call_procedure(&amp;quot;my_table&amp;quot;, &amp;quot;expire_snapshots&amp;quot;, older_than_ms=7 * 24 * 3600 * 1000)

# Remove orphan files
ice.call_procedure(&amp;quot;my_table&amp;quot;, &amp;quot;remove_orphan_files&amp;quot;)

# Fast-forward a branch
ice.call_procedure(&amp;quot;my_table&amp;quot;, &amp;quot;fast_forward&amp;quot;, to_branch=&amp;quot;main&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
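The `older_than_ms` argument in the snippet above is a duration in milliseconds (`7 * 24 * 3600 * 1000` is seven days). If the literal math feels opaque, a small stdlib-only helper can make the intent explicit; `days_to_ms` is our own name, not an IceFrame API:

```python
from datetime import timedelta

# Express "older than N days" as milliseconds, matching the style of
# the older_than_ms argument used with expire_snapshots above.
# (days_to_ms is our own helper, not part of IceFrame.)
def days_to_ms(days: int) -> int:
    return int(timedelta(days=days).total_seconds() * 1000)

print(days_to_ms(7))  # 604800000, i.e. 7 * 24 * 3600 * 1000
```

You could then pass `older_than_ms=days_to_ms(7)` to the procedure call and keep the retention window readable.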
&lt;p&gt;These steps help you keep tables tidy. They reduce file counts, remove unused data, and keep history at a safe size. You can schedule them or run them by hand. Paired with the agent, you have a clear path from learning a task to running it.&lt;/p&gt;
&lt;p&gt;The two libraries share a goal. They help you act faster and with less effort. The agents guide you. The procedures handle the work that keeps your tables stable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/F1xxpWP.png&quot; alt=&quot;dremioframe and iceframe&quot;&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introducing dremioframe - A Pythonic DataFrame Interface for Dremio</title><link>https://iceberglakehouse.com/posts/2025-11-introducing-dremioframe-dataframe-python-library/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-11-introducing-dremioframe-dataframe-python-library/</guid><description>
If you&apos;re a data analyst or Python developer who prefers chaining expressive `.select()` and `.mutate()` calls over writing raw SQL, you&apos;re going to ...</description><pubDate>Sat, 29 Nov 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If you&apos;re a data analyst or Python developer who prefers chaining expressive &lt;code&gt;.select()&lt;/code&gt; and &lt;code&gt;.mutate()&lt;/code&gt; calls over writing raw SQL, you&apos;re going to love &lt;code&gt;dremioframe&lt;/code&gt; — the unofficial Python DataFrame library for Dremio (currently in Alpha).&lt;/p&gt;
&lt;p&gt;Dremio has always made it easy to query across cloud and on-prem datasets using SQL. Some users prefer the ergonomics of DataFrame-style APIs, where transformations are composable, readable, and testable — especially when working in notebooks or building data pipelines in Python.&lt;/p&gt;
&lt;p&gt;That’s where &lt;code&gt;dremioframe&lt;/code&gt; comes in. It bridges the gap between SQL and Python by letting you build Dremio queries using intuitive DataFrame methods like &lt;code&gt;.select()&lt;/code&gt;, &lt;code&gt;.filter()&lt;/code&gt;, &lt;code&gt;.mutate()&lt;/code&gt;, and more. Under the hood, it still generates SQL and pushes down queries to Dremio, but you write it the way you&apos;re used to in Python.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Want to try this yourself?&lt;br&gt;
You can &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;sign up for a free 30-day trial of Dremio Cloud&lt;/a&gt;, which includes full access to Agentic AI features, native Apache Iceberg integration, and support for all Iceberg catalogs (e.g. AWS Glue, Nessie, Snowflake, Hive, etc.).&lt;br&gt;
Or if you&apos;d rather run Dremio locally for free, check out the &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition setup guide&lt;/a&gt;. Community Edition doesn’t include Agentic AI or full catalog support, but still lets you run federated queries and work with some Iceberg catalogs like Glue and Nessie.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In this post, we’ll walk through how to get started with &lt;code&gt;dremioframe&lt;/code&gt;—from installing the library and configuring authentication, to writing powerful queries using SQL, DataFrame chaining, and expression builders. We’ll wrap up with a look at some of the more advanced features it unlocks for analytics, ingestion, and administration.&lt;/p&gt;
&lt;p&gt;Let’s dive in.&lt;/p&gt;
&lt;h2&gt;Installing &lt;code&gt;dremioframe&lt;/code&gt; and Setting Up Your Environment&lt;/h2&gt;
&lt;p&gt;To get started, you’ll need to install the &lt;code&gt;dremioframe&lt;/code&gt; Python package. It’s published on PyPI and can be installed with pip:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install dremioframe
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once installed, you’ll need to set up authentication so the library can connect to your Dremio instance. The easiest way to do this is by setting environment variables in a .env file or directly in your shell.&lt;/p&gt;
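Under the hood, environment-based configuration is just a lookup against the process environment. A stdlib-only sketch of that lookup follows; the placeholder values are ours, and in practice the variables come from your shell or a .env loader:

```python
import os

# Demo values only; real values come from your shell or .env file.
os.environ.setdefault("DREMIO_PAT", "example-token")
os.environ.setdefault("DREMIO_PROJECT_ID", "example-project")

pat = os.environ["DREMIO_PAT"]
project_id = os.environ["DREMIO_PROJECT_ID"]
print(pat, project_id)
```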
&lt;h3&gt;For Dremio Cloud (recommended for full feature access):&lt;/h3&gt;
&lt;p&gt;In your .env file (or shell), set the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-env&quot;&gt;DREMIO_PAT=&amp;lt;your_personal_access_token&amp;gt;
DREMIO_PROJECT_ID=&amp;lt;your_project_id&amp;gt;
DREMIO_PROJECT_NAME=&amp;lt;your_project_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These credentials can be generated in your Dremio Cloud account by going to project settings.&lt;/p&gt;
&lt;h4&gt;Don’t have an account?&lt;/h4&gt;
&lt;p&gt;&lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Start your free 30-day trial of Dremio Cloud&lt;/a&gt; to use dremioframe with Agentic AI, native Apache Iceberg support, and full access to all Iceberg catalogs.&lt;/p&gt;
&lt;h3&gt;For Dremio Community Edition (local setup):&lt;/h3&gt;
&lt;p&gt;If you&apos;re running Dremio locally, for example using the Community Edition, you’ll use a different set of environment variables or pass connection parameters directly in code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-env&quot;&gt;DREMIO_HOSTNAME=localhost
DREMIO_PORT=32010
DREMIO_USERNAME=admin
DREMIO_PASSWORD=password123
DREMIO_TLS=false
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Not ready for the cloud yet?&lt;/h4&gt;
&lt;p&gt;You can &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;try the Community Edition locally by following this guide&lt;/a&gt;.
It supports federated queries and works with some Iceberg catalogs (like AWS Glue and Nessie), though it doesn’t include the AI features or full catalog support available in Dremio Cloud and Enterprise.&lt;/p&gt;
&lt;p&gt;With your environment configured, you’re ready to connect to Dremio and start querying like a Pythonista.&lt;/p&gt;
&lt;h2&gt;Creating a Dremio Client (Sync or Async)&lt;/h2&gt;
&lt;p&gt;Once your environment is set up, the next step is to create a &lt;code&gt;DremioClient&lt;/code&gt; instance. This object is your entry point for running queries with &lt;code&gt;dremioframe&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Synchronous Client&lt;/h3&gt;
&lt;p&gt;For most use cases, the synchronous client is sufficient and straightforward to use. If you&apos;ve set your environment variables, you can initialize the client like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremioframe.client import DremioClient

client = DremioClient()  # reads config from environment
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you prefer to pass credentials explicitly (useful in scripts or when using the Community Edition), you can do:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;client = DremioClient(
    hostname=&amp;quot;localhost&amp;quot;,
    port=32010,
    username=&amp;quot;admin&amp;quot;,
    password=&amp;quot;password123&amp;quot;,
    tls=False  # Set to True if connecting over HTTPS
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This sets up a connection to your Dremio instance using standard authentication.&lt;/p&gt;
&lt;h3&gt;Asynchronous Client&lt;/h3&gt;
&lt;p&gt;If you&apos;re working in an async application (e.g., FastAPI, asyncio notebooks, etc.), dremioframe also supports an async client:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremioframe.client import AsyncDremioClient

async with AsyncDremioClient(
    pat=&amp;quot;YOUR_PAT&amp;quot;, 
    project_id=&amp;quot;YOUR_PROJECT_ID&amp;quot;
) as client:
    df = await client.table(&amp;quot;Samples.samples.dremio.com.zips.json&amp;quot;) \
                    .select(&amp;quot;city&amp;quot;, &amp;quot;state&amp;quot;) \
                    .limit(5) \
                    .toPandas()
    print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The async API mirrors the sync one, but allows you to await results in event-driven applications.&lt;/p&gt;
&lt;h2&gt;Running a Pure SQL Query&lt;/h2&gt;
&lt;p&gt;Even though &lt;code&gt;dremioframe&lt;/code&gt; shines with its DataFrame-style interface, you can still execute raw SQL when needed using the &lt;code&gt;.query()&lt;/code&gt; method. This is helpful when you already have a SQL statement or want to run ad hoc queries.&lt;/p&gt;
&lt;p&gt;Here’s a simple example that selects city and state from the sample zips dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.query(&amp;quot;&amp;quot;&amp;quot;
    SELECT city, state
    FROM Samples.samples.dremio.com.zips.json
    WHERE state = &apos;CA&apos;
    ORDER BY city
    LIMIT 10
&amp;quot;&amp;quot;&amp;quot;)

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The result is a lightweight wrapper around a Pandas DataFrame, so you can treat it just like any other DataFrame in Python.&lt;/p&gt;
&lt;p&gt;You can also convert it explicitly to a Pandas DataFrame if needed:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;pdf = df.toPandas()
&lt;/code&gt;&lt;/pre&gt;
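Once converted, the result is ordinary pandas, so the usual DataFrame operations apply. A quick sketch, with made-up rows standing in for real query output:

```python
import pandas as pd

# Made-up rows standing in for the query result above
pdf = pd.DataFrame({"city": ["LOS ANGELES", "SAN DIEGO"], "state": ["CA", "CA"]})

# Any pandas transformation works from here on
pdf["city_title"] = pdf["city"].str.title()
print(pdf["city_title"].tolist())  # ['Los Angeles', 'San Diego']
```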
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Dremio optimizes and accelerates this query under the hood, especially on &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Dremio Cloud&lt;/a&gt;, where features like autonomous reflection caching are applied automatically and require no manual setup.&lt;/p&gt;
&lt;p&gt;If you prefer a hybrid approach, dremioframe allows mixing SQL and DataFrame APIs freely—which we&apos;ll explore next.&lt;/p&gt;
&lt;h2&gt;Querying with &lt;code&gt;.select()&lt;/code&gt; and SQL Functions&lt;/h2&gt;
&lt;p&gt;The real power of &lt;code&gt;dremioframe&lt;/code&gt; comes from its expressive, Pandas-like query builder. You can use &lt;code&gt;.select()&lt;/code&gt; to pick columns and include SQL expressions, just like in raw SQL — but with the clarity and structure of method chaining.&lt;/p&gt;
&lt;p&gt;Let’s say we want to select a few fields and apply a SQL function like &lt;code&gt;UPPER()&lt;/code&gt; to transform the state name:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.table(&amp;quot;Samples.samples.dremio.com.zips.json&amp;quot;) \
           .select(
               &amp;quot;city&amp;quot;, 
               &amp;quot;state&amp;quot;, 
               &amp;quot;pop&amp;quot;, 
               &amp;quot;UPPER(state) AS state_upper&amp;quot;  # using SQL function
           ) \
           .filter(&amp;quot;pop &amp;gt; 100000&amp;quot;) \
           .limit(10) \
           .collect()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This returns 10 rows where the population is over 100,000, including a &lt;code&gt;state_upper&lt;/code&gt; column uppercased by Dremio&apos;s SQL engine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; even though you&apos;re using &lt;code&gt;.select()&lt;/code&gt;, these expressions are passed directly to Dremio and fully optimized as part of the SQL query plan.&lt;/p&gt;
&lt;p&gt;You can freely combine standard column names with SQL functions, aliases, expressions, and computed columns. This lets you build powerful queries without writing SQL directly.&lt;/p&gt;
&lt;p&gt;Want to experiment yourself? Spin up a &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;free Dremio Cloud workspace&lt;/a&gt; or try the &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition on your laptop&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Transforming Data with &lt;code&gt;.mutate()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;While &lt;code&gt;.select()&lt;/code&gt; is great for choosing and computing columns in one go, &lt;code&gt;.mutate()&lt;/code&gt; lets you &lt;strong&gt;add new derived columns&lt;/strong&gt; to an existing selection — much like &lt;code&gt;mutate()&lt;/code&gt; in R or &lt;code&gt;.assign()&lt;/code&gt; in Pandas.&lt;/p&gt;
&lt;p&gt;Let’s take the same query from before and add a new column that calculates population density by dividing population by a fictional land area (just for demo purposes):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.table(&amp;quot;Samples.samples.dremio.com.zips.json&amp;quot;) \
           .select(&amp;quot;city&amp;quot;, &amp;quot;state&amp;quot;, &amp;quot;pop&amp;quot;) \
           .mutate(
               pop_thousands=&amp;quot;pop / 1000&amp;quot;,               # create a scaled version
               pop_label=&amp;quot;CASE WHEN pop &amp;gt; 100000 THEN &apos;large&apos; ELSE &apos;small&apos; END&amp;quot;
           ) \
           .filter(&amp;quot;state = &apos;TX&apos;&amp;quot;) \
           .limit(10) \
           .collect()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pop_thousands&lt;/code&gt; is a new numeric column.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pop_label&lt;/code&gt; is a new string column based on a conditional expression using CASE WHEN.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can pass any SQL-compatible string expression into &lt;code&gt;.mutate()&lt;/code&gt; using &lt;code&gt;column_name=expression&lt;/code&gt; syntax. The expressions are compiled into the underlying SQL query, so performance is fully optimized.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; You can chain multiple .mutate() calls if you prefer smaller, incremental steps.&lt;/p&gt;
&lt;p&gt;Try experimenting with your own columns! If you’re using &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Dremio Cloud&lt;/a&gt;, you can test these queries on larger datasets with full query acceleration and Iceberg table support. Or run &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition&lt;/a&gt; locally to follow along with your own data.&lt;/p&gt;
&lt;h2&gt;Building Queries Programmatically with the Function API&lt;/h2&gt;
&lt;p&gt;For more complex or dynamic queries, &lt;code&gt;dremioframe&lt;/code&gt; provides a powerful &lt;strong&gt;function builder API&lt;/strong&gt; through the &lt;code&gt;F&lt;/code&gt; module — similar to how PySpark or dplyr work. This lets you construct expressions programmatically rather than writing raw SQL strings.&lt;/p&gt;
&lt;p&gt;Let’s rewrite the previous example using &lt;code&gt;F&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremioframe import F

df = client.table(&amp;quot;Samples.samples.dremio.com.zips.json&amp;quot;) \
           .select(
               F.col(&amp;quot;city&amp;quot;),
               F.col(&amp;quot;state&amp;quot;),
               F.col(&amp;quot;pop&amp;quot;),
               (F.col(&amp;quot;pop&amp;quot;) / 1000).alias(&amp;quot;pop_thousands&amp;quot;),
               F.case()
                 .when(F.col(&amp;quot;pop&amp;quot;) &amp;gt; 100000, F.lit(&amp;quot;large&amp;quot;))
                 .else_(F.lit(&amp;quot;small&amp;quot;))
                 .end()
                 .alias(&amp;quot;pop_label&amp;quot;)
           ) \
           .filter(F.col(&amp;quot;state&amp;quot;) == F.lit(&amp;quot;TX&amp;quot;)) \
           .limit(10) \
           .collect()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;What’s happening here?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;F.col(&amp;quot;column_name&amp;quot;)&lt;/code&gt; references a column.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;F.case().when(...).else_(...).end()&lt;/code&gt; builds a SQL &lt;code&gt;CASE WHEN&lt;/code&gt; expression.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;F.lit(&amp;quot;value&amp;quot;)&lt;/code&gt; injects a literal value into the expression.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Arithmetic operations like &lt;code&gt;/&lt;/code&gt; can be done using standard Python operators.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This method is especially useful when building queries dynamically — for instance, choosing which fields to include or filter based on user input.&lt;/p&gt;
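The dynamic part is ordinary Python: build the list of expressions first, then unpack it into the builder's `.select(*exprs)` call. A stdlib-only sketch; the `choose_columns` helper is illustrative, not part of dremioframe:

```python
# Decide which expressions to request based on user input; the
# resulting list would be unpacked into the builder's .select() call.
def choose_columns(include_density: bool) -> list[str]:
    exprs = ["city", "state", "pop"]
    if include_density:
        exprs.append("pop / 1000 AS pop_thousands")
    return exprs

print(choose_columns(True))
```

The same pattern works for filters: collect conditions in a list, then apply them in a loop over the chained query.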
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; You can mix function objects with standard strings if needed. Just make sure each expression passed to &lt;code&gt;.select()&lt;/code&gt; or &lt;code&gt;.mutate()&lt;/code&gt; is either a string or an &lt;code&gt;F&lt;/code&gt; object.&lt;/p&gt;
&lt;p&gt;Want to try building dynamic queries against Iceberg tables or REST-ingested datasets? Sign up for &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Dremio Cloud&lt;/a&gt; or use &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition&lt;/a&gt; to test these locally.&lt;/p&gt;
&lt;h2&gt;What Else Can &lt;code&gt;dremioframe&lt;/code&gt; Do?&lt;/h2&gt;
&lt;p&gt;By now, you’ve seen how &lt;code&gt;dremioframe&lt;/code&gt; lets you run SQL, build DataFrame-style queries, and programmatically compose logic using expressions. But there’s much more under the hood.&lt;/p&gt;
&lt;p&gt;Here’s a quick overview of some additional capabilities you might find useful:&lt;/p&gt;
&lt;h3&gt;🔄 Joins, Unions, and Time Travel&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Join tables with &lt;code&gt;.join()&lt;/code&gt;, &lt;code&gt;.left_join()&lt;/code&gt;, &lt;code&gt;.right_join()&lt;/code&gt;, or &lt;code&gt;.full_join()&lt;/code&gt; using either SQL expressions or &lt;code&gt;F&lt;/code&gt; functions.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;.union()&lt;/code&gt; to combine rows from two datasets.&lt;/li&gt;
&lt;li&gt;Query historical snapshots of Iceberg tables using &lt;code&gt;.at_snapshot(&amp;quot;SNAPSHOT_ID&amp;quot;)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df = client.table(&amp;quot;sales&amp;quot;).at_snapshot(&amp;quot;123456789&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg time travel is fully supported in Dremio Cloud and Dremio Enterprise.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Ingest External Data&lt;/h3&gt;
&lt;p&gt;You can pull data from REST APIs and ingest it directly into Dremio:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;client.ingest_api(
    url=&amp;quot;https://jsonplaceholder.typicode.com/posts&amp;quot;,
    table_name=&amp;quot;sandbox.api_posts&amp;quot;,
    mode=&amp;quot;merge&amp;quot;
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also insert Pandas DataFrames into Dremio tables using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;client.table(&amp;quot;sandbox.my_table&amp;quot;).insert(&amp;quot;sandbox.my_table&amp;quot;, data=pd_df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Analyze, Visualize, and Export&lt;/h3&gt;
&lt;p&gt;Use &lt;code&gt;.group_by()&lt;/code&gt; with aggregates like &lt;code&gt;.sum()&lt;/code&gt;, &lt;code&gt;.count()&lt;/code&gt;, &lt;code&gt;.mean()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Sort with &lt;code&gt;.order_by()&lt;/code&gt;, paginate with &lt;code&gt;.offset()&lt;/code&gt;, and chart using &lt;code&gt;.chart()&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.chart(kind=&amp;quot;bar&amp;quot;, x=&amp;quot;state&amp;quot;, y=&amp;quot;pop&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Export results to local files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.to_csv(&amp;quot;output.csv&amp;quot;)
df.to_parquet(&amp;quot;output.parquet&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Data Quality Checks&lt;/h3&gt;
&lt;p&gt;Built-in expectations let you validate your data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.quality.expect_not_null(&amp;quot;pop&amp;quot;)
df.quality.expect_column_values_to_be_between(&amp;quot;pop&amp;quot;, min=1, max=1000000)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Admin and Debug Tools&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Create and manage reflections (Dremio&apos;s Unique Acceleration Layer).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Retrieve and inspect job profiles with &lt;code&gt;.get_job_profile()&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use &lt;code&gt;.explain()&lt;/code&gt; to debug SQL plans:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.explain()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Asynchronous Queries &amp;amp; CLI Access&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use &lt;code&gt;AsyncDremioClient&lt;/code&gt; for non-blocking workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run queries via the command-line tool &lt;code&gt;dremio-cli&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Pro tip&lt;/strong&gt;: Want to test features like data ingestion, Iceberg catalog browsing, and AI-powered analytics? &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;Dremio Cloud’s 30-day trial&lt;/a&gt; gives you full access. For local development, &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Community Edition&lt;/a&gt; is a great way to experiment.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;dremioframe&lt;/code&gt; is still evolving, but it&apos;s already a powerful toolkit for Pythonic analytics on top of Dremio’s lakehouse engine. Whether you&apos;re running federated queries, ingesting external APIs, or interacting with Iceberg tables, it helps you stay in the Python world while leveraging all the power of Dremio under the hood.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Whether you&apos;re an analyst who loves the clarity of chained DataFrame operations, or a Python developer looking to integrate Dremio into your data pipelines, &lt;code&gt;dremioframe&lt;/code&gt; offers a compelling, flexible, and powerful interface to Dremio&apos;s lakehouse capabilities.&lt;/p&gt;
&lt;p&gt;With just a few lines of code, you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Connect securely to Dremio Cloud or Community Edition&lt;/li&gt;
&lt;li&gt;Run raw SQL or chain DataFrame-style queries&lt;/li&gt;
&lt;li&gt;Add computed columns with &lt;code&gt;.mutate()&lt;/code&gt; or build expressions with the &lt;code&gt;F&lt;/code&gt; API&lt;/li&gt;
&lt;li&gt;Work with federated sources, Apache Iceberg tables, and even ingest external data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By using &lt;code&gt;dremioframe&lt;/code&gt;, you get the best of both worlds: the expressiveness of Python and the performance of Dremio’s SQL engine.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Don’t forget — you can &lt;a href=&quot;https://drmevn.fyi/am-get-started&quot;&gt;sign up for a free 30-day trial of Dremio Cloud&lt;/a&gt; to experience all the advanced features like Agentic AI and native support for all Iceberg catalogs.&lt;br&gt;
Or, if you&apos;re experimenting locally, &lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;try Community Edition&lt;/a&gt; to run federated queries and interact with Glue or Nessie-based Iceberg tables.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;code&gt;dremioframe&lt;/code&gt; project is still evolving, but it’s already a powerful toolkit for building readable, maintainable, and scalable data workflows in Python. Give it a try and let us know what you build.&lt;/p&gt;
&lt;h2&gt;NOTE&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;dremioframe&lt;/code&gt; is an unofficial library and currently in Alpha. Please submit any issues or pull requests to the &lt;a href=&quot;https://github.com/developer-advocacy-dremio/dremio-cloud-dremioframe?tab=readme-ov-file&quot;&gt;git repo&lt;/a&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Comprehensive Hands-on Walk Through of Dremio Cloud Next Gen (Hands-on with Free Trial)</title><link>https://iceberglakehouse.com/posts/2025-11-dremio-next-gen-cloud-tutorial/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-11-dremio-next-gen-cloud-tutorial/</guid><description>
[Video Playlist of this Walkthrough](https://www.youtube.com/playlist?list=PL-gIUf9e9CCvY0bcRBGu2SzFFR-yJGIB6)

On November 13, at the [Subsurface Lak...</description><pubDate>Wed, 12 Nov 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/playlist?list=PL-gIUf9e9CCvY0bcRBGu2SzFFR-yJGIB6&quot;&gt;Video Playlist of this Walkthough&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;On November 13, at the &lt;a href=&quot;https://www.dremio.com/subsurface?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=nextgencloudtut&amp;amp;utm_content=alexmerced&quot;&gt;Subsurface Lakehouse Conference&lt;/a&gt; in New York City, Dremio announced and released &lt;a href=&quot;https://www.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=nextgencloudtut&amp;amp;utm_content=alexmerced&quot;&gt;Dremio Next Gen Cloud&lt;/a&gt;, the most complete and accessible version of its Lakehouse Platform to date. This release advances Dremio’s mission to make data lakehouses easy, fast, and affordable for organizations of any size.&lt;/p&gt;
&lt;p&gt;This tutorial offers a hands-on introduction to Dremio and walks through the new free trial experience. With managed storage and no need to connect your own infrastructure or enter a credit card (until you want to), you can explore the full platform, including new AI features, Autonomous Performance Management, and the integrated lakehouse catalog, right away.&lt;/p&gt;
&lt;h2&gt;What is Dremio?&lt;/h2&gt;
&lt;p&gt;Dremio is a Data Lakehouse Platform for the AI era. Let&apos;s explore what this means.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is a Data Lakehouse?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A data lakehouse is an architecture that uses your data lake (object storage or Hadoop) as the primary data store for flexibility and openness, then adds two layers to operationalize it like a data warehouse:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A table format such as Apache Iceberg, Delta Lake, Apache Hudi, or Apache Paimon. These formats allow structured datasets stored in Apache Parquet files to be treated as individual tables with ACID guarantees, snapshot isolation, time travel, and more—rather than just a collection of files without these capabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A lakehouse catalog that tracks your lakehouse tables and other assets. It serves as the central access point for data discovery and access control.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio is designed to unify these modular lakehouse components into a seamless experience. Unlike platforms that treat Iceberg as an add-on to proprietary formats, Dremio is built to be natively Iceberg-first—delivering a warehouse-like experience without vendor lock-in.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Challenges of the Data Lakehouse&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;While lakehouses offer the benefit of serving as a central source of truth across tools, they come with practical challenges during implementation.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;How do you make your lakehouse work alongside other data that isn’t yet in the lakehouse?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;How do you optimize storage as data files accumulate and become inefficient after multiple updates and snapshots?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Which catalog should you use, and how do you deploy and maintain it for secure, governed access?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How Dremio’s Platform Supports the Lakehouse&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Dremio simplifies many of these challenges with a platform that makes your lakehouse feel like it “just works.” It does this through several powerful features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Federation&lt;/strong&gt;: Dremio is one of the fastest engines for Apache Iceberg queries, but it also connects to and queries other databases, data lakes, data warehouses, and catalogs efficiently. This means you can start using Dremio with your existing data infrastructure and transition to a full lakehouse setup over time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integrated Catalog&lt;/strong&gt;: Dremio includes a built-in Iceberg catalog, ready to use from day one. This catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is based on Apache Polaris, the community-led standard for lakehouse catalogs&lt;/li&gt;
&lt;li&gt;Automatically optimizes Iceberg table storage, eliminating manual tuning&lt;/li&gt;
&lt;li&gt;Provides governance for both Iceberg tables and SQL views with role-based and fine-grained access controls&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;End-to-End Performance Management&lt;/strong&gt;: Managing query performance can be time-consuming. Dremio reduces this burden by automatically clustering Iceberg tables and applying multiple layers of caching. One key feature is Autonomous Reflections, which accelerate queries behind the scenes based on actual usage patterns—improving performance before users even notice a problem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Semantic and Context Layer&lt;/strong&gt;: Dremio includes a built-in semantic layer where you can define business concepts using SQL views, track lineage, and write documentation. This structure not only supports consistent usage across teams but also provides valuable context to AI systems for more accurate analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AI-Native Features&lt;/strong&gt;: Dremio Next Gen Cloud includes a built-in AI agent that can run queries, generate documentation, and create visualizations. For external AI systems, the MCP server gives agents access to both data and semantics. New AI functions also let you work with unstructured data for expanded analytical possibilities.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio aims to provide a familiar and easy SQL interface to all your data.&lt;/p&gt;
&lt;h2&gt;Registering For Dremio Trial&lt;/h2&gt;
&lt;p&gt;To get started with your Dremio Trial, head over to the &lt;a href=&quot;https://www.dremio.com/get-started/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=nextgencloudtut&amp;amp;utm_content=alexmerced&quot;&gt;Getting Started Page&lt;/a&gt; and create a new account with your preferred method.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/ut2jBNf.png&quot; alt=&quot;Getting Started Page with Dremio&quot;&gt;&lt;/p&gt;
&lt;p&gt;If you sign up with Google/Microsoft/GitHub you&apos;ll be all set right after authenticating; if you sign up with your email, you&apos;ll receive an email to confirm your registration.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/fnzehGU.png&quot; alt=&quot;Confirmation Email&quot;&gt;&lt;/p&gt;
&lt;p&gt;When you create a new Dremio account, it automatically creates a new &lt;code&gt;Organization&lt;/code&gt;, which can contain multiple &lt;code&gt;Projects&lt;/code&gt;. The organization will be assigned a default name, which you can change later.&lt;/p&gt;
&lt;p&gt;On the next screen, you’ll name your first project. This initial project will use Dremio’s managed storage as the default storage for the lakehouse catalog.&lt;/p&gt;
&lt;p&gt;If you prefer to use your own data lake as catalog storage, you can create a new project when you&apos;re ready. Currently, only Amazon S3 is supported for custom catalog storage, with additional options coming soon.&lt;/p&gt;
&lt;p&gt;Even though S3 is the only supported option for Dremio Catalog storage at the moment, Dremio still allows you to connect to other Iceberg catalogs backed by any cloud storage solution and data lakes using its wide range of source connectors.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/fnzehGU.png&quot; alt=&quot;Choosing your Dremio Region and Project Name&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now you&apos;ll be on your Dremio Dashboard where you&apos;ll wait a few minutes for your organization to be provisioned.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/JlPAxrI.png&quot; alt=&quot;Provisioning of Dremio Project&quot;&gt;&lt;/p&gt;
&lt;p&gt;Once the environment is provisioned you&apos;ll see several options including a chat box to work with the new integrated Dremio AI Agent which we will revisit later in this tutorial.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/lO0aeGl.png&quot; alt=&quot;The Dremio environment is now active!&quot;&gt;&lt;/p&gt;
&lt;p&gt;One of the best ways to get started is by adding data to Dremio. Click &lt;code&gt;add data&lt;/code&gt; to open a window where you can either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Connect an existing Database, Data Lake, Data Warehouse or Data Lakehouse catalog to begin querying data you have in other platforms&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Upload a CSV, JSON or Parquet file, which Dremio will convert into an Iceberg table in the Dremio Catalog for you to query.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you&apos;re looking for some sample files to upload, &lt;a href=&quot;https://www.kaggle.com/&quot;&gt;Kaggle&lt;/a&gt; is always a good place to find datasets to play with.&lt;/p&gt;
&lt;p&gt;For this tutorial, though, let&apos;s use SQL to create tables in the Dremio Catalog, insert records into them, and then query them.&lt;/p&gt;
&lt;h2&gt;Curating Your Lakehouse&lt;/h2&gt;
&lt;p&gt;Let&apos;s visit the dataset explorer to see how you&apos;ll navigate your integrated catalog and other data sources. In the menu on the left of the screen, click the second icon from the top (it looks like a table); this will take you to the dataset explorer.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/iBnA0TP.png&quot; alt=&quot;Dremio&apos;s Navigation Menu&quot;&gt;&lt;/p&gt;
&lt;p&gt;In the dataset explorer you&apos;ll see two sections:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Namespaces&lt;/strong&gt;: This is the native Apache Polaris based catalog that belongs to your project. You create namespaces as top-level folders to organize your Apache Iceberg tables and SQL views (a view&apos;s SQL can reference both Iceberg and non-Iceberg datasets, such as joining an Iceberg table with a Snowflake table).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sources&lt;/strong&gt;: These are the other sources you&apos;ve connected to Dremio using Dremio&apos;s connectors. You can open up a source and see all the tables available inside of it from the dataset explorer.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Please click on the plus sign next to &amp;quot;namespaces&amp;quot; and add a new namespace called &amp;quot;dremio&amp;quot;. This will let you run the SQL scripts I give you later without needing to modify them.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/OSoBEP0.png&quot; alt=&quot;Adding a new namespace&quot;&gt;&lt;/p&gt;
&lt;p&gt;Now you&apos;ll see the new &lt;code&gt;dremio&lt;/code&gt; namespace, and in there we can create new tables and views. You may notice there is already a sample data namespace, which includes a variety of sample data you can experiment with if you want.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/3PfTjp1.png&quot; alt=&quot;The new namespace has been added to Dremio&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Running Some SQL&lt;/h2&gt;
&lt;p&gt;Now head over to the &amp;quot;SQL Runner&amp;quot;, a full SQL IDE built right into the Dremio experience that includes autocomplete, syntax highlighting, function lookup, the typical IDE shortcuts and much more. It is accessed by clicking the third menu icon, which looks like a mini terminal window.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/lKIFs6c.png&quot; alt=&quot;The Dremio SQL Runner&quot;&gt;&lt;/p&gt;
&lt;p&gt;Let me call out a few things to your attention:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;On the left you&apos;ll notice a column where you can browse available datasets. You can drag dataset names or column names from here into your queries so you don&apos;t have to type them out every time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You&apos;ll notice this column has a second tab called Scripts. You can save the SQL in any tab as a script to come back to later, which is great for template scripts or scripts you haven&apos;t finished yet.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The SQL editor is on top and the results viewer is on the bottom. If you run multiple SQL statements, the results viewer gives you a tab for each query&apos;s result, making it easy to isolate the results of different parts of your script.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is a lot more to learn about the SQL Runner, but let&apos;s go ahead and run some SQL. I&apos;ve written several SQL scripts you can copy into the SQL Runner and run as is. Choose any of the scripts below, copy it into the SQL Runner, and run it. Give the code a look over; the comments help explain what it is doing.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/finance_example.sql&quot;&gt;Finance Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/gov_example.sql&quot;&gt;Government Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/healthcare_example.sql&quot;&gt;Healthcare Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/insurance_example.sql&quot;&gt;Insurance Example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/manufacturing.sql&quot;&gt;Manufacturing Example with Data Health Checks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/retail.sql&quot;&gt;Retail Example with Physical Transformations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/apache-iceberg-lakehouse-workshop/blob/main/industry-examples/supply_chain_example.sql&quot;&gt;Supply Chain Example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The SQL for the majority of these examples follows a similar pattern:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create a subfolder for the example&lt;/li&gt;
&lt;li&gt;Create bronze/silver/gold subfolders within that subfolder&lt;/li&gt;
&lt;li&gt;Create and insert data into the base tables in the bronze (raw) folder&lt;/li&gt;
&lt;li&gt;Join and model the data using a SQL view to create the silver layer&lt;/li&gt;
&lt;li&gt;Create use-case-specific views from the silver view in the gold layer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This represents a very typical way of using Dremio: you model your datasets not by replicating data but logically with SQL views. Dremio&apos;s Autonomous Reflections feature observes how these views are queried and dynamically determines which views should be materialized into Dremio&apos;s reflection cache, without anyone having to lift a finger, keeping everything performant while conserving storage and compute. A data engineer can also manually create a reflection, and Dremio will assign that reflection a score to help you understand whether it is providing value; we&apos;ll show this when we go over Dremio&apos;s settings UI.&lt;/p&gt;
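&lt;p&gt;As a rough sketch of this pattern (with hypothetical folder and table names; the linked industry scripts are the complete versions), the SQL looks something like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Folders for the example and its bronze/silver/gold layers
CREATE FOLDER IF NOT EXISTS dremio.sales;
CREATE FOLDER IF NOT EXISTS dremio.sales.bronze;
CREATE FOLDER IF NOT EXISTS dremio.sales.silver;
CREATE FOLDER IF NOT EXISTS dremio.sales.gold;

-- Bronze: raw base table with sample data
CREATE TABLE IF NOT EXISTS dremio.sales.bronze.orders (order_id INT, customer_id INT, amount DOUBLE);
INSERT INTO dremio.sales.bronze.orders VALUES (1, 100, 19.99), (2, 101, 5.49);

-- Silver: cleaned, modeled data as a logical SQL view (no data copied)
CREATE OR REPLACE VIEW dremio.sales.silver.orders_clean AS
SELECT order_id, customer_id, amount
FROM dremio.sales.bronze.orders
WHERE amount IS NOT NULL;

-- Gold: use-case-specific view built on the silver view
CREATE OR REPLACE VIEW dremio.sales.gold.revenue_by_customer AS
SELECT customer_id, SUM(amount) AS total_revenue
FROM dremio.sales.silver.orders_clean
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;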
&lt;p&gt;Two of the examples do something a little different:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The Manufacturing example uses SQL views to create a medallion architecture, but it then also creates health check views which allow you to get a quick glance at the health of a dataset by aggregating nulls, duplicates, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The retail example shows you what it may look like if you don&apos;t want to model your data logically but physically using CTAS (Create Table As) statements to materialize the data as physical tables through the bronze/silver/gold layers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;AI Visualizations&lt;/h2&gt;
&lt;p&gt;Now that you&apos;ve added some data to your lakehouse by running some of my example scripts, let&apos;s look at some of the benefits of Dremio&apos;s new integrated AI Agent.&lt;/p&gt;
&lt;p&gt;If you are planning to build long-lived BI dashboards, you should still leverage the integrations most BI tools like Tableau and Power BI have with Dremio. But if you just need to see something visual quickly, or need something to screenshot into a presentation, the AI Agent can be quite helpful.&lt;/p&gt;
&lt;p&gt;So let&apos;s say I run the supply chain example. I could then send the AI Agent a prompt like:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Can you create a bar chart based on the data in the gold layer of my supply chain data&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The AI Agent will pull the list of tables my particular user has access to and see if it can find the dataset I loosely referred to. It will then pull the schema and other data from the semantic layer to better understand the dataset before deciding how best to visualize it. But of course my prompt was vague, so the AI Agent decides to confirm some details with me:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;I can create that bar chart — quick clarifying questions before I run the query
I inspected your gold layer and found the view dremio.supply_chain.gold.daily_supplier_delivery_kpis with these columns:

supplier_id (string)
receipt_date (date)
lines_received (int)
otif_rate (double)
on_time_rate (double)
in_full_rate (double)
total_received_qty (int)
total_order_qty (int)
gross_fill_rate (decimal)
To produce the bar chart I need to know which metric and timeframe you want visualized. Common options:

Average OTIF per supplier (bar per supplier showing average “otif_rate” across the dataset)
Average gross fill rate per supplier (bar per supplier showing average “gross_fill_rate”)
Total lines received per supplier for a specific date or date range (sum of “lines_received”)
Side-by-side bars per supplier showing two metrics (e.g., OTIF and gross_fill_rate) — I can produce grouped bars if you confirm.
Suggested defaults (if you don’t specify):

Metric: avg(otif_rate)
Timeframe: all available dates
Chart: vertical bar, supplier on x-axis, metric on y-axis, ordered by metric desc
If that works, I will run this SQL against the gold view and build the bar chart:

SELECT “supplier_id”, AVG(“otif_rate”) AS “avg_otif_rate”, AVG(“gross_fill_rate”) AS “avg_gross_fill_rate” FROM “first-project”.“dremio”.“supply_chain”.“gold”.“daily_supplier_delivery_kpis” — optionally add: WHERE “receipt_date” BETWEEN DATE ‘2025-08-01’ AND DATE ‘2025-08-31’ GROUP BY “supplier_id” ORDER BY “avg_otif_rate” DESC

Please confirm:

Which metric? (otif_rate, gross_fill_rate, lines_received, total_received_qty, etc.)
Timeframe (all data, a date, or a date range). If you say a month/week/day, I will assume the current year unless you specify otherwise.
Chart type (single bar, grouped bars for two metrics).
After you confirm I will run the query and produce the bar chart for you.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I respond with&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;1. Average OTIF per supplier
2. All data
3. Vertical Bar
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It then explains its thought process to me to illustrate where it needs more detail. At this point I notice it is including the project name &lt;code&gt;first-project&lt;/code&gt; in the table name, which is not necessary since the project is assumed; the path should start with the namespace &lt;code&gt;dremio&lt;/code&gt;. I tell it this and get back my visualization.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/KeHWL8h.png&quot; alt=&quot;Generating a Visualization with the Dremio AI Agent&quot;&gt;&lt;/p&gt;
&lt;p&gt;We started with quite a vague question, but by working with the AI we were able to get a visualization of a useful metric within a few minutes.&lt;/p&gt;
&lt;h2&gt;AI Functions&lt;/h2&gt;
&lt;p&gt;Creating visualizations from your data isn&apos;t the only cool AI integration in the Dremio arsenal. Dremio has also added a variety of new SQL AI functions that let you do things like turn unstructured data into structured data. Let&apos;s see a very simple example you can run right in your SQL Runner, assuming you have a &lt;code&gt;dremio&lt;/code&gt; namespace.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create the recipes table with an ARRAY column for ingredients (sample rows)
-- Note: this uses CREATE TABLE AS SELECT to create a physical table with sample data.
CREATE FOLDER IF NOT EXISTS dremio.recipes;
CREATE TABLE IF NOT EXISTS dremio.recipes.recipes AS
SELECT 1 AS &amp;quot;id&amp;quot;,
       &apos;Mild Salsa&apos; AS &amp;quot;name&amp;quot;,
       ARRAY[&apos;tomato&apos;,&apos;onion&apos;,&apos;cilantro&apos;,&apos;jalapeno&apos;,&apos;lime&apos;] AS &amp;quot;ingredients&amp;quot;,
       CURRENT_TIMESTAMP AS &amp;quot;created_at&amp;quot;
UNION ALL
SELECT 2, &apos;Medium Chili&apos;, ARRAY[&apos;beef&apos;,&apos;tomato&apos;,&apos;onion&apos;,&apos;chili powder&apos;,&apos;cumin&apos;,&apos;jalapeno&apos;], CURRENT_TIMESTAMP
UNION ALL
SELECT 3, &apos;Spicy Vindaloo&apos;, ARRAY[&apos;chicken&apos;,&apos;chili&apos;,&apos;ginger&apos;,&apos;garlic&apos;,&apos;vinegar&apos;,&apos;habanero&apos;], CURRENT_TIMESTAMP;

-- Create View where AI is used to classify each recipe as Mild, Medium or Spicy
CREATE OR REPLACE VIEW dremio.recipes.recipes_enhanced AS SELECT id,
       name,
       ingredients,
       AI_CLASSIFY(&apos;Identify the Spice Level:&apos; || ARRAY_TO_STRING(ingredients, &apos;,&apos;), ARRAY [ &apos;mild&apos;, &apos;medium&apos;, &apos;spicy&apos; ]) AS spice_level
FROM   dremio.recipes.recipes;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The first SQL statement creates a table of recipes where the ingredients are an array of strings.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The second SQL statement creates a view where we use the AI_CLASSIFY function to have the AI decide, given the ingredients, whether the recipe is &lt;code&gt;mild&lt;/code&gt;, &lt;code&gt;medium&lt;/code&gt; or &lt;code&gt;spicy&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/PiJGMmF.png&quot; alt=&quot;The Dremio AI Functions&quot;&gt;&lt;/p&gt;
&lt;p&gt;You can also use these AI functions to pull data from JSON files or folders of images to generate structured datasets. Imagine taking a folder of scans of paper applications and turning them into an Iceberg table with all the right fields by having the AI read the images; this is the kind of use case these functions make possible.&lt;/p&gt;
&lt;h2&gt;Dremio Jobs Pane&lt;/h2&gt;
&lt;p&gt;Want to see what queries are coming in, or investigate why a query failed or took longer than expected? The Dremio Jobs pane, the next option on the left menu, lets you see all your jobs and click into them for exhaustive detail on how they were processed.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/EhD18PE.png&quot; alt=&quot;Dremio Job Pane&quot;&gt;&lt;/p&gt;
&lt;h2&gt;Dremio Settings&lt;/h2&gt;
&lt;p&gt;If you click on the last menu item, the gear, you&apos;ll get two options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Project Settings&lt;/li&gt;
&lt;li&gt;Org Settings&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Project Settings&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/YCfoaMz.png&quot; alt=&quot;Dremio Project Settings&quot;&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can find project info like:
&lt;ul&gt;
&lt;li&gt;Project name and ID (project names are fixed; org names can change)&lt;/li&gt;
&lt;li&gt;The MCP server URL for connecting external AI agents to your Dremio instance&lt;/li&gt;
&lt;li&gt;The JDBC URL for connecting to Dremio from external JDBC clients and custom scripts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; SQL can be sent to Dremio for execution outside of Dremio&apos;s UI using JDBC, ODBC, Apache Arrow Flight and Dremio&apos;s REST API. Refer to docs.dremio.com for documentation on how to leverage these interfaces.&lt;/p&gt;
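&lt;p&gt;For example, here is a minimal sketch of sending SQL to Dremio over Arrow Flight from Python using &lt;code&gt;pyarrow&lt;/code&gt;. The endpoint and header details are assumptions based on Dremio Cloud defaults, so verify them against docs.dremio.com for your environment.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os
from pyarrow import flight

# Assumed Dremio Cloud Arrow Flight endpoint; check docs.dremio.com for yours
client = flight.FlightClient(&apos;grpc+tls://data.dremio.cloud:443&apos;)

# Authenticate each call with a PAT sent as a bearer token header
pat = os.environ[&apos;DREMIO_PAT&apos;]
options = flight.FlightCallOptions(headers=[(b&apos;authorization&apos;, f&apos;bearer {pat}&apos;.encode())])

# Submit a query, then fetch the results as an Arrow table
descriptor = flight.FlightDescriptor.for_command(&apos;SELECT 1 AS one&apos;)
info = client.get_flight_info(descriptor, options)
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_all())
&lt;/code&gt;&lt;/pre&gt;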
&lt;p&gt;Also in project settings you&apos;ll find sections like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Catalog: To update catalog settings like how often metadata refresh should happen. (&lt;strong&gt;Note:&lt;/strong&gt; Lineage view is based on the metadata as of the last refresh so if something isn&apos;t reflected in lineage the metadata may not have refreshed yet. You can either make the metadata refresh more frequently or wait till it refreshes on its current schedule.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Engines: For managing your different Dremio execution engines, what is their size, when they should spin up and when they should spin down&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;BI Tools: Enable or disable Tableu and Power BI buttons&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitor: Dashboard to monitor Dremio project health&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reflections: See scores on reflections you have created, you can also delete reflections if you no longer need them from here&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Engine Routing: Create rules for which jobs should go to which engines, for example jobs from certain users may be routed to their own engine which is tracked for charge backs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Preferences: Turn on and off certain Dremio features&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Organization Settings&lt;/h3&gt;
&lt;p&gt;Under Organization settings you&apos;ll find:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The name of the org, which can be changed&lt;/li&gt;
&lt;li&gt;Manage authentication protocols&lt;/li&gt;
&lt;li&gt;Manage projects, users, and roles&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;User Settings&lt;/h3&gt;
&lt;p&gt;At the very bottom left corner there is a button for the individual user&apos;s settings. The main uses for this are switching between dark and light mode and creating PATs (personal access tokens) for authenticating external clients.&lt;/p&gt;
&lt;h2&gt;Granting Access&lt;/h2&gt;
&lt;p&gt;Once you create new non-admin users in your Dremio org, they&apos;ll have zero access to anything so you&apos;ll need to give them precise access to particular projects, namespaces, folders, sources etc.&lt;/p&gt;
&lt;p&gt;While you can do this for an individual user, it will likely be easier to create &amp;quot;roles&amp;quot; so you can grant access to groups of users at once. Below is an example of the kind of SQL you might use to grant a new user access to a single namespace.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- Give Permissions to project
GRANT SELECT, VIEW REFLECTION, VIEW JOB HISTORY, USAGE, MONITOR,
       CREATE TABLE, INSERT, UPDATE, DELETE, DROP, ALTER, EXTERNAL QUERY, ALTER REFLECTION, OPERATE
ON PROJECT
TO USER &amp;quot;alphatest2user@alexmerced.com&amp;quot;;

-- Give Permissions to Namespace in Catalog
GRANT ALTER, USAGE, SELECT, WRITE, DROP on FOLDER &amp;quot;dremio&amp;quot; to USER &amp;quot;alphatest2user@alexmerced.com&amp;quot;;

-- Give Permissions to a Folder in the namespace
GRANT ALTER, USAGE, SELECT, WRITE, DROP on FOLDER dremio.recipes to USER &amp;quot;alphatest2user@alexmerced.com&amp;quot;;
&lt;/code&gt;&lt;/pre&gt;
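&lt;p&gt;Following the role-based approach mentioned above, here is a sketch of what that could look like with a hypothetical &lt;code&gt;analysts&lt;/code&gt; role (double-check the exact grant syntax for your Dremio version on docs.dremio.com):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- Create a role and grant it access to the namespace once
CREATE ROLE analysts;
GRANT USAGE, SELECT ON FOLDER &amp;quot;dremio&amp;quot; TO ROLE analysts;

-- Then add users to the role as needed
GRANT ROLE analysts TO USER &amp;quot;alphatest2user@alexmerced.com&amp;quot;;
&lt;/code&gt;&lt;/pre&gt;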
&lt;h2&gt;Connecting your Dremio Catalog to Other Engines Like Spark&lt;/h2&gt;
&lt;p&gt;Now you can connect to the Dremio platform using JDBC/ODBC/ADBC-Flight/REST and send SQL for Dremio to execute, which I hope you take full advantage of. Sometimes, though, you are sharing a dataset in your catalog with someone else who wants to use their preferred compute tool. Because the Dremio Catalog is based on Apache Polaris, it supports the Apache Iceberg REST Catalog spec, meaning it can connect to pretty much any tool that supports Apache Iceberg. Below is an example of how you&apos;d connect from Spark.&lt;/p&gt;
&lt;p&gt;Run a local Spark environment using the following command:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;docker run -p 8888:8888 -e DREMIO_PAT={YOUR PAT TOKEN} alexmerced/spark35nb:latest&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Then use the following code to run spark code against Dremio Catalog (keep in mind the CATALOG_NAME variable should match your project name).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Dremio Cloud catalog endpoints; the PAT is read from an environment variable
DREMIO_CATALOG_URI = &amp;quot;https://catalog.dremio.cloud/api/iceberg&amp;quot;
DREMIO_AUTH_URI = &amp;quot;https://login.dremio.cloud/oauth/token&amp;quot;
DREMIO_PAT = os.environ.get(&apos;DREMIO_PAT&apos;)
CATALOG_NAME = &amp;quot;first-project&amp;quot; # should be project name

if not DREMIO_PAT:
    raise ValueError(&amp;quot;Please set the DREMIO_PAT environment variable.&amp;quot;)

# Configure Spark session with Iceberg and Dremio catalog settings
conf = (
    pyspark.SparkConf()
        .setAppName(&apos;DremioIcebergSparkApp&apos;)
        # Required external packages; swap in the FileIO bundle for your storage (iceberg-aws-bundle, iceberg-azure-bundle or iceberg-gcp-bundle, all 1.9.2)
        .set(&apos;spark.jars.packages&apos;, &apos;org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.2,com.dremio.iceberg.authmgr:authmgr-oauth2-runtime:0.0.5,org.apache.iceberg:iceberg-aws-bundle:1.9.2&apos;)
        # Enable Iceberg Spark extensions
        .set(&apos;spark.sql.extensions&apos;, &apos;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions&apos;)
        # Define Dremio catalog configuration using RESTCatalog
        .set(&apos;spark.sql.catalog.dremio&apos;, &apos;org.apache.iceberg.spark.SparkCatalog&apos;)
        .set(&apos;spark.sql.catalog.dremio.catalog-impl&apos;, &apos;org.apache.iceberg.rest.RESTCatalog&apos;)
        .set(&apos;spark.sql.catalog.dremio.uri&apos;, DREMIO_CATALOG_URI)
        .set(&apos;spark.sql.catalog.dremio.warehouse&apos;, CATALOG_NAME)  # Set to your Dremio project name
        .set(&apos;spark.sql.catalog.dremio.cache-enabled&apos;, &apos;false&apos;)
        .set(&apos;spark.sql.catalog.dremio.header.X-Iceberg-Access-Delegation&apos;, &apos;vended-credentials&apos;)
        # Configure OAuth2 authentication using PAT
        .set(&apos;spark.sql.catalog.dremio.rest.auth.type&apos;, &apos;com.dremio.iceberg.authmgr.oauth2.OAuth2Manager&apos;)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.token-endpoint&apos;, DREMIO_AUTH_URI)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.grant-type&apos;, &apos;token_exchange&apos;)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.client-id&apos;, &apos;dremio&apos;)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.scope&apos;, &apos;dremio.all&apos;)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.token-exchange.subject-token&apos;, DREMIO_PAT)
        .set(&apos;spark.sql.catalog.dremio.rest.auth.oauth2.token-exchange.subject-token-type&apos;, &apos;urn:ietf:params:oauth:token-type:dremio:personal-access-token&apos;)
)

# Initialize Spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(&amp;quot;✅ Spark session connected to Dremio Catalog.&amp;quot;)

# Step 1: Create a namespace (schema) in the Dremio catalog
spark.sql(&amp;quot;CREATE NAMESPACE IF NOT EXISTS dremio.db&amp;quot;)
# spark.sql(&amp;quot;CREATE NAMESPACE IF NOT EXISTS dremio.db.test1&amp;quot;)
print(&amp;quot;✅ Namespaces Created&amp;quot;)

# Step 2: Create sample Iceberg tables in the Dremio catalog
spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE IF NOT EXISTS dremio.db.customers (
    id INT,
    name STRING,
    email STRING
)
USING iceberg
&amp;quot;&amp;quot;&amp;quot;)

spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE IF NOT EXISTS dremio.db.orders (
    order_id INT,
    customer_id INT,
    amount DOUBLE
)
USING iceberg
&amp;quot;&amp;quot;&amp;quot;)

print(&amp;quot;✅ Tables Created&amp;quot;)

# Step 3: Insert sample data into the tables
customers_data = [
    Row(id=1, name=&amp;quot;Alice&amp;quot;, email=&amp;quot;alice@example.com&amp;quot;),
    Row(id=2, name=&amp;quot;Bob&amp;quot;, email=&amp;quot;bob@example.com&amp;quot;)
]

orders_data = [
    Row(order_id=101, customer_id=1, amount=250.50),
    Row(order_id=102, customer_id=2, amount=99.99)
]

print(&amp;quot;✅ Dataframes Generated&amp;quot;)

customers_df = spark.createDataFrame(customers_data)
orders_df = spark.createDataFrame(orders_data)

customers_df.writeTo(&amp;quot;dremio.db.customers&amp;quot;).append()
orders_df.writeTo(&amp;quot;dremio.db.orders&amp;quot;).append()

print(&amp;quot;✅ Tables created and sample data inserted.&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Dremio Next Gen Cloud represents a major leap forward in making the data lakehouse experience seamless, powerful, and accessible. Whether you&apos;re just beginning your lakehouse journey or modernizing a complex data environment, Dremio gives you the tools to work faster and smarter—with native Apache Iceberg support, AI-powered features, and a fully integrated catalog.&lt;/p&gt;
&lt;p&gt;From federated queries across diverse sources to autonomous performance tuning, Dremio abstracts away the operational headaches so you can focus on delivering insights. And with built-in AI capabilities, you&apos;re not just managing data—you’re unlocking its full potential.&lt;/p&gt;
&lt;p&gt;If you haven’t already, &lt;a href=&quot;https://www.dremio.com/get-started/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pag&amp;amp;utm_term=nextgencloudtut&amp;amp;utm_content=alexmerced&quot;&gt;sign up for your free trial&lt;/a&gt; and start building your lakehouse—no infrastructure or credit card required.&lt;/p&gt;
&lt;p&gt;The next generation of analytics is here. Time to explore what’s possible.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>2025-2026 Guide to Learning about Apache Iceberg, Data Lakehouse &amp; Agentic AI</title><link>https://iceberglakehouse.com/posts/2025-10-2026-guide-to-learning-lakehouse-iceberg-agentic-ai/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-2026-guide-to-learning-lakehouse-iceberg-agentic-ai/</guid><description>
The data world is evolving fast. Just a few years ago, building a modern analytics stack meant stitching together tools, ETL pipelines, and compromis...</description><pubDate>Thu, 23 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The data world is evolving fast. Just a few years ago, building a modern analytics stack meant stitching together tools, ETL pipelines, and compromises. Today, open standards like Apache Iceberg, modular architectures like the data lakehouse, and emerging patterns like Agentic AI are reshaping how teams store, manage, and use data.&lt;/p&gt;
&lt;p&gt;But with all this innovation comes one challenge: where do you start?&lt;/p&gt;
&lt;p&gt;This guide was created to answer that question. Whether you&apos;re a data engineer exploring the Iceberg table format, an architect building a lakehouse, or a developer curious about AI agents that interact with real-time data, this resource will walk you through it. No hype. No fluff. Just a curated directory of the best learning paths, tools, and concepts to help you build a practical foundation.&lt;/p&gt;
&lt;p&gt;We will break down the links into categories to help you find what you are looking for. There is more content beyond what I have listed here; the two directories below are good places to explore further.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Blog Directories&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Developer Hub which has an OSS Blogroll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;The Lakehouse Blog Directory&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;
I&apos;ve had the honor of participating in some long-form written content around the lakehouse. Many of these books are available for free at the links below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;
Below are some links where you can network with other lakehouse enthusiasts and discover lakehouse conferences and meetups near you!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Data Lakehouse&lt;/h2&gt;
&lt;p&gt;The idea behind a data lakehouse is simple: keep the flexibility of a data lake, add the performance and structure of a warehouse, and make it all accessible from one place. But turning that idea into a working architecture takes more than just buzzwords. In this section, you&apos;ll find tutorials, architectural guides, and practical walkthroughs that explain how lakehouses work, when they make sense, and how to get started, whether you’re running everything on object storage or looking to unify data access across teams and tools.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/the-2025-and-2026-ultimate-guide?r=h4f8p&quot;&gt;2026 Guide to the Data Lakehouse Ecosystem&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/looking-back-the-last-year-in-lakehouse-oss-advances-in-apache-arrow-iceberg-polaris-incubating/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Looking back the last year in Lakehouse OSS: Advances in Apache Arrow, Iceberg &amp;amp; Polaris (incubating)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/scaling-data-lakes-moving-from-raw-parquet-to-iceberg-lakehouses/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Scaling Data Lakes: Moving from Raw Parquet to Iceberg Lakehouses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/5-ways-dremio-makes-apache-iceberg-lakehouses-easy/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;5 Ways Dremio Makes Apache Iceberg Lakehouses Easy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Guide to Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Apache Iceberg&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is the table format that makes data lakehouses actually work. It brings support for ACID transactions, schema evolution, time travel, and scalable performance to your cloud storage, without locking you into a vendor or engine. If you’ve ever wrestled with Hive tables or brittle partitioning logic, this section is for you. Here, you&apos;ll find beginner-friendly resources, deep dives into metadata and catalogs, and hands-on guides for working with Iceberg using engines like Spark, Flink, and Dremio.&lt;/p&gt;
&lt;h3&gt;What are Lakehouse Open Table Formats&lt;/h3&gt;
&lt;p&gt;Table formats are the backbone of the modern lakehouse. They define how data files are organized, versioned, and transacted, bringing warehouse‑level reliability to open storage. This section explores what makes formats like Apache Iceberg, Delta Lake, and Apache Hudi so important. You’ll learn how they handle schema evolution, partitioning, and ACID transactions while staying engine‑agnostic, ensuring your data remains open, performant, and ready for any workload.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;What is a Data Lakehouse Table Format&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/the-ultimate-guide-to-open-table?r=h4f8p&quot;&gt;Ultimate Guide to Open Table Formats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/exploring-the-architecture-of-apache-iceberg-delta-lake-and-apache-hudi/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg Tutorials&lt;/h3&gt;
&lt;p&gt;Getting hands‑on is the fastest way to learn Apache Iceberg. In these tutorials, you’ll spin up local environments, run your first SQL commands, and connect Iceberg tables with catalogs like Apache Polaris or engines like Spark and Dremio. Each guide walks you through setup, basic operations, and troubleshooting so you can move from theory to practice without friction.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/tutorial-intro-to-apache-iceberg?r=h4f8p&quot;&gt;Intro to Iceberg with Apache Spark, Apache Polaris &amp;amp; Minio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/try-apache-polaris-incubating-on-your-laptop-with-minio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Try Apache Polaris (incubating) on Your Laptop with Minio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Iceberg Migration Tooling and Ingestion&lt;/h3&gt;
&lt;p&gt;Moving existing datasets into Apache Iceberg doesn’t have to be painful. This section highlights migration patterns, ingestion tools, and automation workflows that make it easier to adopt Iceberg at scale. You’ll find step‑by‑step resources covering snapshot‑based migrations, bulk ingests, and hybrid models that help teams modernize data lakes while minimizing downtime and duplication.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/migration-guide-for-apache-iceberg-lakehouses/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Migration Guide for Apache Iceberg Lakehouses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/8-tools-for-ingesting-data-into-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;8 Tools For Ingesting Data Into Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Iceberg Catalogs&lt;/h3&gt;
&lt;p&gt;A table format is only as useful as the catalog that organizes it. Iceberg catalogs manage metadata, access control, and engine interoperability, essential pieces of a production lakehouse. In this section, you’ll explore the expanding catalog ecosystem, from open implementations like Apache Polaris to commercial and hybrid options. These resources explain how catalogs enable discoverability, governance, and smooth multi‑engine coordination across your data environment.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/an-exploration-of-the-commercial?r=h4f8p&quot;&gt;An Exploration of the Commercial Ecosystem of Iceberg Catalogs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/building-a-universal-lakehouse-catalog?r=h4f8p&quot;&gt;Building a Universal Lakehouse Catalog: Catalogs beyond Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-growing-apache-polaris-ecosystem-the-growing-apache-iceberg-catalog-standard/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;The Growing Apache Polaris Ecosystem (The Growing Apache Iceberg Catalog Standard)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg Table Optimization&lt;/h3&gt;
&lt;p&gt;Keeping Iceberg tables fast requires more than good schema design. Over time, data fragmentation, small files, and metadata sprawl can slow queries and inflate costs. The articles in this section show how to maintain healthy tables through compaction, clustering, and automatic optimization. You’ll also learn how modern platforms like Dremio manage this maintenance autonomously so performance tuning doesn’t become a full‑time job.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/optimizing-iceberg-tables/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Optimizing Apache Iceberg Tables – Manual and Automatic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-table-performance-management-with-dremios-optimize/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Apache Iceberg Table Performance Management with Dremio’s OPTIMIZE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/minimizing-iceberg-table-management-with-smart-writing/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Minimizing Iceberg Table Management with Smart Writing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-table-storage-management-with-dremios-vacuum-table/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Apache Iceberg Table Storage Management with Dremio’s VACUUM TABLE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/@alexmercedtech/materialization-and-acceleration-in-the-iceberg-lakehouse-era-comparing-dremio-trino-doris-de3c96413b1a&quot;&gt;Materialization and Query Optimization in the Iceberg Era&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Iceberg Technical Deep Dives&lt;/h3&gt;
&lt;p&gt;Once you understand the basics, the real fun begins. These deep dives unpack how Iceberg works under the hood, covering metadata structures, query caching, authentication, and advanced performance topics. Whether you’re benchmarking, extending the format, or building your own catalog integration, this section will help you understand Iceberg’s architecture and internal mechanics in detail.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/query-results-caching-on-iceberg-tables/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Query Results Caching on Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/benchmarking-framework-for-the-apache-iceberg-catalog-polaris/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Benchmarking Framework for the Apache Iceberg Catalog, Polaris&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/too-many-roundtrips-metadata-overhead-in-the-modern-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Too Many Roundtrips: Metadata Overhead in the Modern Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/introducing-dremio-auth-manager-for-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Introducing Dremio Auth Manager for Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/dremios-apache-iceberg-clustering-technical-blog/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Dremio’s Apache Iceberg Clustering: Technical Blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Future of Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Apache Iceberg continues to evolve alongside emerging workloads like Agentic AI and next‑generation file formats. This section looks ahead at what’s coming: new format versions, engine integrations, and evolving standards such as Polaris and REST catalogs. If you want to stay informed on where Iceberg is heading and how it fits into the broader open‑data movement, start here.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/exploring-the-evolving-file-format-landscape-in-ai-era-parquet-lance-nimble-and-vortex-and-what-it-means-for-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Exploring the Evolving File Format Landscape in AI Era: Parquet, Lance, Nimble and Vortex And What It Means for Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-v3/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;What’s New in Apache Iceberg Format Version 3?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/the-state-of-apache-iceberg-v4-october-2025-edition-c186dc29b6f5&quot;&gt;The State of Apache Iceberg v4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Agentic AI&lt;/h2&gt;
&lt;p&gt;Agentic AI is a new class of systems that don’t just answer questions, they take action. These agents make decisions, follow workflows, and learn from outcomes, but they’re only as smart as the data they can access. That’s where open lakehouse architectures come in. This section explores the intersection of data architecture and autonomous systems, with content focused on how to power agents using structured, governed, and real-time data from your Iceberg-based lakehouse. From semantic layers to zero-ETL federation, you&apos;ll see what it takes to build AI that isn&apos;t just reactive, but genuinely useful.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-model-context-protocol-mcp-a-beginners-guide-to-plug-and-play-agents/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;The Model Context Protocol (MCP): A Beginner’s Guide to Plug-and-Play Agents&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/understanding-rpc-and-mcp-in-agentic?r=h4f8p&quot;&gt;Understanding the Role of RPC in Agentic AI &amp;amp; MCP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/who-benefits-from-mcp-on-analytics-platforms/&quot;&gt;Who Benefits From MCP on an Analytics Platform?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/tutorial-multi-agent-collaboration?r=h4f8p&quot;&gt;Tutorial: Multi-Agent Collaboration with LangChain, MCP, and Google A2A Protocol&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/composable-analytics-with-agents?r=h4f8p&quot;&gt;Composable Analytics with Agents: Leveraging Virtual Datasets and the Semantic Layer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/unlocking-the-power-of-agentic-ai?r=h4f8p&quot;&gt;Unlocking the Power of Agentic AI with Dremio and Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-agentic-ai-needs-a-data-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Why Agentic AI Needs a Data Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/testing-mcp-integration-in-existing-data-pipelines/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Test Driving MCP: Is Your Data Pipeline Ready to Talk?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/using-dremios-mcp-server-with-agentic-ai-frameworks/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Using Dremio’s MCP Server with Agentic AI Frameworks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/using-the-dremio-mcp-server-with-any-llm-model/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Using Dremio MCP with any LLM Model&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-dremio-reflections-give-agentic-ai-a-unique-edge/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;How Dremio Reflections Give Agentic AI a Unique Edge&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/optimizing-apache-iceberg-for-agentic-ai/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg&amp;amp;utm_term=2026-content-guide&amp;amp;utm_content=alexmerced&quot;&gt;Optimizing Apache Iceberg for Agentic AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>An Exploration of the Commercial Iceberg Catalog Ecosystem</title><link>https://iceberglakehouse.com/posts/2025-10-exploring-commerical-apache-iceberg-catalogs/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-exploring-commerical-apache-iceberg-catalogs/</guid><description>
**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](ht...</description><pubDate>Tue, 21 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg has quickly become the table format of choice for building open, flexible, and high-performance data lakehouses. It solves long-standing issues around schema evolution, ACID transactions, and engine interoperability, enabling a shared, governed data layer across diverse compute environments.&lt;/p&gt;
&lt;p&gt;But while the table format itself is open and standardized, the catalog layer, the system responsible for tracking and exposing table metadata, is where key decisions begin to shape your architecture. How your organization selects and manages an Iceberg catalog can influence everything from query performance to write flexibility to vendor lock-in risk.&lt;/p&gt;
&lt;p&gt;This blog explores the current landscape of commercial Iceberg catalogs, focusing on the emerging Iceberg REST Catalog (IRC) standard and how different vendors interpret and implement it. We’ll examine where catalogs prioritize cross-engine interoperability, where they embed proprietary optimization features, and how organizations can approach these trade-offs strategically.&lt;/p&gt;
&lt;p&gt;You’ll also learn what options exist when native optimizations aren’t available, including how to design your own or consider a catalog-neutral optimization tool like Ryft.io (when using cloud object storage).&lt;/p&gt;
&lt;p&gt;By the end, you&apos;ll have a clear view of the commercial ecosystem, and a framework to help you choose a path that fits your technical goals while minimizing operational friction.&lt;/p&gt;
&lt;h2&gt;The Role of Iceberg REST Catalogs in the Modern Lakehouse&lt;/h2&gt;
&lt;p&gt;At the heart of every Apache Iceberg deployment is a catalog. It’s more than just a registry of tables, it’s the control plane for transactions, schema changes, and metadata access. And thanks to the Apache Iceberg REST Catalog (IRC) specification, catalogs no longer need to be tightly coupled to any single engine.&lt;/p&gt;
&lt;p&gt;The IRC defines a standardized, HTTP-based API that lets query engines like Spark, Trino, Flink, Dremio, and others communicate with a catalog in a consistent way. That means developers can write data from one engine and read it from another, without worrying about format mismatches or metadata drift.&lt;/p&gt;
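&lt;p&gt;To make the spec&apos;s shape concrete, here is a minimal sketch of the endpoint paths an IRC client composes. The base URI and prefix below are hypothetical placeholders; the path layout follows the public Iceberg REST Catalog OpenAPI specification.&lt;/p&gt;

```python
# Sketch of how the Iceberg REST Catalog (IRC) spec decouples engines from
# catalogs: every compliant client, in any language, builds the same small
# set of HTTP paths. BASE and PREFIX are placeholder values for illustration.

BASE = "https://catalog.example.com/api"  # hypothetical catalog endpoint
PREFIX = "prod"  # catalogs may scope tenants or warehouses behind a prefix


def namespaces_url() -> str:
    """Path an engine uses to list namespaces in the catalog."""
    return f"{BASE}/v1/{PREFIX}/namespaces"


def table_url(namespace: str, table: str) -> str:
    """Path an engine uses to load (or commit to) a table's metadata."""
    return f"{BASE}/v1/{PREFIX}/namespaces/{namespace}/tables/{table}"


# Any engine or language that can issue HTTP requests can speak to the
# catalog through these paths, which is what makes writes from one engine
# readable from another.
print(table_url("analytics", "orders"))
```

Because the surface area is just HTTP plus JSON, clients like PyIceberg in Python or engine connectors in Java and Rust all interoperate against the same catalog without sharing any code.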
&lt;p&gt;This decoupling brings two major benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Multi-language support&lt;/strong&gt;: Since the interface is language-agnostic, you can interact with the catalog from tools written in Java, Python, Rust, or Go.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute independence&lt;/strong&gt;: Query and write operations don’t require the catalog to be embedded in the engine, everything runs through REST.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Adoption of the IRC spec is growing rapidly. Vendors like Dremio, Snowflake, Google, and Databricks now offer catalogs that expose some or all of the REST API. This trend signals a broader shift toward open metadata services, where engine choice is driven by workload needs, not infrastructure constraints.&lt;/p&gt;
&lt;p&gt;But as we’ll see next, implementing the REST API is only part of the story. The real architectural decisions start when you consider &lt;strong&gt;how these catalogs handle optimization, write access, and cross-engine consistency&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Key Considerations When Choosing a Catalog&lt;/h2&gt;
&lt;p&gt;Picking a catalog shapes how your Iceberg lakehouse runs. The decision affects who can read and write data, how tables stay performant, and how easy it is to run multiple engines. Focus on the facts, and match catalog capabilities to your operational needs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Read-write interoperability.&lt;/strong&gt;&lt;br&gt;
Some catalogs expose the full Iceberg REST Catalog APIs so any compatible engine can read and write tables. Other offerings restrict external writes or recommend using specific engines for writes. These differences change how you design ingestion and cross-engine workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Server-side performance features.&lt;/strong&gt;&lt;br&gt;
Catalogs vary in how much they manage table health for you. A few provide automated compaction, delete-file handling, and lifecycle management. Others leave those tasks to your teams and to open-source engines. If you want fewer operational jobs, prioritize a catalog with built-in performance management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vendor neutrality versus added convenience.&lt;/strong&gt;&lt;br&gt;
A catalog that automates maintenance reduces day-to-day work. It also increases dependency on that vendor’s maintenance model. If your priority is full independence across engines then you may prefer a catalog that implements the Iceberg REST spec faithfully so you can plan for external maintenance processes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Costs and compatibility.&lt;/strong&gt;&lt;br&gt;
Some catalogs work only with certain storage providers, and some charge for catalog usage even when you bring your own compute. Factor both constraints into your evaluation.&lt;/p&gt;
&lt;p&gt;A short checklist for evaluating a candidate catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Does it implement the Iceberg REST Catalog APIs for both reads and writes?&lt;/li&gt;
&lt;li&gt;Does it provide automatic table maintenance or only catalog services?&lt;/li&gt;
&lt;li&gt;What write restrictions or safety guards exist for external engines?&lt;/li&gt;
&lt;li&gt;Which clouds and storage systems does it support?&lt;/li&gt;
&lt;li&gt;Are there extra costs to using the catalog?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use this checklist when you compare offerings. It helps reveal trade-offs between operational simplicity and multi-engine freedom.&lt;/p&gt;
&lt;h2&gt;Catalog Optimization: Native vs. Neutral Approaches&lt;/h2&gt;
&lt;p&gt;Once your Iceberg tables are in place, keeping them fast and cost-effective becomes a daily concern. File sizes grow unevenly, delete files stack up, and query times creep higher. This is where table optimization comes in—and where catalog differences start to matter.&lt;/p&gt;
&lt;p&gt;Most commercial catalogs fall into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Native Optimization Available&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manual Optimization Required&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Native Optimization Available&lt;/h3&gt;
&lt;p&gt;Vendors like &lt;strong&gt;Dremio&lt;/strong&gt;, &lt;strong&gt;AWS Glue&lt;/strong&gt;, and &lt;strong&gt;Databricks Unity Catalog&lt;/strong&gt; offer built-in optimization features that automatically manage compaction, delete file cleanup, and snapshot pruning. These features are often tightly integrated into their orchestration layers or compute engines.&lt;/p&gt;
&lt;p&gt;Benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No need to schedule Spark or Flink jobs manually&lt;/li&gt;
&lt;li&gt;Optimizations are triggered based on metadata activity&lt;/li&gt;
&lt;li&gt;Helps reduce cloud storage costs and improve query performance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tradeoff:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;These features are often proprietary and non-transferable. If you move catalogs or engines, you may lose automation and need to build optimization pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Catalog-Neutral or Manual Optimization&lt;/h3&gt;
&lt;p&gt;Some catalogs, including open-source options like Apache Polaris, don&apos;t come with built-in optimization. Instead, you have two options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Run your own compaction pipelines&lt;/strong&gt; using engines like Spark or Flink. You can also manually orchestrate Dremio&apos;s OPTIMIZE and VACUUM commands with any catalog.&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;catalog-neutral optimization service&lt;/strong&gt; like &lt;strong&gt;Ryft.io&lt;/strong&gt;, which works with any REST-compatible catalog but currently supports storage only on AWS, Azure, or GCP. There is also the open-source Apache Amoro, which automates Spark-based optimizations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This route offers maximum flexibility but requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Engineering effort to configure and monitor compaction&lt;/li&gt;
&lt;li&gt;Knowledge of best practices for tuning optimization jobs&lt;/li&gt;
&lt;li&gt;A way to coordinate across engines to avoid conflicting writes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short: if optimization is a feature you want off your plate, look for a catalog that handles it natively. If you prefer full control or need a more cloud-agnostic setup, neutral optimization tools or open workflows may serve you better.&lt;/p&gt;
&lt;h2&gt;What If Native Optimization Doesn’t Exist?&lt;/h2&gt;
&lt;p&gt;Not every catalog includes built-in optimization. If you&apos;re using a minimal catalog, or one that prioritizes openness over orchestration, you’ll need to handle performance tuning another way. That’s not a dealbreaker, but it does require a decision.&lt;/p&gt;
&lt;p&gt;Here are the two main paths forward when native optimization isn’t part of the package:&lt;/p&gt;
&lt;h3&gt;Option 1: Build Your Own Optimization Pipelines&lt;/h3&gt;
&lt;p&gt;Apache Iceberg is fully compatible with open engines like &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Flink&lt;/strong&gt;, and &lt;strong&gt;Dremio&lt;/strong&gt;. Each of these supports table maintenance features such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File compaction&lt;/li&gt;
&lt;li&gt;Manifest rewriting&lt;/li&gt;
&lt;li&gt;Snapshot expiration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can schedule these jobs using tools like Airflow or dbt, or embed them directly into your data ingestion flows. This approach works in any environment, including on-prem, hybrid, and cloud.&lt;/p&gt;
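&lt;p&gt;As a rough illustration, the three maintenance tasks above map to stored procedures shipped with the Iceberg Spark runtime. The sketch below renders the SQL a scheduled Airflow task might submit; the catalog and table names are placeholders, and the retention setting is just an example value.&lt;/p&gt;

```python
# Hedged sketch of a DIY maintenance job: render the Spark SQL for the three
# tasks named above. The procedure names (rewrite_data_files,
# rewrite_manifests, expire_snapshots) come from the Iceberg Spark runtime;
# the catalog and table identifiers here are hypothetical.

CATALOG = "my_catalog"   # placeholder Spark catalog name
TABLE = "db.events"      # placeholder Iceberg table


def maintenance_sql(catalog: str, table: str) -> list:
    """Return the CALL statements a scheduled job would submit, in order."""
    return [
        # 1. File compaction: coalesce small data files into larger ones
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # 2. Manifest rewriting: keep table metadata compact and clustered
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
        # 3. Snapshot expiration: drop old snapshots and unreferenced files
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', retain_last => 5)",
    ]


for stmt in maintenance_sql(CATALOG, TABLE):
    print(stmt)
```

An orchestrator would pass each statement to `spark.sql(...)` on a session configured with the Iceberg extensions; the same three tasks can also be expressed through Flink or Dremio equivalents.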
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Complete flexibility in how and when you optimize&lt;/li&gt;
&lt;li&gt;Can tailor jobs to match data patterns and storage costs&lt;/li&gt;
&lt;li&gt;Fully open and vendor-independent&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires engineering effort to build, monitor, and tune jobs&lt;/li&gt;
&lt;li&gt;No centralized UI or automation unless you build one&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Option 2: Use a Catalog-Neutral Optimization Vendor&lt;/h3&gt;
&lt;p&gt;Vendors like Ryft.io offer managed optimization services designed specifically for Iceberg. These tools run outside your query engines and handle compaction, cleanup, and layout improvements without relying on any one catalog or engine.&lt;/p&gt;
&lt;p&gt;NOTE: If you are looking for an open-source option, Apache Amoro offers self-hosted optimization automation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Key detail&lt;/strong&gt;: Ryft currently only supports deployments that store data in &lt;strong&gt;AWS S3&lt;/strong&gt;, &lt;strong&gt;Azure Data Lake&lt;/strong&gt;, or &lt;strong&gt;Google Cloud Storage&lt;/strong&gt;. If you&apos;re using on-prem HDFS or other object stores, this may not be viable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No need to manage optimization logic&lt;/li&gt;
&lt;li&gt;Works across multiple compute engines and catalogs&lt;/li&gt;
&lt;li&gt;Keeps optimization decoupled from platform lock-in&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Limited to major cloud object storage unless using Apache Amoro&lt;/li&gt;
&lt;li&gt;Adds another vendor and billing model to your stack&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When native optimization isn’t available, the best path depends on your team’s appetite for operational work. DIY gives you control. Neutral services give you speed. Either way, optimization remains a critical layer—whether you manage it yourself or let someone else handle it.&lt;/p&gt;
&lt;h2&gt;5. The Interoperability Spectrum&lt;/h2&gt;
&lt;p&gt;One of the key promises of Apache Iceberg is engine interoperability. The Iceberg REST Catalog API was designed so any compliant engine—whether it&apos;s Spark, Flink, Trino, or Dremio—can access tables the same way. But in practice, not all catalogs offer equal levels of interoperability.&lt;/p&gt;
&lt;p&gt;Some catalogs expose full &lt;strong&gt;read/write access&lt;/strong&gt; to external engines using the REST API. Others allow only reads—or place restrictions on how writes must be performed. This creates a spectrum, where catalogs differ in how open or engine-specific they are.&lt;/p&gt;
&lt;p&gt;Here’s how several major catalogs compare:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Catalog&lt;/th&gt;
&lt;th&gt;External Read Access&lt;/th&gt;
&lt;th&gt;External Write Access&lt;/th&gt;
&lt;th&gt;REST Spec Coverage&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dremio Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;Based on Apache Polaris; full multi-engine support; no cost for external reads/writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Polaris (Open Source)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;Vendor-neutral, open REST catalog, deploy yourself or get managed by Dremio or Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Databricks Unity Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;Optimization services are primarily centered on Delta Lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Glue &amp;amp; AWS S3 Tables&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google BigLake Metastore&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full (preview)&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snowflake Open Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;Based on Apache Polaris; Charged for requests to catalog from external reads/writes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snowflake Managed Tables&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;❌ None&lt;/td&gt;
&lt;td&gt;❌ None&lt;/td&gt;
&lt;td&gt;Tables can be externally read using Snowflake&apos;s SDK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Microsoft OneLake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full (Preview)&lt;/td&gt;
&lt;td&gt;✅ Virtualized Writes&lt;/td&gt;
&lt;td&gt;✅ Full (preview)&lt;/td&gt;
&lt;td&gt;✅ Virtualized via XTable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MinIO AIStor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;⚠️ Storage‑level Optimization Only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Confluent TableFlow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;⚠️ Limited&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;⚠️ Fixed Snapshot Retention&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DataHub Iceberg Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;td&gt;S3 Only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;What This Means for You&lt;/h3&gt;
&lt;p&gt;If your architecture depends on multiple engines, the safest route is to choose a catalog that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implements the full Iceberg REST spec&lt;/li&gt;
&lt;li&gt;Allows both reads and writes from all compliant engines&lt;/li&gt;
&lt;li&gt;Avoids redirecting writes through proprietary services or SDKs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This isn’t just about standards, it’s about reducing long-term friction. The more interoperable your catalog, the easier it is to plug in new tools, migrate workloads, or share datasets across teams without rewriting pipelines or triggering lock-in.&lt;/p&gt;
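&lt;p&gt;In practice, &quot;full REST spec coverage&quot; means any compliant client can connect with the same handful of settings. Here is a minimal sketch using pyiceberg-style configuration; the endpoint, credential, and warehouse values are placeholders you would replace with your catalog&apos;s details.&lt;/p&gt;

```python
# Sketch: the configuration an external engine needs to talk to a
# REST-spec-compliant catalog, shown with pyiceberg-style settings.
# The uri, credential, and warehouse values are placeholders.
rest_config = {
    "type": "rest",
    "uri": "https://catalog.example.com/api/catalog",  # placeholder endpoint
    "credential": "client_id:client_secret",           # placeholder OAuth2 credential
    "warehouse": "my_warehouse",                       # placeholder warehouse name
}

# With pyiceberg installed, loading the catalog would look like:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("prod", **rest_config)
#   table = catalog.load_table("sales.orders")

print(sorted(rest_config))
```

&lt;p&gt;The point of a fully interoperable catalog is that this same configuration shape works whether the client is pyiceberg, Spark, Flink, or Trino.&lt;/p&gt;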
&lt;h3&gt;Architectural Patterns: Choosing the Right Iceberg Catalog for Your Stack&lt;/h3&gt;
&lt;p&gt;With a clear understanding of feature capabilities across commercial Iceberg catalogs, the next consideration is architectural alignment. How should teams select a catalog based on their engine stack, deployment model, and optimization philosophy?&lt;/p&gt;
&lt;p&gt;Here, we explore common deployment patterns and their implications:&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Single-Engine Simplicity&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Organizations standardized on one compute engine seeking high performance and low operational overhead.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;✅ &lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Seamless integration between compute and catalog.&lt;/li&gt;
&lt;li&gt;Native optimization features (e.g., OPTIMIZE TABLE, Z-Ordering).&lt;/li&gt;
&lt;li&gt;Simplified access control and performance tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;⚠️ &lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;May impose file format restrictions (e.g., Parquet-only).&lt;/li&gt;
&lt;li&gt;Optimization is tightly coupled to the engine, though if the REST spec is adhered to you can still build your own optimization pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; Dremio, Databricks Unity Catalog, AWS Glue (with managed compute).&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Multi-Engine Interop (Spark + Trino + Flink)&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Organizations with complex, multi-engine environments that require consistent metadata across tools and clouds.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Use the right engine for the right job (ETL, BI, ML).&lt;/li&gt;
&lt;li&gt;Maximize transactional openness via full IRC support.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Optimization is either manual or vendor-dependent.&lt;/li&gt;
&lt;li&gt;Catalog-neutral solutions may lack server-side performance tuning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; Dremio Enterprise Catalog, Snowflake Open Catalog. (Both based on Apache Polaris)&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Streaming-First Architectures&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Teams integrating real-time data from Kafka into the lakehouse for analytics or ML.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Stream-native catalog (e.g., Confluent TableFlow) materializes Kafka topics into Iceberg tables.&lt;/li&gt;
&lt;li&gt;Seamless schema registration and time-travel.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;No schema evolution.&lt;/li&gt;
&lt;li&gt;Limited optimization control (rigid snapshot retention).&lt;/li&gt;
&lt;li&gt;Often designed for read-heavy use cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; Confluent TableFlow, integrated with external catalogs for downstream processing.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Cloud-Embedded Storage Catalogs&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Teams deploying AI or analytics workloads in private/hybrid cloud environments.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Built-in REST Catalog support directly within storage (MinIO AIStor).&lt;/li&gt;
&lt;li&gt;Simplifies deployment: no separate metadata layer to run.&lt;/li&gt;
&lt;li&gt;High concurrency and transactional consistency at scale.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Tightly bound to object storage vendor.&lt;/li&gt;
&lt;li&gt;No native table optimization&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; MinIO AIStor (on-premise/private cloud), AWS S3 Tables (cloud-native equivalent).&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Governance-Led Architectures&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Enterprises prioritizing metadata lineage, compliance, and discovery.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Benefits:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Centralized metadata layer for observability and access management.&lt;/li&gt;
&lt;li&gt;Easy discovery and tracking across teams and tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Trade-offs:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;No native write capabilities (metadata-only catalog).&lt;/li&gt;
&lt;li&gt;Optimization must be handled by external systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Recommended Platforms:&lt;/strong&gt; DataHub Iceberg Catalog (OSS or Cloud), or an external catalog (Dremio Catalog, Apache Polaris) connected into DataHub.&lt;/p&gt;
&lt;p&gt;Each pattern has architectural trade-offs. Rather than seeking a perfect catalog, successful teams prioritize &lt;strong&gt;alignment with workflow needs&lt;/strong&gt;: engine independence, optimization automation, governance, or real-time ingestion. In some cases, hybrid strategies, like dual catalogs or catalog-neutral optimization overlays, provide the best of both worlds.&lt;/p&gt;
&lt;h2&gt;Optimization Strategy Trade-offs: Native, Manual, or Vendor-Neutral&lt;/h2&gt;
&lt;p&gt;Once an organization selects a catalog, the next major architectural decision is how to &lt;strong&gt;maintain and optimize Iceberg tables&lt;/strong&gt;. While the IRC standard guarantees transactional consistency, it says nothing about how tables should be optimized over time to preserve performance and control storage costs.&lt;/p&gt;
&lt;p&gt;Three primary approaches emerge:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Native Optimization (Catalog-Integrated Automation)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Many commercial catalogs offer built-in optimization features tightly coupled with their own compute engines. These include operations such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Compaction (file size tuning)&lt;/li&gt;
&lt;li&gt;Delete file rewriting&lt;/li&gt;
&lt;li&gt;Snapshot expiration&lt;/li&gt;
&lt;li&gt;Partition clustering&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Platforms like &lt;strong&gt;Dremio&lt;/strong&gt;, &lt;strong&gt;AWS Glue&lt;/strong&gt;, and &lt;strong&gt;Databricks&lt;/strong&gt; provide SQL-native or automated processes (e.g., &lt;code&gt;OPTIMIZE TABLE&lt;/code&gt;, auto-compaction) that manage these operations behind the scenes.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;✅ &lt;em&gt;Pros:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Zero setup—optimization is automatic or declarative.&lt;/li&gt;
&lt;li&gt;Built-in cost and performance tuning.&lt;/li&gt;
&lt;li&gt;Reduces engineering overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;⚠️ &lt;em&gt;Cons:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Usually catalog-bound.&lt;/li&gt;
&lt;li&gt;Often restricted to Parquet format.&lt;/li&gt;
&lt;li&gt;Switching catalogs later requires reengineering optimization logic.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Manual Optimization (Bring Your Own Engine)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Open-source Iceberg supports all required lifecycle management operations—compaction, snapshot cleanup, rewrite manifests—but leaves it up to users to implement these jobs using engines like &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Flink&lt;/strong&gt;, &lt;strong&gt;Apache Amoro&lt;/strong&gt; or &lt;strong&gt;Trino&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;✅ &lt;em&gt;Pros:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Total freedom—no vendor lock-in.&lt;/li&gt;
&lt;li&gt;Can be integrated into any data pipeline or orchestration framework (Airflow, dbt, Dagster).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;⚠️ &lt;em&gt;Cons:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires custom development and scheduling.&lt;/li&gt;
&lt;li&gt;Monitoring and tuning are the user&apos;s responsibility.&lt;/li&gt;
&lt;li&gt;Risk of misconfiguration or inconsistent maintenance across tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This model works well with catalogs like &lt;strong&gt;Apache Polaris&lt;/strong&gt;, &lt;strong&gt;OneLake&lt;/strong&gt;, or &lt;strong&gt;Snowflake Open Catalog&lt;/strong&gt;, which support R/W operations but do not enforce optimization strategies.&lt;/p&gt;
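&lt;p&gt;To make the DIY model concrete, here is a minimal sketch of the small-file selection logic at the core of a compaction job. The file list and the 128 MB target are illustrative; a real job would read file sizes from the table&apos;s &lt;code&gt;files&lt;/code&gt; metadata table.&lt;/p&gt;

```python
# Sketch of the compaction decision at the heart of a DIY maintenance job:
# find data files that fall below a target size so a rewrite pass can merge
# them. File names and the 128 MB target are illustrative.

TARGET_FILE_SIZE = 128 * 1024 * 1024  # 128 MB, a common Iceberg target

def compaction_candidates(files, target=TARGET_FILE_SIZE):
    """Return paths of files small enough that rewriting them is worthwhile."""
    return [path for path, size_bytes in files if target > size_bytes]

files = [
    ("part-001.parquet", 4 * 1024 * 1024),    # 4 MB   -- candidate
    ("part-002.parquet", 200 * 1024 * 1024),  # 200 MB -- leave alone
    ("part-003.parquet", 16 * 1024 * 1024),   # 16 MB  -- candidate
]
print(compaction_candidates(files))  # only the small files
```

&lt;p&gt;In a real pipeline, this selection step would feed a rewrite operation in Spark, Flink, or Trino, scheduled by Airflow, dbt, or Dagster.&lt;/p&gt;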
&lt;h3&gt;3. &lt;strong&gt;Catalog-Neutral Optimization Vendors (e.g., Ryft.io)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;A newer middle ground is emerging with vendors like &lt;strong&gt;Ryft.io&lt;/strong&gt;, which offer catalog-agnostic optimization as a service. These platforms connect to your existing Iceberg tables—via any REST-compliant catalog—and run automated optimization jobs externally.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;✅ &lt;em&gt;Pros:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Centralized, automated optimization regardless of catalog.&lt;/li&gt;
&lt;li&gt;Maintains interoperability and neutrality.&lt;/li&gt;
&lt;li&gt;Works across major cloud storage (e.g., S3, ADLS, GCS).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;⚠️ &lt;em&gt;Cons:&lt;/em&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Still a maturing category.&lt;/li&gt;
&lt;li&gt;Requires compatible storage (cloud object stores).&lt;/li&gt;
&lt;li&gt;Additional cost and integration complexity.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is particularly valuable in multi-engine or multi-catalog environments where optimization cannot be centrally enforced but must still be automated and reliable.&lt;/p&gt;
&lt;h3&gt;Summary: The Optimization Dilemma&lt;/h3&gt;
&lt;p&gt;There is no one-size-fits-all solution:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Primary Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native Optimization&lt;/td&gt;
&lt;td&gt;Simplicity, integrated platforms&lt;/td&gt;
&lt;td&gt;Vendor lock-in, format constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual (BYO Engine)&lt;/td&gt;
&lt;td&gt;Open source, full control&lt;/td&gt;
&lt;td&gt;Operational complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor-Neutral (Ryft)&lt;/td&gt;
&lt;td&gt;Multi-cloud &amp;amp; multi-engine ops&lt;/td&gt;
&lt;td&gt;Added service dependency, still emerging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Choosing an optimization strategy is not just about performance—it’s a decision about &lt;strong&gt;how much control you need&lt;/strong&gt;, &lt;strong&gt;how much complexity you can absorb&lt;/strong&gt;, and &lt;strong&gt;how much optionality you want to preserve&lt;/strong&gt; in your architecture.&lt;/p&gt;
&lt;h2&gt;Architectural Patterns for Balancing Optimization and Interoperability&lt;/h2&gt;
&lt;p&gt;As organizations adopt Apache Iceberg REST Catalogs (IRC) to decouple compute from metadata, a recurring challenge emerges: how to balance &lt;strong&gt;open interoperability&lt;/strong&gt; with the benefits of &lt;strong&gt;proprietary optimization&lt;/strong&gt;. No single approach satisfies every use case. Instead, data architects are increasingly designing &lt;strong&gt;hybrid strategies&lt;/strong&gt; that reflect the unique demands of their data workflows, regulatory environments, and performance SLAs.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Read-Only Catalogs Paired with External Optimization&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Some catalogs provide high-performance read access to Iceberg tables but restrict external writes via IRC. In these scenarios, organizations may:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maintain &lt;strong&gt;a separate write-optimized catalog&lt;/strong&gt; (e.g., Apache Polaris, Nessie, or Glue) for ingestion, transformation, and optimization.&lt;/li&gt;
&lt;li&gt;Expose tables to the read-optimized catalog &lt;strong&gt;after ingestion and optimization are complete&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Schedule synchronization jobs to ensure both catalogs reference consistent metadata snapshots.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This dual-catalog approach preserves the performance of engines with these restrictions while maintaining &lt;strong&gt;external transactional control&lt;/strong&gt; via a neutral or R/W-capable catalog.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Pros:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Best of both worlds: performance + flexibility.&lt;/li&gt;
&lt;li&gt;Avoids modifying data in restrictive environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Cons:&lt;/em&gt;
&lt;ul&gt;
&lt;li&gt;Adds metadata orchestration complexity.&lt;/li&gt;
&lt;li&gt;Difficult to manage at high scale without automation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
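&lt;p&gt;The synchronization step in the dual-catalog pattern above can be sketched as a simple snapshot comparison: advance the read-side catalog only when it lags the write side. The snapshot IDs and dictionaries here are stand-ins for real catalog clients.&lt;/p&gt;

```python
# Sketch of dual-catalog synchronization: the read-side catalog is
# re-pointed at the write-side catalog's current snapshot whenever they
# diverge. Snapshot IDs and the dict-based "catalogs" are illustrative.

def needs_sync(write_side_snapshot, read_side_snapshot):
    """True when the read catalog must be pointed at a newer snapshot."""
    return write_side_snapshot != read_side_snapshot

def sync(write_side, read_side, table):
    current = write_side[table]
    if needs_sync(current, read_side.get(table)):
        read_side[table] = current  # re-register the table's metadata pointer
        return True
    return False

write_catalog = {"sales.orders": "snap-0042"}
read_catalog = {"sales.orders": "snap-0041"}
print(sync(write_catalog, read_catalog, "sales.orders"))  # performs the sync
print(read_catalog["sales.orders"])
```

&lt;p&gt;A scheduled job running this check per table is the &quot;automation&quot; the trade-off list warns you will need at scale.&lt;/p&gt;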
&lt;h3&gt;2. &lt;strong&gt;Embedded Catalogs for Self-Managed Environments&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Solutions like &lt;strong&gt;MinIO AIStor&lt;/strong&gt; and &lt;strong&gt;Dremio Enterprise Catalog&lt;/strong&gt; take a radically different approach, embedding the IRC layer directly into the object store or, in Dremio&apos;s case, into the lakehouse platform itself. This creates a streamlined deployment architecture for &lt;strong&gt;private cloud, hybrid, or air-gapped&lt;/strong&gt; environments where full control is required.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enables transactional Iceberg workloads without deploying a separate metadata database.&lt;/li&gt;
&lt;li&gt;Suited for exascale, high-concurrency AI/ML pipelines.&lt;/li&gt;
&lt;li&gt;Can be used alongside external catalogs for metadata synchronization if needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This model is also increasingly relevant for regulated industries or enterprises seeking on-premise lakehouse designs with built-in metadata authority.&lt;/p&gt;
&lt;h4&gt;3. &lt;strong&gt;Virtualized Format Interop via Metadata Translation&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Microsoft OneLake&lt;/strong&gt;, using &lt;strong&gt;Apache XTable&lt;/strong&gt;, pioneers a virtualized metadata model. Instead of writing new Iceberg tables, XTable &lt;strong&gt;projects Iceberg-compatible metadata from Delta Lake&lt;/strong&gt; tables in OneLake.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;✅ Enables external Iceberg engines to query Delta-based data with no duplication.&lt;/li&gt;
&lt;li&gt;🔄 Metadata is derived dynamically, enabling near real-time interop.&lt;/li&gt;
&lt;li&gt;⚠️ Complex Iceberg-native features may be unsupported due to reliance on Delta primitives.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This architecture is ideal for organizations deeply committed to Delta Lake but wanting to provide &lt;strong&gt;Iceberg-compatible access&lt;/strong&gt; for federated analytics or open-source tools.&lt;/p&gt;
&lt;h3&gt;Architectural Takeaway: Mix and Match for Your Use Case&lt;/h3&gt;
&lt;p&gt;The modern Iceberg ecosystem isn’t about picking a single vendor. Instead, it’s about selecting interoperable components that align with your architecture&apos;s &lt;strong&gt;performance, governance, and flexibility goals&lt;/strong&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Catalog Strategy&lt;/th&gt;
&lt;th&gt;Optimization Path&lt;/th&gt;
&lt;th&gt;Interop Balance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud-native with automation&lt;/td&gt;
&lt;td&gt;AWS Glue, Dremio&lt;/td&gt;
&lt;td&gt;Native Automation&lt;/td&gt;
&lt;td&gt;High (if Parquet)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-engine, multi-cloud&lt;/td&gt;
&lt;td&gt;Dremio Catalog, Snowflake Open Catalog&lt;/td&gt;
&lt;td&gt;Built on OSS with Full Interop&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private/Hybrid cloud&lt;/td&gt;
&lt;td&gt;MinIO AIStor or Dremio Catalog&lt;/td&gt;
&lt;td&gt;Embedded in software for lakehouse storage or lakehouse engine&lt;/td&gt;
&lt;td&gt;Medium–High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stream → Lakehouse&lt;/td&gt;
&lt;td&gt;Confluent TableFlow&lt;/td&gt;
&lt;td&gt;Fixed strategy (snapshots)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delta → Iceberg bridge&lt;/td&gt;
&lt;td&gt;OneLake + XTable&lt;/td&gt;
&lt;td&gt;Virtualized sync&lt;/td&gt;
&lt;td&gt;High for reads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Designing an effective catalog strategy means embracing modularity—using REST interoperability as the glue while tailoring optimization and governance layers to the needs of your teams.&lt;/p&gt;
&lt;h3&gt;Conclusion: Choosing the Right Iceberg Catalog for Your Strategy&lt;/h3&gt;
&lt;p&gt;The Apache Iceberg REST Catalog ecosystem has matured into a diverse landscape of offerings—each with its own balance of &lt;strong&gt;interoperability&lt;/strong&gt;, &lt;strong&gt;optimization capability&lt;/strong&gt;, and &lt;strong&gt;vendor integration strategy&lt;/strong&gt;. From hyperscalers to open-source initiatives, every catalog presents unique strengths and trade-offs.&lt;/p&gt;
&lt;p&gt;At the heart of this evolution is a simple but profound architectural truth:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Compute and metadata must decouple—but performance, governance, and interoperability must still align.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h4&gt;🧠 Key Takeaways&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If performance and simplicity are your top priorities&lt;/strong&gt;, a native-optimization platform like Dremio, Databricks, or AWS Glue offers seamless, powerful lifecycle management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If complete control and flexibility across tools and clouds matter more&lt;/strong&gt;, choose a self-managed catalog like Apache Polaris and prepare to invest in your own optimization pipeline or use a neutral optimizer like Ryft.io (when on major cloud object storage) or use the OSS Apache Amoro.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you&apos;re locked into an analytics platform&lt;/strong&gt; like Snowflake or BigQuery, understand the implications of the differing level of Iceberg support on these platforms.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Among the platforms reviewed, Dremio strikes a rare balance, offering full Iceberg REST compatibility, native R/W support from any engine, and automated optimization, all without locking users into its compute layer.&lt;/p&gt;
&lt;p&gt;Unlike platforms that charge per API call or limit external writes, Dremio only charges for compute &lt;strong&gt;run through Dremio itself&lt;/strong&gt;, meaning you can leverage external engines freely while still benefiting from the platform’s integrated catalog.&lt;/p&gt;
&lt;p&gt;This model promotes &lt;strong&gt;interoperability and performance without compromise&lt;/strong&gt;, aligning with the core principles of the Iceberg Lakehouse architecture: open metadata, multi-engine flexibility, and governed performance.&lt;/p&gt;
&lt;h3&gt;Final Thought&lt;/h3&gt;
&lt;p&gt;The Iceberg REST Catalog isn’t just an API spec; it’s the foundation for a new kind of lakehouse: open, transactional, and cloud-agnostic. Your choice of catalog defines how far you can scale without friction.&lt;/p&gt;
&lt;p&gt;Choose wisely.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building a Universal Lakehouse Catalog - Beyond Iceberg Tables</title><link>https://iceberglakehouse.com/posts/2025-10-Building-Universal-Lakehouse-Catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-Building-Universal-Lakehouse-Catalog/</guid><description>
**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](ht...</description><pubDate>Fri, 17 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://open.spotify.com/show/2PRDrWVpgDvKxN6n1oUsJF?si=e1a55e628ce74a10&quot;&gt;I&apos;ll be recording a podcast episode on this topic, so subscribe so you don&apos;t miss it (also on iTunes and other directories)&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Apache Iceberg has done something few projects manage to pull off: it created a standard. Its table format and REST-based catalog interface made it possible for different engines to read, write, and govern the same data without breaking consistency. That’s a big deal. For the first time, organizations could mix and match engines while keeping one clean, transactional view of their data.&lt;/p&gt;
&lt;p&gt;But this success brings new expectations.&lt;/p&gt;
&lt;p&gt;As lakehouse adoption grows, teams want more than just Iceberg tables under one roof. They want to treat &lt;em&gt;all&lt;/em&gt; their datasets (raw Parquet files, streaming logs, external APIs, even other formats like Delta and Hudi) with the same consistency and governance. The problem? Today’s Iceberg catalogs don’t support that. They’re built for Iceberg tables only.&lt;/p&gt;
&lt;p&gt;So how do we move beyond that? How do we build a &lt;strong&gt;universal&lt;/strong&gt; lakehouse catalog that works across engines &lt;em&gt;and&lt;/em&gt; across formats?&lt;/p&gt;
&lt;p&gt;Let’s explore two possible paths and what’s still missing.&lt;/p&gt;
&lt;h2&gt;Iceberg’s Success: A Case Study in Standardization&lt;/h2&gt;
&lt;p&gt;To understand where catalogs could go next, it helps to look at what made Iceberg successful in the first place.&lt;/p&gt;
&lt;p&gt;Before Iceberg, working with data lakes was messy. You could store files in open formats like Parquet or ORC, but there was no clean way to manage schema changes, version history, or transactional consistency. Each engine had to implement its own logic, or worse, teams had to build brittle pipelines to fill in the gaps.&lt;/p&gt;
&lt;p&gt;Iceberg changed that. It introduced:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A table format that handles schema evolution, ACID transactions, and partitioning without sacrificing openness.&lt;/li&gt;
&lt;li&gt;A catalog interface that lets any engine discover tables and retrieve metadata in a consistent way.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These two specs, the table format and the REST catalog interface, created a plug-and-play model. Spark, Flink, Trino, Dremio, and others could all speak the same language. As a result, Iceberg became the neutral zone. No vendor lock-in, no hidden contracts.&lt;/p&gt;
&lt;p&gt;But that neutrality came with a scope: the Iceberg REST Catalog only tracks and governs &lt;strong&gt;Iceberg tables&lt;/strong&gt;. If your dataset isn’t an Iceberg table, there is no modern, open, interoperable standard for governing and accessing it. And that’s where the limitation begins.&lt;/p&gt;
&lt;h2&gt;The Problem: No Standards Beyond Iceberg&lt;/h2&gt;
&lt;p&gt;While Iceberg catalogs are tightly defined for Iceberg tables, some catalogs &lt;em&gt;do&lt;/em&gt; allow you to register other types of datasets, raw Parquet, Delta tables, external views, or even API-based data sources.&lt;/p&gt;
&lt;p&gt;But there’s a catch.&lt;/p&gt;
&lt;p&gt;Each catalog handles this differently. One might use a custom registration API, another might expose a metadata file format, and yet another might treat external sources as virtual tables with limited capabilities. The result is a patchwork of behavior:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Some tools can read those datasets.&lt;/li&gt;
&lt;li&gt;Some can&apos;t see them at all.&lt;/li&gt;
&lt;li&gt;Others behave inconsistently depending on the engine and the catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This makes interoperability fragile. What works in one engine may not work in another, even if they both support the same table format. Teams are left stitching together workarounds or writing custom integrations just to get basic access across systems.&lt;/p&gt;
&lt;p&gt;So what’s really missing here? A &lt;strong&gt;standard API&lt;/strong&gt; for non-Iceberg datasets. Something that defines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How to register a dataset that isn&apos;t an Iceberg table.&lt;/li&gt;
&lt;li&gt;How to describe its metadata (schema, location, stats).&lt;/li&gt;
&lt;li&gt;How to govern access across different engines.&lt;/li&gt;
&lt;/ul&gt;
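&lt;p&gt;To make that concrete, a registration payload under such a standard might look something like the sketch below. To be clear, no catalog implements this today; every field name here is invented purely for illustration of the three requirements above.&lt;/p&gt;

```python
# Purely hypothetical sketch of what registering a non-Iceberg dataset
# might look like under such a standard. No catalog implements this API;
# every field name here is invented for illustration only.
registration = {
    "name": "raw.clickstream",
    "format": "parquet",                     # could also be "delta", "hudi", ...
    "location": "s3://bucket/clickstream/",  # placeholder storage path
    "schema": [                              # metadata description
        {"name": "event_id", "type": "string"},
        {"name": "ts", "type": "timestamp"},
    ],
    "stats": {"record_count": 1000000},
    "access": {                              # cross-engine governance
        "read_roles": ["analyst"],
        "write_roles": ["pipeline"],
    },
}
print(sorted(registration))
```

&lt;p&gt;The shape matters less than the agreement: any engine that speaks the standard could discover, describe, and govern this dataset the same way, just as IRC does for Iceberg tables today.&lt;/p&gt;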
&lt;p&gt;The big question is: where should this standard come from, and what should it look like?&lt;/p&gt;
&lt;h2&gt;Where Should the Standard Come From?&lt;/h2&gt;
&lt;p&gt;This brings us to the real crossroads: if we need a standard API for universal lakehouse catalogs, where should it come from?&lt;/p&gt;
&lt;p&gt;There are a few possibilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Should it come from the Iceberg REST spec?&lt;/strong&gt;&lt;br&gt;
That would keep things in the same family and build on an existing community standard. But Iceberg’s current REST spec is tightly scoped around Iceberg tables, and expanding it to cover other data types would be a significant shift, potentially broadening the project beyond what the community is comfortable with.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Should it be defined inside a single catalog project like Polaris or Unity?&lt;/strong&gt;&lt;br&gt;
A vendor-backed project can move quickly, implement end-to-end features, and ship a working solution, but it can also become a source of lock-in. If an open-standard catalog dominates, it becomes the home of the API standard by default.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Is it acceptable if the spec starts with a vendor?&lt;/strong&gt;&lt;br&gt;
Maybe. If that vendor drives real adoption and the API is later opened up, it can evolve into a neutral standard. But it would need wide buy-in and careful governance to avoid becoming another moving target.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No matter how you look at it, there are really only two main paths forward:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;An implementation becomes the de facto standard.&lt;/strong&gt;&lt;br&gt;
One catalog (open source or commercial) builds enough momentum that its API becomes the standard, similar to how S3 became the API for object storage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A neutral API spec is created independently.&lt;/strong&gt;&lt;br&gt;
This would follow the Iceberg model, where the spec came first, then vendors and engines built around it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If history teaches us anything, it’s that vendor-driven standards can create long-term friction. S3 is a good example: it&apos;s ubiquitous, but it’s also tightly bound to a single provider’s roadmap, forcing everyone who implements the API to play a whack-a-mole catch-up game against changes they have no control over. That experience shaped how the industry approached table formats: this time, the community came together around Iceberg precisely to avoid that kind of lock-in and vendor catch-up.&lt;/p&gt;
&lt;p&gt;So whatever path we take toward universal cataloging, the smart money is on a &lt;strong&gt;community standard&lt;/strong&gt;. The only question is whether that standard comes from an existing implementation, or from a new, vendor-neutral spec that everyone agrees to follow.&lt;/p&gt;
&lt;h2&gt;Exploring the Implementation-First Path: Apache Polaris and Table Sources&lt;/h2&gt;
&lt;p&gt;If the path to a universal catalog starts with an implementation, Apache Polaris (incubating) is worth watching closely. Among the open catalog projects, Polaris stands out for two reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It&apos;s built as an open implementation of the Apache Iceberg REST Catalog spec.&lt;/li&gt;
&lt;li&gt;It&apos;s actively proposing new features to extend catalog support beyond Iceberg tables.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;While Polaris already supports Iceberg tables through the standard REST interface, it&apos;s exploring how to bring non-Iceberg datasets into the same catalog. This includes both structured file-based datasets like Parquet or JSON, and unstructured data like images, PDFs, or videos.&lt;/p&gt;
&lt;p&gt;Right now, Polaris includes a feature called &lt;strong&gt;Generic Tables&lt;/strong&gt;, but a more robust proposal called &lt;strong&gt;Table Sources&lt;/strong&gt; is under active discussion.&lt;/p&gt;
&lt;h3&gt;What Are Table Sources?&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://lists.apache.org/thread/652z1f1n2pgf3g2ow5y382wlrtnoqth0&quot;&gt;Discussion of this proposal on the Dev List&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Table Sources&lt;/strong&gt; are a proposed abstraction that lets Polaris register and govern external data that isn’t already an Iceberg table. Instead of forcing everything into the Iceberg format, Polaris acts as a bridge: mapping object storage locations to queryable tables using metadata services that live outside the catalog itself.&lt;/p&gt;
&lt;p&gt;Each &lt;strong&gt;Table Source&lt;/strong&gt; includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A name (used as the table identifier)&lt;/li&gt;
&lt;li&gt;A source type (structured data, unstructured objects, or Iceberg metadata)&lt;/li&gt;
&lt;li&gt;A configuration (like file format, storage location, credentials, filters, and refresh intervals)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Table Source&lt;/strong&gt;: Represents structured files like Parquet or JSON. These are registered read-only tables with metadata generated by an external service.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Object Table Source&lt;/strong&gt;: Describes unstructured data like videos or documents, exposing file metadata (size, path, modification time) in table format.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Iceberg Table Source&lt;/strong&gt;: Adapts metadata from existing Iceberg tables stored outside Polaris.&lt;/li&gt;
&lt;/ul&gt;
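&lt;p&gt;As a rough illustration, the registration for a Data Table Source might carry a configuration like the one below. The field names are assumptions based on the proposal’s description, not the actual Polaris API.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Illustrative Data Table Source configuration; field names are assumptions
# based on the proposal text, not the actual Polaris API.
data_table_source = {
    &apos;name&apos;: &apos;web_logs&apos;,
    &apos;source_type&apos;: &apos;data&apos;,   # &apos;data&apos;, &apos;object&apos;, or &apos;iceberg&apos;
    &apos;config&apos;: {
        &apos;file_format&apos;: &apos;parquet&apos;,
        &apos;location&apos;: &apos;s3://logs/web/&apos;,
        &apos;filters&apos;: [&apos;year=2025&apos;],
        &apos;refresh_interval_seconds&apos;: 3600,
    },
}
&lt;/code&gt;&lt;/pre&gt;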
&lt;p&gt;Polaris doesn’t scan or interpret these datasets directly. Instead, &lt;strong&gt;Source Services&lt;/strong&gt;, which are external processes, use the registered configurations to scan file systems, generate table metadata, and push it back to Polaris. This decouples the engine from the source and the catalog from the scanning logic.&lt;/p&gt;
&lt;p&gt;At query time, engines can interact with these registered tables using the same APIs as they would for Iceberg, even though the backing data may not follow Iceberg’s spec.&lt;/p&gt;
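&lt;p&gt;That division of labor between the catalog and a Source Service could be sketched like this; the function names and metadata shape are hypothetical stand-ins, not proposal APIs.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Hypothetical Source Service loop: scan storage, build metadata, push to catalog.
def scan_storage(location):
    # Stand-in for listing files in object storage.
    return [{&apos;path&apos;: location + &apos;part-0.parquet&apos;, &apos;size_bytes&apos;: 1024}]

def build_table_metadata(files):
    return {&apos;files&apos;: files, &apos;record_count_estimate&apos;: None}

def push_to_catalog(table_name, metadata):
    # Stand-in for an HTTP call back to the catalog.
    return {&apos;table&apos;: table_name, &apos;metadata&apos;: metadata, &apos;status&apos;: &apos;registered&apos;}

files = scan_storage(&apos;s3://logs/web/&apos;)
result = push_to_catalog(&apos;web_logs&apos;, build_table_metadata(files))
&lt;/code&gt;&lt;/pre&gt;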
&lt;h3&gt;Why This Matters&lt;/h3&gt;
&lt;p&gt;If adopted, the &lt;strong&gt;Table Source&lt;/strong&gt; feature could give Polaris a head start as the reference implementation for a broader catalog API. It defines a reusable contract for registering external data, managing its lifecycle, and governing access, all in a way that’s decoupled from specific engines or formats.&lt;/p&gt;
&lt;p&gt;But this also raises the bigger question: will other catalogs follow this model? Will engines adopt the same contract for recognizing external data? Or will each system continue to define its own rules?&lt;/p&gt;
&lt;p&gt;That tension, between an evolving implementation like Polaris and the desire for an extension to the REST Catalog API standard, sets the stage for what comes next in the catalog story.&lt;/p&gt;
&lt;h2&gt;The API-First Path: Extending the Iceberg REST Catalog Spec&lt;/h2&gt;
&lt;p&gt;Now let’s explore the other side of the equation: what if instead of extending a specific implementation, we expanded the &lt;strong&gt;Iceberg REST Catalog specification itself&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;This approach would focus on defining a &lt;strong&gt;neutral contract&lt;/strong&gt; that any catalog, Polaris, Unity, Glue, or others, could implement to support more than just Iceberg tables. Rather than focusing on what a specific system can do today, it asks: &lt;em&gt;what could a future REST catalog look like if it supported universal datasets by design?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;One of the most interesting signs of this potential is already in the spec: the &lt;strong&gt;Scan Planning Endpoint&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;What Is Scan Planning?&lt;/h3&gt;
&lt;p&gt;In the typical read path, an engine:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Requests a table from the catalog.&lt;/li&gt;
&lt;li&gt;The catalog responds with the metadata location.&lt;/li&gt;
&lt;li&gt;The engine reads the metadata files (manifests, snapshots, etc.) and plans which Parquet files to scan.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;But with the &lt;strong&gt;Scan Planning Endpoint&lt;/strong&gt;, the flow changes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The engine calls the endpoint directly.&lt;/li&gt;
&lt;li&gt;The catalog does the heavy lifting: it traverses the metadata, evaluates filters, and returns a &lt;strong&gt;list of data files&lt;/strong&gt; to scan.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This makes the engine’s job simpler if the catalog and engine support the endpoint. It no longer needs to understand Iceberg’s metadata structure. It just gets files to read.&lt;/p&gt;
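&lt;p&gt;In code, the contrast between the two read paths looks roughly like this. The request and response shapes are simplified assumptions; the actual endpoint also deals with plan tasks, delete files, and pagination, all omitted here.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Simplified contrast between engine-side and catalog-side scan planning.
# Shapes are assumptions; the real endpoint covers plan tasks, deletes, pagination.

def plan_scan_on_engine(manifest_entries, predicate):
    # Old path: the engine reads manifests itself and prunes files.
    return [e[&apos;path&apos;] for e in manifest_entries if predicate(e)]

def plan_scan_via_catalog(table, filter_expr):
    # New path: one call; the catalog returns the files to read.
    # Stand-in for a POST to the table&apos;s plan endpoint with a filter expression.
    return {&apos;file-scan-tasks&apos;: [{&apos;data-file&apos;: &apos;s3://bucket/orders/part-0.parquet&apos;}]}

plan = plan_scan_via_catalog(&apos;db.orders&apos;, &apos;order_date in 2025&apos;)
files_to_read = [t[&apos;data-file&apos;] for t in plan[&apos;file-scan-tasks&apos;]]
&lt;/code&gt;&lt;/pre&gt;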
&lt;h3&gt;Why This Matters for Universal Catalogs&lt;/h3&gt;
&lt;p&gt;By pushing scan planning into the catalog, the spec opens the door to something bigger:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The catalog could expose &lt;strong&gt;non-Iceberg&lt;/strong&gt; datasets, like Delta Lake, Hudi, or raw Parquet, and return scan plans for them.&lt;/li&gt;
&lt;li&gt;It could also &lt;strong&gt;cache metadata&lt;/strong&gt; in a relational database, avoiding repeated reads from object storage.&lt;/li&gt;
&lt;li&gt;Engines remain agnostic to metadata formats, they just scan files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is a fundamental shift: the &lt;strong&gt;catalog becomes the query planner for metadata&lt;/strong&gt;, not just a metadata store.&lt;/p&gt;
&lt;p&gt;But here’s the big catch: this currently only exists on the &lt;strong&gt;read&lt;/strong&gt; side.&lt;/p&gt;
&lt;p&gt;There’s no equivalent in the spec today for the &lt;strong&gt;write path&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;A Hypothetical Write-Side Extension&lt;/h3&gt;
&lt;p&gt;Imagine this: instead of asking the engine to write metadata files (as is required today), the engine submits a write payload to the catalog:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The namespace of the table&lt;/li&gt;
&lt;li&gt;The table type&lt;/li&gt;
&lt;li&gt;A list of new data files and associated summary statistics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The catalog could then:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Internally update its metadata, whether that’s JSON files, a manifest database, or some other format&lt;/li&gt;
&lt;li&gt;Enforce governance rules&lt;/li&gt;
&lt;li&gt;Trigger compaction or indexing tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this model, the catalog fully owns metadata management for both reads and writes. Engines don’t need to understand Iceberg’s internals, or any other format’s internals. They just write and read data and delegate everything else.&lt;/p&gt;
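&lt;p&gt;A commit request under this hypothetical write path might look like the following. Everything here is speculative, since no such endpoint exists in the spec today.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Speculative write-side commit payload; no such endpoint exists in the spec today.
commit_request = {
    &apos;namespace&apos;: [&apos;db&apos;],
    &apos;table&apos;: &apos;orders&apos;,
    &apos;table_type&apos;: &apos;parquet&apos;,   # could also be &apos;iceberg&apos;, &apos;delta&apos;, etc.
    &apos;added_files&apos;: [
        {
            &apos;path&apos;: &apos;s3://bucket/orders/part-0.parquet&apos;,
            &apos;record_count&apos;: 1000,
            &apos;file_size_bytes&apos;: 524288,
            &apos;column_stats&apos;: {&apos;total&apos;: {&apos;min&apos;: 1.5, &apos;max&apos;: 999.0}},
        }
    ],
}
# The catalog would validate this payload, update its own metadata representation,
# apply governance rules, and possibly queue compaction or indexing tasks.
&lt;/code&gt;&lt;/pre&gt;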
&lt;h3&gt;The Trade-Offs&lt;/h3&gt;
&lt;p&gt;This model is clean and powerful. It simplifies engine logic and opens the door for catalogs to support any file-based dataset. But it comes at a cost:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;catalog must be deeply optimized&lt;/strong&gt; to handle scan planning at scale.&lt;/li&gt;
&lt;li&gt;It must support high concurrency, incremental updates, and aggressive caching.&lt;/li&gt;
&lt;li&gt;Metadata operations become tightly coupled to catalog performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, this model places a lot more responsibility on the catalog itself. That’s not necessarily bad, but it changes the design expectations.&lt;/p&gt;
&lt;p&gt;Still, if the goal is to build a &lt;strong&gt;universal contract&lt;/strong&gt; for working with datasets across formats, pushing more of that logic into the catalog, via a standardized API that even the major cloud vendors follow, might be the path forward.&lt;/p&gt;
&lt;h2&gt;Comparing the Two Paths: Implementation vs. API Standard&lt;/h2&gt;
&lt;p&gt;Both the &lt;em&gt;Table Sources&lt;/em&gt; approach and the &lt;em&gt;Scan Planning API model&lt;/em&gt; offer ways to move beyond Iceberg-only catalogs. But they take fundamentally different routes. One starts by expanding what a specific catalog can do, and its approach becomes the standard only if that catalog itself becomes the standard. The other extends an API spec that is already an industry standard, albeit one with a narrower scope (standardizing transactions on Iceberg tables).&lt;/p&gt;
&lt;p&gt;Let’s weigh the trade-offs.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Flexibility and Expressiveness&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table Sources (Implementation-first)&lt;/strong&gt;&lt;br&gt;
✅ Easier to move quickly: as a younger project with a smaller community, Polaris can prototype and evolve features and reach consensus faster.&lt;br&gt;
✅ Can support structured and unstructured datasets with source-specific logic.&lt;br&gt;
✅ Avoids the lock-in of a vendor implementation becoming the standard, since Apache Polaris is an incubating Apache project anyone can deploy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scan Planning Extension (API-first)&lt;/strong&gt;&lt;br&gt;
✅ Treats all datasets as files with a metadata interface, so engines don’t need to know anything about the metadata format.&lt;br&gt;
✅ Opens the door for catalogs to expose Delta, Hudi, Paimon, or other sources using the same scan API.&lt;br&gt;
⚠️ Metadata management becomes much more complex for the catalog, especially for large tables or real-time use cases.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In both scenarios, there is still the question of whether a specific engine supports reading a given file or metadata format. Even so, in both cases the catalog can remain the central listing governing access to all lakehouse datasets.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Governance and Control&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table Sources&lt;/strong&gt;&lt;br&gt;
✅ Catalog remains the system of record and point of governance.&lt;br&gt;
✅ Supports configuration-based registration, access control, and credential vending.&lt;br&gt;
⚠️ Each source type needs its own metadata strategy, increasing maintenance complexity.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scan Planning + Write Delegation&lt;/strong&gt;&lt;br&gt;
✅ Centralizes all metadata handling, which could unify governance and simplify access rules.&lt;br&gt;
⚠️ Puts more strain on catalog durability, uptime, and scalability, since the catalog becomes a bigger bottleneck for both reads and writes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Ecosystem Alignment&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table Sources&lt;/strong&gt;&lt;br&gt;
✅ Works well for ecosystems already aligned around Polaris or compatible systems.&lt;br&gt;
⚠️ Other catalogs would need to implement Polaris-compatible logic to ensure portability. (We saw catalogs adopt the Iceberg REST spec as it became the standard, so there is precedent.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;REST Spec Extension&lt;/strong&gt;&lt;br&gt;
✅ Builds on a known spec (Iceberg REST), which already has buy-in across many vendors.&lt;br&gt;
✅ Keeps catalogs interchangeable if they adhere to the same read/write API contract.&lt;br&gt;
⚠️ Requires coordination and consensus across the community, which can slow down adoption.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Developer Experience&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table Sources&lt;/strong&gt;&lt;br&gt;
✅ Clear division of responsibility: catalog governs metadata, engines execute logic.&lt;br&gt;
✅ External services (source services) handle complexity and can evolve independently.&lt;br&gt;
⚠️ Requires more infrastructure components to be deployed and maintained.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;API Extensions&lt;/strong&gt;&lt;br&gt;
✅ Simplifies engine logic: engines just hand off files and scan what they’re told.&lt;br&gt;
⚠️ Catalog APIs become more complex and require tighter validation of inputs and outputs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;In practice, both paths have strengths and challenges. A hybrid model could even emerge: catalogs like Polaris could lead the way with working implementations, while the community formalizes an API spec based on what works.&lt;/p&gt;
&lt;p&gt;The real question isn’t which is “better”; it’s which path brings the most durable, portable, and scalable standard to life.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Intro to Apache Iceberg with Apache Polaris and Apache Spark</title><link>https://iceberglakehouse.com/posts/2025-10-intro-to-apache-iceberg-with-apache-polaris-and-apache-spark/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-intro-to-apache-iceberg-with-apache-polaris-and-apache-spark/</guid><description>
**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](ht...</description><pubDate>Thu, 16 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Modern analytics depend on flexibility. Teams want to query raw data with the same speed and reliability they expect from a warehouse. That goal led to the rise of the &lt;em&gt;data lakehouse&lt;/em&gt;, an architecture that unifies structured and unstructured data while supporting multiple compute engines.&lt;/p&gt;
&lt;p&gt;The lakehouse model removes silos by allowing data to live in open formats, accessible to tools like Spark, Trino, Dremio, and Flink. Interoperability becomes the foundation of this design: storage is separated from compute, and metadata lives in a shared catalog. Apache Iceberg sits at the center of this open ecosystem.&lt;/p&gt;
&lt;h2&gt;The Lakehouse and the Value of Interoperability&lt;/h2&gt;
&lt;p&gt;Traditional data systems often forced teams to choose between performance and openness. Data warehouses provided fast queries but required proprietary formats and vendor lock-in. Data lakes offered openness and low cost but lacked reliability and consistent schema management.&lt;/p&gt;
&lt;p&gt;The lakehouse combines both. It keeps data in object storage while using open table formats like Apache Iceberg to bring reliability, version control, and transactional guarantees. This allows multiple engines to read and write the same datasets without duplication.&lt;/p&gt;
&lt;p&gt;Interoperability is the key advantage. When organizations use open standards, they can build systems that evolve without re-platforming. Governance, lineage, and performance optimizations can be shared across tools, creating one consistent view of enterprise data.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg’s Role in the Lakehouse&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is the open table format that makes the lakehouse possible. It defines how large analytic tables are stored, versioned, and accessed in cloud or on-premises object storage. Iceberg tracks snapshots of data files, enabling ACID transactions, schema evolution, and time travel.&lt;/p&gt;
&lt;p&gt;Each Iceberg table is independent of any single compute engine. Spark, Dremio, Trino, and Flink can all operate on the same tables because the format defines a consistent API for reading and writing data. This makes Iceberg a shared foundation for analytics across the open data ecosystem.&lt;/p&gt;
&lt;p&gt;In practice, Iceberg replaces the old Hive Metastore model with a more scalable and flexible metadata structure. Tables are self-describing, and every change creates a new immutable snapshot. This design not only enables concurrency and rollback but also ensures that the same data can be reliably queried from different engines without conflict.&lt;/p&gt;
&lt;h2&gt;The Structure of an Apache Iceberg Table&lt;/h2&gt;
&lt;p&gt;An Apache Iceberg table is more than a collection of data files. It is a structured system that records every version of a dataset, allowing engines to read, write, and track changes with full transactional integrity. Understanding this structure helps explain how Iceberg enables features like time travel, schema evolution, and partition management.&lt;/p&gt;
&lt;p&gt;At the top level, each table has a &lt;strong&gt;metadata directory&lt;/strong&gt; that contains JSON files describing the current state of the table. These files point to &lt;strong&gt;snapshot metadata&lt;/strong&gt;, which lists all the data files that make up the current version. Each snapshot references one or more &lt;strong&gt;manifest lists&lt;/strong&gt;, and each manifest list points to multiple &lt;strong&gt;manifest files&lt;/strong&gt;. Manifest files contain the actual list of data files, typically Parquet, ORC, or Avro, along with partition information and statistics.&lt;/p&gt;
&lt;p&gt;Every time you insert, delete, or update data, Iceberg creates a new snapshot without rewriting the existing files. This immutable design ensures that multiple users and engines can safely interact with the same table at the same time. It also makes rollback and version tracking possible, since previous snapshots are always preserved until explicitly expired.&lt;/p&gt;
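&lt;p&gt;The hierarchy above can be modeled as a small tree in plain Python. This is a teaching sketch of the relationships, not Iceberg’s actual file layout or file names.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Teaching sketch of the Iceberg metadata hierarchy (not the actual file layout).
table_metadata = {
    &apos;current_snapshot_id&apos;: 2,
    &apos;snapshots&apos;: {
        1: {&apos;manifest_list&apos;: &apos;snap-1.avro&apos;},
        2: {&apos;manifest_list&apos;: &apos;snap-2.avro&apos;},
    },
}
manifest_lists = {&apos;snap-2.avro&apos;: [&apos;manifest-a.avro&apos;, &apos;manifest-b.avro&apos;]}
manifests = {
    &apos;manifest-a.avro&apos;: [&apos;part-0.parquet&apos;],
    &apos;manifest-b.avro&apos;: [&apos;part-1.parquet&apos;, &apos;part-2.parquet&apos;],
}

# Resolve the current snapshot down to its data files, the way a reader would:
current = table_metadata[&apos;snapshots&apos;][table_metadata[&apos;current_snapshot_id&apos;]]
data_files = [f for m in manifest_lists[current[&apos;manifest_list&apos;]] for f in manifests[m]]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A new write would add a snapshot 3 alongside the existing entries rather than rewriting them, which is what makes rollback and time travel cheap.&lt;/p&gt;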
&lt;p&gt;Iceberg also introduces a flexible approach to partitioning. Instead of static directories like in Hive, Iceberg uses &lt;strong&gt;partition transforms&lt;/strong&gt; that record logical rules, such as &lt;code&gt;bucket(8, id)&lt;/code&gt; or &lt;code&gt;months(order_date)&lt;/code&gt;, directly in metadata. This allows the table to manage partitions dynamically, improving query performance while keeping partitioning transparent to users.&lt;/p&gt;
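&lt;p&gt;Those transforms are just deterministic functions of column values. Here is a simplified illustration; note that Iceberg’s real &lt;code&gt;bucket&lt;/code&gt; transform uses a 32-bit Murmur3 hash, not Python’s built-in &lt;code&gt;hash&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import datetime

# Simplified partition transforms. months() counts months since 1970-01;
# Iceberg&apos;s real bucket() uses a 32-bit Murmur3 hash, not Python&apos;s hash().
def months_transform(d):
    return (d.year - 1970) * 12 + (d.month - 1)

def bucket_transform(n, value):
    return hash(value) % n   # illustration only

print(months_transform(datetime.date(2026, 3, 8)))   # 674
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the transform result is stored in metadata rather than encoded in directory names, changing the partition spec later doesn’t require rewriting old data.&lt;/p&gt;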
&lt;p&gt;Together, these components form a self-contained and versioned system that makes object storage behave like a transactional database. In the next section, you’ll set up an environment using Apache Polaris and Apache Spark to see how this structure works in practice.&lt;/p&gt;
&lt;h2&gt;Setting Up the Environment&lt;/h2&gt;
&lt;p&gt;To explore how Apache Iceberg works in practice, you’ll use a local setup that includes three components: &lt;strong&gt;Apache Polaris&lt;/strong&gt;, &lt;strong&gt;MinIO&lt;/strong&gt;, and &lt;strong&gt;Apache Spark&lt;/strong&gt;. Polaris will serve as the catalog that manages Iceberg metadata, MinIO will act as your S3-compatible storage system, and Spark will be your compute engine for creating and querying tables.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;catalog&lt;/strong&gt; in Iceberg defines where tables are stored and how their metadata is managed. It is responsible for keeping track of namespaces, table locations, and access control. Apache Polaris provides an open-source implementation of an Iceberg catalog that exposes a REST API for managing these operations. Polaris also adds governance features, authentication, roles, and permissions, making it more than just a metadata store.&lt;/p&gt;
&lt;p&gt;Within Polaris, users and services are represented as &lt;strong&gt;principals&lt;/strong&gt;, each with unique credentials that determine what they can access. You can assign roles and privileges to principals, giving them permission to create, update, or query catalogs and tables. This design allows multiple tools to share a single governed catalog while maintaining secure, fine-grained access.&lt;/p&gt;
&lt;h3&gt;Starting the Environment&lt;/h3&gt;
&lt;p&gt;Clone the quickstart repository and start the environment using Docker Compose:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/AlexMercedCoder/Apache-Polaris-Apache-Iceberg-Minio-Spark-Quickstart.git
cd Apache-Polaris-Apache-Iceberg-Minio-Spark-Quickstart
docker compose up -d
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will launch:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Polaris on port &lt;code&gt;8181&lt;/code&gt; (catalog API)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;MinIO on ports &lt;code&gt;9000&lt;/code&gt; and &lt;code&gt;9001&lt;/code&gt; (S3 and web console)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Spark with Jupyter Notebook on port &lt;code&gt;8888&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can verify that all containers are running with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once they’re up, open the Jupyter Notebook interface by visiting &lt;code&gt;http://localhost:8888&lt;/code&gt;. Create a new Python notebook and copy the contents of &lt;code&gt;bootstrap.py&lt;/code&gt; from the repository into a cell. Running this script will bootstrap Polaris by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Creating two catalogs—&lt;code&gt;lakehouse&lt;/code&gt; and &lt;code&gt;warehouse&lt;/code&gt;—that point to MinIO buckets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Defining a principal with access credentials.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Assigning roles and granting full permissions to that principal.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When the script completes, it prints a ready-to-use Spark configuration block with all the connection details. You’ll use that configuration in the next section to create and manage Iceberg tables through Polaris.&lt;/p&gt;
&lt;h2&gt;Creating Iceberg Tables&lt;/h2&gt;
&lt;p&gt;With Polaris bootstrapped and Spark connected, you’re ready to start working with Iceberg tables. The tables you create will live in the &lt;code&gt;polaris.db&lt;/code&gt; namespace, with their data stored in your MinIO buckets. All catalog and permission management will happen automatically through Polaris.&lt;/p&gt;
&lt;p&gt;Before you begin creating tables, make sure Spark is configured to connect to Polaris. When you ran &lt;code&gt;bootstrap.py&lt;/code&gt;, the script printed out a Spark configuration block similar to the example below. This block contains the packages, catalog URI, warehouse name, and your principal’s credentials. Copy this block into a cell in your Jupyter Notebook and run it to initialize your Spark session.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Spark configuration for catalog: lakehouse
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config(&amp;quot;spark.jars.packages&amp;quot;, &amp;quot;org.apache.polaris:polaris-spark-3.5_2.13:1.1.0-incubating,org.apache.iceberg:iceberg-aws-bundle:1.10.0,io.delta:delta-spark_2.12:3.3.1,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.spark_catalog&amp;quot;, &amp;quot;org.apache.spark.sql.delta.catalog.DeltaCatalog&amp;quot;)
    .config(&amp;quot;spark.sql.extensions&amp;quot;, &amp;quot;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris&amp;quot;, &amp;quot;org.apache.polaris.spark.SparkCatalog&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.uri&amp;quot;, &amp;quot;http://polaris:8181/api/catalog&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.warehouse&amp;quot;, &amp;quot;lakehouse&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.credential&amp;quot;, &amp;quot;{client_id}:{client_secret}&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.scope&amp;quot;, &amp;quot;PRINCIPAL_ROLE:ALL&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation&amp;quot;, &amp;quot;vended-credentials&amp;quot;)
    .config(&amp;quot;spark.sql.catalog.polaris.token-refresh-enabled&amp;quot;, &amp;quot;true&amp;quot;)
    .getOrCreate())

spark.sql(&amp;quot;CREATE NAMESPACE IF NOT EXISTS polaris.db&amp;quot;).show()
spark.sql(&amp;quot;CREATE TABLE IF NOT EXISTS polaris.db.example (name STRING)&amp;quot;).show()
spark.sql(&amp;quot;INSERT INTO polaris.db.example VALUES (&apos;example value&apos;)&amp;quot;).show()
spark.sql(&amp;quot;SELECT * FROM polaris.db.example&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;client_id&lt;/code&gt; and &lt;code&gt;client_secret&lt;/code&gt; values should be replaced with the credentials printed at the end of your bootstrap script run. Once the Spark session starts, you’ll be able to issue SQL commands directly against Polaris.&lt;/p&gt;
&lt;h3&gt;Creating a Basic Table&lt;/h3&gt;
&lt;p&gt;Start by setting your working namespace and creating a simple unpartitioned table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# eliminates the need to prefix table names with the namespace polaris.db
spark.sql(&amp;quot;USE polaris.db&amp;quot;)

spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE customers (
    id INT,
    name STRING,
    city STRING
)
USING iceberg
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a new Iceberg table tracked by Polaris. You can confirm its existence by listing all tables in the namespace:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SHOW TABLES IN polaris.db&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now open the MinIO console at &lt;code&gt;http://localhost:9001&lt;/code&gt; (log in with &lt;code&gt;admin&lt;/code&gt; / &lt;code&gt;password&lt;/code&gt;) and explore the lakehouse bucket—you’ll see a new folder structure created for your table. This directory contains the Parquet data files and the metadata that Polaris manages.&lt;/p&gt;
&lt;h3&gt;Partitioned Tables&lt;/h3&gt;
&lt;p&gt;Partitioning helps improve performance by organizing data into logical groups. Iceberg’s partition transforms let you define flexible strategies without depending on directory names.&lt;/p&gt;
&lt;p&gt;Partition by a single column:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE polaris.db.sales (
    sale_id INT,
    product STRING,
    quantity INT,
    city STRING
)
USING iceberg
PARTITIONED BY (city)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Partition by time:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE polaris.db.orders (
    order_id INT,
    customer_id INT,
    order_date DATE,
    total DECIMAL(10,2)
)
USING iceberg
PARTITIONED BY (months(order_date))
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Partition by hash buckets for even data distribution:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE TABLE polaris.db.transactions (
    txn_id BIGINT,
    user_id BIGINT,
    amount DOUBLE
)
USING iceberg
PARTITIONED BY (bucket(8, user_id))
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each strategy changes how Iceberg organizes data, but all are tracked as metadata—not directories—making future changes safe and reversible.&lt;/p&gt;
&lt;p&gt;After creating your tables, return to the MinIO console to explore the results. You’ll notice new directories and metadata files representing the structure of each table. These files are created and tracked automatically by Polaris, ensuring that every write, update, and schema change remains consistent across all engines that connect to the catalog.&lt;/p&gt;
&lt;h2&gt;Inserting Data&lt;/h2&gt;
&lt;p&gt;Once your tables are created, you can begin inserting and modifying data through Spark. Every write operation, whether it’s an insert, update, or delete, creates a new &lt;strong&gt;snapshot&lt;/strong&gt; in Iceberg. Each snapshot represents a consistent view of your table at a specific point in time and is recorded in Polaris’s metadata catalog.&lt;/p&gt;
&lt;p&gt;Start with a simple insert:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
INSERT INTO polaris.db.customers VALUES
(1, &apos;Alice&apos;, &apos;New York&apos;),
(2, &apos;Bob&apos;, &apos;Chicago&apos;),
(3, &apos;Carla&apos;, &apos;Boston&apos;)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After this insert, open the MinIO console and look inside the lakehouse bucket under &lt;code&gt;polaris/db/customers&lt;/code&gt;. You’ll see a new folder structure containing Parquet data files and Iceberg metadata files (&lt;code&gt;metadata.json&lt;/code&gt; files, manifest lists, and manifests). Each write creates new files rather than overwriting existing ones, which is how Iceberg maintains atomic transactions and rollback capabilities.&lt;/p&gt;
&lt;h3&gt;Inserting into Partitioned Tables&lt;/h3&gt;
&lt;p&gt;If you created partitioned tables earlier, Iceberg will automatically place data into the correct partitions based on your table definition:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
INSERT INTO polaris.db.sales VALUES
(101, &apos;Laptop&apos;, 5, &apos;New York&apos;),
(102, &apos;Tablet&apos;, 3, &apos;Boston&apos;),
(103, &apos;Phone&apos;, 7, &apos;Chicago&apos;)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To confirm partitioning, you can check MinIO. Each partition value (in this case, city) will have its own subdirectory. Iceberg manages these directories automatically through metadata, keeping partitioning invisible to end users.&lt;/p&gt;
&lt;h3&gt;Working with Larger Datasets&lt;/h3&gt;
&lt;p&gt;For larger datasets, you can also write directly from a DataFrame:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;data = [(201, &apos;Monitor&apos;, 2, &apos;Denver&apos;),
        (202, &apos;Keyboard&apos;, 10, &apos;Austin&apos;)]

df = spark.createDataFrame(data, [&apos;sale_id&apos;, &apos;product&apos;, &apos;quantity&apos;, &apos;city&apos;])
df.writeTo(&amp;quot;polaris.db.sales&amp;quot;).append()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This method is efficient for batch operations and ensures your Spark DataFrames integrate cleanly with Iceberg’s transaction system.&lt;/p&gt;
&lt;p&gt;Each time you perform a write, Polaris updates the catalog with a new snapshot ID. These snapshots allow you to query your table as it existed at any point in time, a capability you’ll explore later in the section on time travel.&lt;/p&gt;
&lt;p&gt;For now, review the lakehouse bucket in MinIO after each insert to see how Iceberg adds new Parquet and metadata files. Each transaction tells a story of how the table evolves over time, tracked and governed by Polaris.&lt;/p&gt;
&lt;h2&gt;Update, Delete, and Merge Into&lt;/h2&gt;
&lt;p&gt;Apache Iceberg provides full ACID transaction support, allowing you to update, delete, and merge data safely. Each of these operations creates a new snapshot while preserving older versions of the table, giving you consistent rollback and auditing capabilities. Polaris tracks these changes in its catalog so that every engine accessing the table sees a consistent state.&lt;/p&gt;
&lt;h3&gt;Updating Data&lt;/h3&gt;
&lt;p&gt;Use &lt;code&gt;UPDATE&lt;/code&gt; to modify existing records. For example, if one of your customers relocates:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
UPDATE polaris.db.customers
SET city = &apos;San Francisco&apos;
WHERE name = &apos;Alice&apos;
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This statement creates a new snapshot that replaces the affected rows with updated data. Iceberg performs this by rewriting only the data files that contain the changed rows, which keeps transactions efficient even at scale.&lt;/p&gt;
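&lt;p&gt;To make the copy-on-write idea concrete, here is a simplified, self-contained sketch in plain Python (illustrative only, not Iceberg’s implementation): only the files containing matching rows are replaced, while untouched files are carried forward unchanged.&lt;/p&gt;

```python
# Simplified sketch of copy-on-write (not Iceberg's implementation): a table
# is modeled as a list of immutable "files", each a list of row dicts.
def copy_on_write_update(data_files, predicate, apply_update):
    """Rewrite only the files that contain rows matching the predicate."""
    new_files = []
    for rows in data_files:
        if any(predicate(row) for row in rows):
            # Affected file: write a full replacement with updated rows.
            new_files.append([apply_update(r) if predicate(r) else r for r in rows])
        else:
            # Unaffected file: carried forward by reference, never rewritten.
            new_files.append(rows)
    return new_files

files = [
    [{"name": "Alice", "city": "New York"}],  # data file 1
    [{"name": "Carla", "city": "Boston"}],    # data file 2
]
updated = copy_on_write_update(
    files,
    predicate=lambda r: r["name"] == "Alice",
    apply_update=lambda r: {**r, "city": "San Francisco"},
)
# updated reuses file 2 untouched; only file 1 was rewritten.
```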
&lt;h3&gt;Deleting Data&lt;/h3&gt;
&lt;p&gt;You can delete records using a standard &lt;code&gt;DELETE&lt;/code&gt; statement:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
DELETE FROM polaris.db.customers
WHERE name = &apos;Bob&apos;
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After running this command, open the MinIO console and look at the customers directory in the lakehouse bucket. You’ll notice that new Parquet and metadata files have appeared; Iceberg never mutates existing files. Instead, it writes new ones and updates the catalog’s snapshot metadata through Polaris.&lt;/p&gt;
&lt;h3&gt;Merging Data (Upserts)&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;MERGE INTO&lt;/code&gt; command allows you to perform upserts, merging new records with existing data based on a matching key. This is especially useful when syncing incremental updates from another source.&lt;/p&gt;
&lt;p&gt;First, create a temporary table or view that holds your new data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CREATE OR REPLACE TEMP VIEW updates AS
SELECT 1 AS id, &apos;Alice&apos; AS name, &apos;Seattle&apos; AS city
UNION ALL
SELECT 4 AS id, &apos;Dana&apos; AS name, &apos;Austin&apos; AS city
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then merge it into your main table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
MERGE INTO polaris.db.customers AS target
USING updates AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET city = source.city
WHEN NOT MATCHED THEN INSERT *
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After the merge completes, Polaris will record a new snapshot in the catalog. You can query the &lt;code&gt;customers.history&lt;/code&gt; or &lt;code&gt;customers.snapshots&lt;/code&gt; metadata tables to see when and how the change occurred.&lt;/p&gt;
&lt;p&gt;Each of these operations, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, and &lt;code&gt;MERGE INTO&lt;/code&gt;, produces new files in MinIO and new snapshots in Polaris. This versioned structure ensures your tables remain fully auditable. Take a moment to check the lakehouse bucket again after running each command. You’ll see Iceberg’s design in action: immutable data files, evolving metadata, and transparent version control, all orchestrated through Polaris.&lt;/p&gt;
&lt;h2&gt;Altering Partition Scheme&lt;/h2&gt;
&lt;p&gt;Over time, your table’s partitioning strategy may need to change as data grows or query patterns evolve. Apache Iceberg allows you to alter partition schemes safely, without rewriting existing files. This flexibility is one of Iceberg’s biggest advantages over traditional data lake formats. All changes are tracked by Polaris, ensuring that the catalog always reflects the current partition structure.&lt;/p&gt;
&lt;p&gt;Suppose your &lt;code&gt;sales&lt;/code&gt; table is currently partitioned by city. If queries start filtering by &lt;code&gt;product&lt;/code&gt; instead, you can modify the table’s partitioning to better suit that use case. Start by dropping the old partition field:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
ALTER TABLE polaris.db.sales
DROP PARTITION FIELD city
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then add a new partition field:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
ALTER TABLE polaris.db.sales
ADD PARTITION FIELD bucket(8, product)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This change affects only future writes. Existing data remains organized by the previous partition scheme, while new records follow the new one. Iceberg’s metadata model keeps track of both versions, so queries continue to return complete results without manual migration.&lt;/p&gt;
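&lt;p&gt;A simplified model of why this works: each data file records the partition spec it was written under, so a scan applies a partition filter only where the spec supports it and keeps the rest. The structures below are illustrative stand-ins, not Iceberg’s internals:&lt;/p&gt;

```python
# Illustrative model only: each file carries the partition spec id it was
# written under, so old (spec 0: by city) and new (spec 1: by product bucket)
# layouts coexist in the same table.
files = [
    {"path": "f1.parquet", "spec_id": 0, "partition": {"city": "Boston"}},
    {"path": "f2.parquet", "spec_id": 1, "partition": {"product_bucket": 3}},
]

def scan(all_files, city=None):
    keep = []
    for f in all_files:
        # Prune by city only for files written under spec 0; spec 1 files no
        # longer carry a city partition value, so they must be scanned.
        if city is not None and f["spec_id"] == 0 and f["partition"].get("city") != city:
            continue
        keep.append(f)
    return keep
```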
&lt;p&gt;Note that Spark’s &lt;code&gt;SHOW PARTITIONS&lt;/code&gt; command is not supported for Iceberg tables, since partitioning is hidden metadata rather than a directory layout. To verify your table’s current partition spec, describe the table instead:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;DESCRIBE TABLE polaris.db.sales&amp;quot;).show(truncate=False)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also view the table’s current partitions and their statistics through the metadata tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.partitions&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After altering partition fields, try inserting new records and observe how Iceberg places them into new directories in MinIO. Open the lakehouse bucket in the MinIO console, navigate to your sales folder, and you’ll see both the old and new partition structures coexisting under the same table. Polaris ensures the catalog references all of them correctly.&lt;/p&gt;
&lt;p&gt;This feature makes partition evolution seamless. You can adapt to new data patterns or performance needs without downtime, data duplication, or complex ETL steps. In the next section, you’ll learn how to explore Iceberg’s built-in metadata tables and use time travel to query historical versions of your data.&lt;/p&gt;
&lt;h2&gt;Metadata Tables and Time Travel&lt;/h2&gt;
&lt;p&gt;Apache Iceberg doesn’t just store data—it stores the entire history of your data. Every write operation creates a new snapshot, and every snapshot is tracked in the table’s metadata. These metadata tables give you full visibility into how your data changes over time. Because Polaris manages the catalog, you can query these tables from any engine that connects to it, ensuring a unified and governed view of your data lifecycle.&lt;/p&gt;
&lt;h3&gt;Exploring Metadata Tables&lt;/h3&gt;
&lt;p&gt;Each Iceberg table automatically includes several metadata tables that you can query just like normal tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;history&lt;/strong&gt; – shows when snapshots were created.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;snapshots&lt;/strong&gt; – lists snapshot IDs and timestamps.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;files&lt;/strong&gt; – lists all data and manifest files in each snapshot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;manifests&lt;/strong&gt; – details how files are grouped and filtered.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can explore them with Spark SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.history&amp;quot;).show()
spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.snapshots&amp;quot;).show()
spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.files&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These tables reveal every version of your dataset, what files were written, when they were created, and by which operation. You can use this information for auditing, debugging, or optimizing table performance.&lt;/p&gt;
&lt;h3&gt;Querying Past Versions with Time Travel&lt;/h3&gt;
&lt;p&gt;Because Iceberg stores all historical snapshots, you can query data as it existed at a specific point in time. You can travel through time using either a snapshot ID or a timestamp.&lt;/p&gt;
&lt;p&gt;First, identify a snapshot ID from the snapshots table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SELECT snapshot_id, committed_at FROM polaris.db.sales.snapshots&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then query that version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.read.option(&amp;quot;snapshot-id&amp;quot;, &amp;quot;&amp;lt;snapshot_id&amp;gt;&amp;quot;).table(&amp;quot;polaris.db.sales&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, you can query the table as it existed at a given point in time. Note that the &lt;code&gt;as-of-timestamp&lt;/code&gt; option expects milliseconds since the Unix epoch, not an ISO date string:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# 1760097600000 ms = 2025-10-10T12:00:00 UTC
spark.read.option(&amp;quot;as-of-timestamp&amp;quot;, &amp;quot;1760097600000&amp;quot;).table(&amp;quot;polaris.db.sales&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
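&lt;p&gt;The DataFrameReader &lt;code&gt;as-of-timestamp&lt;/code&gt; option takes a timestamp in milliseconds since the Unix epoch. A small helper (the function name is illustrative) converts a UTC wall-clock time into that form:&lt;/p&gt;

```python
# Convert a UTC wall-clock time to epoch milliseconds, the format the
# "as-of-timestamp" read option expects.
from datetime import datetime, timezone

def to_epoch_millis(year, month, day, hour=0, minute=0, second=0):
    dt = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

millis = to_epoch_millis(2025, 10, 10, 12)  # 1760097600000
# spark.read.option("as-of-timestamp", str(millis)).table("polaris.db.sales").show()
```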
&lt;p&gt;This ability to reproduce historical states makes Iceberg ideal for debugging ETL processes, reproducing analytics, or auditing compliance-related datasets.&lt;/p&gt;
&lt;h3&gt;Seeing It in MinIO&lt;/h3&gt;
&lt;p&gt;Each time you insert, update, or delete data, Iceberg records a new snapshot. Open the lakehouse bucket in MinIO and navigate through your table directories—you’ll notice files under &lt;code&gt;metadata/&lt;/code&gt;: a new &lt;code&gt;metadata.json&lt;/code&gt;, plus manifest lists and manifests for each snapshot. Every change to your data produces new metadata and data files, which together describe the complete history of your table.&lt;/p&gt;
&lt;p&gt;Iceberg’s metadata and time travel capabilities, combined with Polaris’s catalog management, give you full traceability and reproducibility. In the next section, you’ll learn how to keep your tables healthy by compacting small files and expiring old snapshots.&lt;/p&gt;
&lt;h2&gt;Compaction and Snapshot Expiration&lt;/h2&gt;
&lt;p&gt;As you run inserts, updates, and merges, Iceberg continuously creates new data and metadata files. Over time, this can lead to many small files and obsolete snapshots. To maintain performance and control storage costs, Iceberg provides built-in maintenance operations for compaction and snapshot expiration. With Polaris managing the catalog, these optimizations remain consistent and trackable across all compute engines that access your tables.&lt;/p&gt;
&lt;h3&gt;Compacting Small Files&lt;/h3&gt;
&lt;p&gt;Small files are common in streaming or frequent batch ingestion workflows. Iceberg can merge them into fewer, larger files using the &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure. This reduces overhead during query planning and execution.&lt;/p&gt;
&lt;p&gt;Run the following command from Spark to compact your table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CALL polaris.system.rewrite_data_files(&apos;polaris.db.sales&apos;)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also target specific partitions or filter files by size:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CALL polaris.system.rewrite_data_files(
  table =&amp;gt; &apos;polaris.db.sales&apos;,
  options =&amp;gt; map(&apos;min-input-files&apos;, &apos;4&apos;, &apos;max-concurrent-file-group-rewrites&apos;, &apos;2&apos;)
)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After compaction, check your lakehouse bucket in MinIO. You’ll notice fewer Parquet files, each larger in size. Iceberg automatically updates manifests and metadata files so that queries continue to return accurate results with better performance.&lt;/p&gt;
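&lt;p&gt;Conceptually, a compaction pass bin-packs small files into groups that approach the target output size before rewriting each group as fewer, larger files. A simplified sketch (illustrative only, not Iceberg’s actual planner):&lt;/p&gt;

```python
# Simplified sketch (not Iceberg's actual planner): bin-pack small files into
# rewrite groups that approach a target output size.
def plan_rewrite_groups(file_sizes, target_bytes):
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target_bytes:
            groups.append(current)  # group is full: close it
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Ten 32 MB files with a 256 MB target collapse into two rewrite groups.
groups = plan_rewrite_groups([32_000_000] * 10, 256_000_000)
```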
&lt;h3&gt;Expiring Old Snapshots&lt;/h3&gt;
&lt;p&gt;Every Iceberg operation creates a snapshot. Over time, unused snapshots can accumulate, consuming metadata space and storage. Iceberg allows you to remove these safely using the expire_snapshots procedure.&lt;/p&gt;
&lt;p&gt;For example, to remove snapshots older than seven days:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CALL polaris.system.expire_snapshots(
  table =&amp;gt; &apos;polaris.db.sales&apos;,
  older_than =&amp;gt; TIMESTAMPADD(DAY, -7, CURRENT_TIMESTAMP)
)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also specify how many snapshots to retain regardless of age:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;&amp;quot;&amp;quot;
CALL polaris.system.expire_snapshots(
  table =&amp;gt; &apos;polaris.db.sales&apos;,
  retain_last =&amp;gt; 5
)
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Polaris automatically tracks the catalog state after expiration, ensuring that all compute engines accessing the table remain synchronized with the current set of snapshots.&lt;/p&gt;
&lt;h3&gt;Monitoring with Metadata Tables&lt;/h3&gt;
&lt;p&gt;After compaction or expiration, you can verify changes using the metadata tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.snapshots&amp;quot;).show()
spark.sql(&amp;quot;SELECT * FROM polaris.db.sales.manifests&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You’ll see fewer manifests and snapshots, confirming that Iceberg has reclaimed space and simplified query planning.&lt;/p&gt;
&lt;p&gt;Maintenance operations like compaction and snapshot expiration help keep your Iceberg tables fast and cost-efficient. Combined with Polaris’s centralized catalog, these operations stay consistent across all connected engines. Whether you’re using Spark, Dremio, Trino, or Flink, Polaris ensures a single source of truth for your Iceberg metadata, making performance optimization and governance effortless.&lt;/p&gt;
&lt;h2&gt;Writing Efficiently to Apache Iceberg with Spark&lt;/h2&gt;
&lt;p&gt;When working with Apache Iceberg tables in Spark, how you write data has a major impact on performance, metadata growth, and maintenance frequency. Iceberg is designed for incremental writes and schema evolution, but inefficient write patterns—like frequent small updates or poor partitioning—can lead to excessive snapshots and small files. By tuning Spark and table-level settings, you can reduce the need for costly compaction and keep your tables query-ready.&lt;/p&gt;
&lt;h3&gt;Optimize File Size and Shuffle Configuration&lt;/h3&gt;
&lt;p&gt;Each write produces data files that Spark generates in parallel tasks. If your partitions are too small or the number of shuffle tasks is too high, Spark creates many tiny files, increasing metadata overhead and slowing queries. To control this, adjust Spark’s shuffle and output configurations before writing:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.conf.set(&amp;quot;spark.sql.shuffle.partitions&amp;quot;, 8)
spark.conf.set(&amp;quot;spark.sql.files.maxRecordsPerFile&amp;quot;, 5_000_000)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These settings reduce the number of output files per job and encourage larger Parquet files (typically &lt;code&gt;128–512 MB&lt;/code&gt; each). You can also call &lt;code&gt;.coalesce()&lt;/code&gt; or &lt;code&gt;.repartition()&lt;/code&gt; before writes to further control file output:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;df.coalesce(8).writeTo(&amp;quot;polaris.db.sales&amp;quot;).append()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Balanced partitioning and file sizing keep your table fast and avoid unnecessary metadata bloat.&lt;/p&gt;
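&lt;p&gt;One way to choose the &lt;code&gt;coalesce()&lt;/code&gt; count is to estimate it from your dataset size and target file size, so each task writes roughly one well-sized file. A hedged sketch (the helper name and sizes are illustrative):&lt;/p&gt;

```python
# Estimate a coalesce() partition count so each Spark task writes roughly
# one target-sized Parquet file. Helper name and sizes are illustrative.
import math

def partitions_for_target(total_bytes, target_file_bytes=256 * 1024 * 1024):
    return max(1, math.ceil(total_bytes / target_file_bytes))

n = partitions_for_target(2 * 1024**3)  # ~2 GB of data -> 8 output files
# df.coalesce(n).writeTo("polaris.db.sales").append()
```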
&lt;h3&gt;Use Table Properties to Guide Iceberg Behavior&lt;/h3&gt;
&lt;p&gt;Iceberg provides table-level configuration options that influence how data is written, compacted, and validated. You can define them during table creation or later using &lt;code&gt;ALTER TABLE&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE polaris.db.sales (
  id BIGINT,
  region STRING,
  sale_date DATE,
  amount DOUBLE
)
USING iceberg
PARTITIONED BY (days(sale_date))
TBLPROPERTIES (
  &apos;write.target-file-size-bytes&apos;=&apos;268435456&apos;,  -- 256 MB target file size
  &apos;commit.manifest-merge.enabled&apos;=&apos;true&apos;,       -- reduces manifest churn
  &apos;write.distribution-mode&apos;=&apos;hash&apos;,             -- distributes data evenly
  &apos;write.merge.mode&apos;=&apos;copy-on-write&apos;            -- ensures clean updates
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also modify these settings later:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE polaris.db.sales SET TBLPROPERTIES (
  &apos;write.target-file-size-bytes&apos;=&apos;536870912&apos;  -- 512 MB
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Setting appropriate table properties ensures consistent behavior across all engines—Spark, Dremio, or Flink—that share your Polaris catalog.&lt;/p&gt;
&lt;h3&gt;Batch and Append Data Strategically&lt;/h3&gt;
&lt;p&gt;Each write in Iceberg creates a new snapshot. If your application writes too frequently (e.g., per record or small microbatch), metadata grows quickly and queries slow down. Instead, buffer data into larger batches before committing:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;batch_df.writeTo(&amp;quot;polaris.db.sales&amp;quot;).append()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you need streaming ingestion, tune the microbatch trigger interval and commit size. A five-minute trigger often balances latency and table stability better than writing every few seconds.&lt;/p&gt;
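&lt;p&gt;The batching idea can be sketched as a small buffer that commits once per batch instead of once per record. The class below is a hypothetical helper for illustration, not a Spark or Iceberg API:&lt;/p&gt;

```python
# Hypothetical buffering helper (not a Spark/Iceberg API): commit one append
# per batch instead of one snapshot per record.
class BatchWriter:
    def __init__(self, commit_fn, batch_size=10_000):
        self.commit_fn = commit_fn  # e.g. builds a DataFrame and appends it
        self.batch_size = batch_size
        self.buffer = []
        self.commits = 0

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.commit_fn(self.buffer)  # one Iceberg snapshot per flush
            self.commits += 1
            self.buffer = []

writer = BatchWriter(commit_fn=lambda rows: None, batch_size=3)
for i in range(7):
    writer.write(i)
writer.flush()  # 7 rows produce 3 commits (3 + 3 + 1) instead of 7
```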
&lt;p&gt;For update-heavy workloads, consider using Merge-Into operations periodically rather than constant row-level updates:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE INTO polaris.db.sales t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET amount = u.amount
WHEN NOT MATCHED THEN INSERT *
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This avoids snapshot sprawl and makes compaction less frequent.&lt;/p&gt;
&lt;h3&gt;Align Partitioning with Query Patterns&lt;/h3&gt;
&lt;p&gt;Good partitioning reduces the number of files scanned per query. Avoid partitioning by high-cardinality columns like &lt;code&gt;user_id&lt;/code&gt;. Instead, use transforms that group data efficiently:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE polaris.db.sales REPLACE PARTITION FIELD sale_date WITH days(sale_date)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or combine multiple transforms for balance:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE polaris.db.sales (
  id BIGINT,
  region STRING,
  sale_date DATE,
  amount DOUBLE
)
USING iceberg
PARTITIONED BY (bucket(8, region), days(sale_date))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These partitioning rules make pruning effective and improve both reads and writes.&lt;/p&gt;
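&lt;p&gt;The reason bucketing distributes data evenly is that the transform hashes each value before taking a modulo. The Iceberg spec uses a 32-bit Murmur3 hash; in the illustration below, Python’s built-in &lt;code&gt;hash()&lt;/code&gt; stands in for it, so bucket assignments will not match real tables:&lt;/p&gt;

```python
# Illustration of why bucketing spreads skewed keys evenly: hash first, then
# modulo. The Iceberg spec uses 32-bit Murmur3; Python's hash() stands in
# here, so these bucket numbers won't match a real Iceberg table.
def bucket(n, value):
    return hash(value) % n  # Python's % already yields a value in [0, n)

# Even a monotonically increasing key lands in every one of the 8 buckets.
buckets = {bucket(8, user_id) for user_id in range(1000)}
```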
&lt;h3&gt;Tune Commit and Validation Settings&lt;/h3&gt;
&lt;p&gt;For large write jobs, commit coordination and validation can also affect performance. Iceberg supports asynchronous manifest merging and snapshot cleanup to reduce contention:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE polaris.db.sales SET TBLPROPERTIES (
  &apos;commit.manifest-merge.enabled&apos;=&apos;true&apos;,
  &apos;commit.retry.num-retries&apos;=&apos;5&apos;,
  &apos;write.distribution-mode&apos;=&apos;hash&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These settings help large concurrent writers (for example, in Spark and Flink) commit safely to the same table without conflicts.&lt;/p&gt;
&lt;p&gt;Efficient Iceberg write patterns come from tuning Spark and table properties together. Use larger file targets, consistent partitioning, and controlled batch sizes to minimize small files and snapshot churn. By applying these strategies, your Iceberg tables will stay lean and performant—reducing the need for manual compaction or cleanup. Combined with Apache Polaris, your catalog enforces consistent governance, authentication, and metadata management across every compute engine in your lakehouse.&lt;/p&gt;
&lt;h2&gt;Understanding How Polaris Manages Your Iceberg Tables&lt;/h2&gt;
&lt;p&gt;Once you have optimized your write strategy, it’s worth understanding what happens behind the scenes when you write data into Iceberg tables through Apache Polaris. Polaris acts as a centralized catalog—responsible for managing all metadata about your tables, snapshots, and permissions—ensuring that every write or read operation is consistent across tools like Spark, Dremio, Trino, and Flink.&lt;/p&gt;
&lt;p&gt;When Spark writes to an Iceberg table using Polaris, the process goes beyond simply saving files to MinIO or S3. Each commit updates a &lt;strong&gt;snapshot&lt;/strong&gt;—a precise record of table state including data files, manifests, and partition metadata. Polaris stores the metadata pointers, enforces ACID guarantees, and validates that every write operation maintains table consistency.&lt;/p&gt;
&lt;h3&gt;Coordinating Metadata and Commits&lt;/h3&gt;
&lt;p&gt;Each write to an Iceberg table involves several steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Spark writes data files (usually in Parquet format) to the storage layer, such as MinIO.&lt;/li&gt;
&lt;li&gt;Spark generates a manifest list describing these new data files.&lt;/li&gt;
&lt;li&gt;The Iceberg REST client, through Polaris, updates the catalog’s metadata location and commits the new snapshot.&lt;/li&gt;
&lt;li&gt;Polaris enforces isolation and conflict detection to ensure concurrent writers don’t overwrite each other’s work.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Because Polaris manages these metadata transactions centrally, it becomes the single source of truth for all engines. This makes cross-engine interoperability reliable—Spark can write data, and Dremio or Trino can query it immediately without any manual refresh.&lt;/p&gt;
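&lt;p&gt;The commit flow above can be sketched as an optimistic-concurrency loop: read the current table state, attempt an atomic swap of the metadata pointer, and retry on conflict. The class and method names below are illustrative stand-ins, not the Polaris API:&lt;/p&gt;

```python
# Illustrative optimistic-commit loop (names are stand-ins, not the Polaris
# API): the catalog accepts a commit only if the writer's view is current.
class CommitConflict(Exception):
    pass

class Catalog:
    """Stand-in for the catalog's atomic metadata-pointer swap."""
    def __init__(self):
        self.metadata_version = 0

    def commit(self, expected_version, new_snapshot):
        # Compare-and-swap: reject commits based on a stale table state.
        if expected_version != self.metadata_version:
            raise CommitConflict("table changed since this writer read it")
        self.metadata_version += 1  # new_snapshot becomes the current state
        return self.metadata_version

def write_with_retries(catalog, snapshot, max_retries=5):
    for _ in range(max_retries):
        base = catalog.metadata_version            # 1-2. write files, read state
        try:
            return catalog.commit(base, snapshot)  # 3. attempt the swap
        except CommitConflict:
            continue                               # 4. re-read and retry
    raise RuntimeError("commit failed after retries")

catalog = Catalog()
version = write_with_retries(catalog, snapshot="snap-1")
```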
&lt;h3&gt;Governance and Security&lt;/h3&gt;
&lt;p&gt;Polaris also introduces a security layer around Iceberg. Instead of embedding access keys or S3 credentials in your Spark jobs, Polaris can &lt;strong&gt;vend temporary credentials&lt;/strong&gt; that enforce fine-grained access control. Each principal and catalog role determines what operations are allowed, ensuring that users and jobs interact only with the tables they are permitted to modify or query.&lt;/p&gt;
&lt;p&gt;This approach decouples data governance from compute infrastructure. You can manage permissions, audit access, and rotate credentials—all directly through Polaris—while still using open data lakehouse standards like Apache Iceberg.&lt;/p&gt;
&lt;h3&gt;Automatic Table Optimization in Dremio&lt;/h3&gt;
&lt;p&gt;If you use Dremio’s integrated catalog (built on Polaris), you also gain automated table optimization. Dremio monitors data size, file counts, and snapshot churn, then automatically runs compaction and metadata cleanup as needed. It maintains your Iceberg tables in an optimized state without requiring manual Spark procedures.&lt;/p&gt;
&lt;p&gt;That means you can focus on analytics, while Dremio and Polaris handle governance, credential management, and metadata consistency across all your compute platforms.&lt;/p&gt;
&lt;p&gt;With this understanding, you now have a complete end-to-end view of how Apache Spark and Apache Polaris work together to maintain a modern, open lakehouse. From efficient write strategies to managed metadata and automated optimization, you can confidently scale your Iceberg data platform knowing it’s governed, interoperable, and future-proof.&lt;/p&gt;
&lt;h2&gt;Next Steps and Expanding Your Lakehouse&lt;/h2&gt;
&lt;p&gt;Now that you’ve successfully set up Apache Polaris with Spark and Iceberg on your local machine, you’ve built a foundation for exploring the broader lakehouse ecosystem. This environment not only lets you understand Iceberg’s core table mechanics but also shows how a catalog like Polaris centralizes governance, metadata, and access control—key components of an interoperable lakehouse architecture.&lt;/p&gt;
&lt;h3&gt;Connect More Compute Engines&lt;/h3&gt;
&lt;p&gt;Polaris is designed to work seamlessly across multiple compute engines. Once your Iceberg tables are registered in Polaris, you can connect tools such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; – Query and optimize Iceberg tables visually through its integrated Polaris-based catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trino&lt;/strong&gt; – Use Polaris as a REST-based catalog for federated queries across your data lake.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flink&lt;/strong&gt; – Stream data into Iceberg tables managed by Polaris for real-time analytics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckDB&lt;/strong&gt; or &lt;strong&gt;Python (PyIceberg)&lt;/strong&gt; – Interact directly with Iceberg tables for lightweight local exploration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these engines communicates through the same Polaris REST interface, ensuring that all metadata and access control remain consistent, no matter where you query from.&lt;/p&gt;
&lt;h3&gt;Experiment with Advanced Iceberg Features&lt;/h3&gt;
&lt;p&gt;Once you’re comfortable with the basics, try exploring Iceberg’s advanced capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt; – Add, rename, or delete columns without rewriting data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-Level Deletes&lt;/strong&gt; – Use deletion vectors for efficient, fine-grained record removal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table Branching and Tagging&lt;/strong&gt; – Experiment safely with data changes using versioned metadata.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshot Isolation&lt;/strong&gt; – Test concurrent writes to understand Iceberg’s transaction model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These features are fully tracked by Polaris, giving you a reliable, auditable history of every change.&lt;/p&gt;
&lt;h3&gt;Extend with Automation and Orchestration&lt;/h3&gt;
&lt;p&gt;You can also automate your setup and maintenance workflows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Airflow&lt;/strong&gt; or &lt;strong&gt;Cron&lt;/strong&gt; to run the &lt;code&gt;bootstrap.py&lt;/code&gt; script on a schedule, ensuring consistent initialization of catalogs and principals.&lt;/li&gt;
&lt;li&gt;Create periodic &lt;strong&gt;compaction&lt;/strong&gt; or &lt;strong&gt;snapshot expiration&lt;/strong&gt; jobs using Spark SQL.&lt;/li&gt;
&lt;li&gt;Deploy your Polaris setup in &lt;strong&gt;Kubernetes&lt;/strong&gt; using Helm or Docker Compose for multi-user testing environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Prepare for Cloud or Hybrid Deployment&lt;/h3&gt;
&lt;p&gt;The setup you’ve built locally with MinIO can easily extend to real cloud storage systems. Replace your MinIO endpoint with S3, GCS, or Azure Blob credentials, and Polaris will manage your Iceberg tables just as before—using the same metadata model and APIs.&lt;/p&gt;
&lt;p&gt;This local-to-cloud continuity is one of the greatest advantages of Iceberg and Polaris: your data architecture can scale from a personal laptop demo to a full production lakehouse without refactoring or vendor lock-in.&lt;/p&gt;
&lt;h3&gt;Wrapping Up&lt;/h3&gt;
&lt;p&gt;You’ve now seen how Apache Iceberg, Apache Polaris, and Apache Spark work together to form a robust, open lakehouse. Through this hands-on setup, you’ve learned how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Write and optimize Iceberg tables in Spark.&lt;/li&gt;
&lt;li&gt;Manage metadata, catalogs, and access through Polaris.&lt;/li&gt;
&lt;li&gt;Explore advanced Iceberg features safely and efficiently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For larger-scale deployments—or if you want automated optimization, integrated governance, and performance acceleration—explore &lt;strong&gt;Dremio’s Intelligent Lakehouse Platform&lt;/strong&gt;, which builds directly on Apache Polaris and Iceberg to deliver a unified, self-service analytics experience.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The State of Apache Iceberg v4 - October 2025 Edition</title><link>https://iceberglakehouse.com/posts/2025-10-apache-iceberg-v4-october-2025/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-10-apache-iceberg-v4-october-2025/</guid><description>
**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](ht...</description><pubDate>Tue, 14 Oct 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.puppygraph.com/ebooks/apache-iceberg-digest-vol-1&quot;&gt;The Apache Iceberg Digest: Vol. 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Community:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://osscommunity.com&quot;&gt;OSS Community Listings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.dremio.com&quot;&gt;Dremio Lakehouse Developer Hub&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;Apache Iceberg has come a long way since its early days of bringing reliable ACID transactions and schema evolution to the data lake. It helped teams move beyond brittle Hive tables and built the foundation for modern lakehouse architectures. But with wider adoption came new challenges—especially as workloads shifted from batch-heavy pipelines to streaming ingestion, faster commits, and more interactive use cases.&lt;/p&gt;
&lt;p&gt;That pressure has exposed some cracks in the foundation. Write-heavy applications hit metadata bottlenecks. Query planners struggle with inefficient stats. Teams managing large tables face complex migrations due to rigid path references.&lt;/p&gt;
&lt;p&gt;The Apache Iceberg community has responded with a set of focused, forward-looking proposals that make up the v4 specification. These aren’t just incremental tweaks. They represent a clear architectural shift toward scalability, operational simplicity, and real-time readiness.&lt;/p&gt;
&lt;p&gt;In this post, we’ll walk through the key features proposed for Iceberg v4, why they matter, and what they mean for data engineers, architects, and teams building at scale.&lt;/p&gt;
&lt;h2&gt;The New Iceberg Vision: Performance Meets Portability&lt;/h2&gt;
&lt;p&gt;Apache Iceberg was initially built for reliable batch analytics on cloud object storage. It solved core problems like schema evolution, snapshot isolation, and data consistency across distributed files. That foundation made it a favorite for building open data lakehouses.&lt;/p&gt;
&lt;p&gt;But today’s data platforms are evolving fast. Teams are mixing streaming and batch. Ingest rates are higher. Table sizes are bigger. Query expectations are more demanding. Managing metadata at scale has become one of the biggest friction points.&lt;/p&gt;
&lt;p&gt;The proposals in Iceberg v4 address these shifts head-on. Together, they aim to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduce write overhead&lt;/strong&gt; so commits scale with ingestion speed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improve query planning&lt;/strong&gt; by making metadata easier to scan and use&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simplify operations&lt;/strong&gt; like moving, cloning, or backing up tables&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, Iceberg is being re-tuned for modern workloads—ones that demand both speed and flexibility. The v4 changes aren’t just about performance. They’re about making Iceberg easier to run, easier to optimize, and better suited for the next generation of data systems.&lt;/p&gt;
&lt;h2&gt;Proposal 1: Single-File Commits – Cutting Down Metadata Overhead&lt;/h2&gt;
&lt;p&gt;Every commit to an Iceberg table today creates at least two new metadata files: one for the updated manifest list, and another for any changed manifests. In fast-moving environments—like streaming ingestion or micro-batch pipelines—this adds up quickly.&lt;/p&gt;
&lt;p&gt;The result? Write amplification. For every small data change, there’s a burst of I/O to update metadata. Over time, this leads to thousands of small metadata files, bloated storage, and a slowdown in commit throughput. Teams often have to schedule compaction jobs just to clean up the metadata.&lt;/p&gt;
&lt;p&gt;The v4 proposal introduces &lt;strong&gt;Single-File Commits&lt;/strong&gt;, a new way to consolidate all metadata changes into a single file per commit. This reduces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The number of file system operations per commit&lt;/li&gt;
&lt;li&gt;The coordination overhead for concurrent writers&lt;/li&gt;
&lt;li&gt;The need for frequent compaction&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By minimizing I/O and simplifying commit logic, this change unlocks faster ingestion and makes Iceberg friendlier to real-time workflows. It also means fewer moving parts to manage and fewer edge cases to debug in production.&lt;/p&gt;
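&lt;p&gt;A back-of-the-envelope model (illustrative Python, not Iceberg internals) makes the write amplification concrete:&lt;/p&gt;

```python
# Back-of-the-envelope model: count metadata objects written over many small
# commits. Today each commit writes at least two new metadata files (updated
# manifest list plus changed manifests); the v4 proposal consolidates all of
# a commit's metadata changes into a single file.

def metadata_files_written(commits, per_commit_today=2, per_commit_v4=1):
    """Total metadata objects created for a run of commits, under each model."""
    return commits * per_commit_today, commits * per_commit_v4

today, v4 = metadata_files_written(commits=10_000)
print(today, v4)  # 20000 10000 -- same data ingested, half the metadata churn
```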
&lt;h2&gt;Proposal 2: Parquet for Metadata – Smarter Query Planning&lt;/h2&gt;
&lt;p&gt;Today, Iceberg stores metadata files—like manifests and manifest lists—in &lt;strong&gt;Apache Avro&lt;/strong&gt;, a row-based format. While this made sense early on, it’s become a bottleneck for query performance.&lt;/p&gt;
&lt;p&gt;Why? Because most query engines don’t need every field in the metadata. For example, if a planner wants to filter files based on a column’s min and max values, it only needs that one field. But with Avro, it has to read and deserialize entire rows just to access a few columns.&lt;/p&gt;
&lt;p&gt;The proposed change in Iceberg v4 is to &lt;strong&gt;use Parquet instead of Avro&lt;/strong&gt; for metadata files. Since Parquet is a columnar format, engines can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read only the fields they need&lt;/li&gt;
&lt;li&gt;Skip over irrelevant parts of the file&lt;/li&gt;
&lt;li&gt;Load metadata faster and use less memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This shift isn’t just about speed—it enables smarter planning. Engines can project just the stats they care about, sort and filter more effectively, and better optimize execution plans. It’s a small architectural change with a big ripple effect across the query lifecycle.&lt;/p&gt;
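&lt;p&gt;A toy illustration of the difference, using simplified in-memory structures rather than real Avro or Parquet readers:&lt;/p&gt;

```python
# Toy model of row-oriented vs columnar metadata access. With row-oriented
# entries the planner materializes whole records just to read one stat; a
# columnar layout keeps each field contiguous so only that column is touched.

rows = [
    {"path": "data-%d.parquet" % i, "record_count": 1000, "min_x": i, "max_x": i + 9}
    for i in range(10_000)
]

def max_x_row_oriented(rows):
    return max(row["max_x"] for row in rows)   # deserializes every full entry

# Columnar view: all max_x values live together, independent of other fields.
columns = {"max_x": [row["max_x"] for row in rows]}

def max_x_columnar(columns):
    return max(columns["max_x"])               # reads exactly one column
```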
&lt;h2&gt;Proposal 3: Column Statistics Overhaul – Better Skipping, Smarter Queries&lt;/h2&gt;
&lt;p&gt;Metadata isn&apos;t just about file paths—it&apos;s also about understanding what’s inside each file. Iceberg uses column-level statistics to help query engines skip files that don’t match filter conditions. But the current stats format has limitations that hold back performance.&lt;/p&gt;
&lt;p&gt;Right now, statistics are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Flat and untyped, with no indication of data type&lt;/li&gt;
&lt;li&gt;Stored as generic key-value pairs&lt;/li&gt;
&lt;li&gt;Lacking detail on things like null counts or nested fields&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These gaps make it hard for query planners to fully optimize their logic. For example, it&apos;s difficult to distinguish between a missing value and a null, or to reason about nested data structures like structs and arrays.&lt;/p&gt;
&lt;p&gt;The v4 spec proposes a &lt;strong&gt;redesigned statistics format&lt;/strong&gt; with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type information for every stat&lt;/li&gt;
&lt;li&gt;Projectable structures for selective reads&lt;/li&gt;
&lt;li&gt;Support for more detailed metrics, including null counts and nested fields&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This richer structure enables more precise file pruning and better cost-based optimization. Engines can make smarter decisions about which files to read and which filters to push down—leading to faster queries, less I/O, and improved overall performance.&lt;/p&gt;
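&lt;p&gt;The pruning logic itself is simple; a minimal sketch (with invented file entries) shows the decision that richer, typed stats feed into:&lt;/p&gt;

```python
# Minimal sketch of stats-based file pruning: skip any data file whose
# [min, max] range for a column cannot contain rows matching the predicate.
# The file entries here are invented for illustration.

files = [
    {"path": "a.parquet", "min": 0,   "max": 99},
    {"path": "b.parquet", "min": 100, "max": 199},
    {"path": "c.parquet", "min": 200, "max": 299},
]

def prune(files, lo, hi):
    """Keep only files whose value range overlaps the filter range [lo, hi]."""
    return [f["path"] for f in files if f["max"] >= lo and hi >= f["min"]]

print(prune(files, 150, 260))  # only b.parquet and c.parquet are scanned
```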
&lt;h2&gt;Proposal 4: Relative Paths – Making Tables Portable Again&lt;/h2&gt;
&lt;p&gt;In current versions of Iceberg, metadata files store &lt;strong&gt;absolute file paths&lt;/strong&gt;. That might seem fine at first—until you try to move a table.&lt;/p&gt;
&lt;p&gt;If you change storage accounts, rename a bucket, or migrate between environments, every path in every metadata file becomes invalid. Fixing that means scanning and rewriting all metadata—an expensive, error-prone operation that often requires a distributed job.&lt;/p&gt;
&lt;p&gt;The v4 proposal introduces support for &lt;strong&gt;relative paths&lt;/strong&gt; in metadata. Instead of locking a table to a fixed storage location, file references are stored relative to a base URI defined in the table metadata.&lt;/p&gt;
&lt;p&gt;This change unlocks several real-world benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simpler migrations&lt;/strong&gt; across cloud regions or storage platforms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easier disaster recovery&lt;/strong&gt; with portable backups&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Less brittle operations&lt;/strong&gt; when storage configurations evolve&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Relative paths decouple the logical structure of a table from its physical location. That means fewer rewrites, less maintenance overhead, and more flexibility when managing Iceberg tables at scale.&lt;/p&gt;
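&lt;p&gt;The mechanics are easy to picture; this hypothetical resolver shows how a relocated table only needs a new base URI, not a metadata rewrite:&lt;/p&gt;

```python
# Sketch of relative-path resolution: file references are stored relative to a
# base location, and resolve against whatever base the table metadata declares.
# Paths and bucket names here are hypothetical.

def resolve(base_uri, relative_path):
    """Join a stored relative reference onto the table's current base URI."""
    return base_uri.rstrip("/") + "/" + relative_path

old = resolve("s3://prod-bucket/warehouse/sales", "data/part-0001.parquet")
new = resolve("s3://dr-bucket/backup/sales", "data/part-0001.parquet")
print(old)  # s3://prod-bucket/warehouse/sales/data/part-0001.parquet
print(new)  # same file reference, new home -- only the base changed
```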
&lt;h2&gt;Iceberg’s Direction: Toward Operational Simplicity&lt;/h2&gt;
&lt;p&gt;Taken together, these proposals reflect a clear shift in how the Iceberg community is thinking about the format—not just as a technical layer, but as an operational foundation for modern data platforms.&lt;/p&gt;
&lt;p&gt;Here’s what’s changing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;From batch-first to real-time ready&lt;/strong&gt;: Single-file commits and smarter stats make Iceberg more suitable for streaming ingestion and low-latency use cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;From fixed to flexible&lt;/strong&gt;: Relative paths reduce the coupling between metadata and storage, making operations like migration and backup less painful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;From rigid to optimized&lt;/strong&gt;: Moving to columnar metadata and richer statistics gives query engines more room to optimize without heavy lifting.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is Iceberg growing up.&lt;/p&gt;
&lt;p&gt;The format has always prioritized correctness and openness. Now it’s doubling down on speed, scalability, and ease of use—especially for teams managing hundreds or thousands of tables across dynamic environments.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re building AI pipelines, federated queries, or traditional dashboards, these changes aim to reduce the friction and complexity of working with large-scale tables. It’s about making Iceberg not just powerful, but practical.&lt;/p&gt;
&lt;h2&gt;A Glimpse Ahead: v4 in Context&lt;/h2&gt;
&lt;p&gt;Just a few months ago, Apache Iceberg v3 was approved—bringing meaningful improvements to the table format. That release introduced new data types, deletion vectors, and other enhancements that expanded what Iceberg can represent and how it supports evolving workloads.&lt;/p&gt;
&lt;p&gt;Right now, the ecosystem is heads-down implementing v3 features across engines, catalogs, and query layers. You’ll see more engines support features like row-level deletes and richer data modeling as v3 adoption matures.&lt;/p&gt;
&lt;p&gt;The proposals for v4 aren’t intended to replace that momentum—they build on it.&lt;/p&gt;
&lt;p&gt;Think of v3 as expanding what Iceberg can do; v4 focuses on how efficiently and cleanly it does it. These early discussions around v4 offer a forward-looking roadmap for how Iceberg will continue to evolve—toward higher throughput, better portability, and smarter query performance.&lt;/p&gt;
&lt;p&gt;While these changes are still in the design and discussion phase, they signal where Iceberg is heading. For data teams investing in the lakehouse stack today, it’s reassuring that the foundation will only get stronger over time.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Ultimate Guide to Open Table Formats - Iceberg, Delta Lake, Hudi, Paimon, and DuckLake</title><link>https://iceberglakehouse.com/posts/2025-09-ultimate-guide-to-open-table-formats/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-09-ultimate-guide-to-open-table-formats/</guid><description>
**Get Data Lakehouse Books:**
- [Apache Iceberg: The Definitive Guide](https://drmevn.fyi/tableformatblog)
- [Apache Polaris: The Definitive Guide](ht...</description><pubDate>Wed, 24 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/strong&gt;
&lt;strong&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Modern lakehouse stacks live or die by &lt;strong&gt;how&lt;/strong&gt; they manage tables on cheap, scalable object storage. That “how” is the job of &lt;strong&gt;open table formats&lt;/strong&gt;, the layer that turns piles of Parquet/ORC files into reliable, ACID-compliant &lt;strong&gt;tables&lt;/strong&gt; with schema evolution, time travel, and efficient query planning. If you’ve ever wrestled with brittle Hive tables, small-file explosions, or “append-only” lakes that can’t handle updates and deletes, you already know why this layer matters.&lt;/p&gt;
&lt;p&gt;In this guide, we’ll demystify the five formats you’re most likely to encounter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; - snapshot- and manifest-driven, engine-agnostic, fast for large-scale analytics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt; - transaction-log-based, deeply integrated with Spark/Databricks, strong batch/stream unification.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Hudi&lt;/strong&gt; - built for upserts, deletes, and incremental processing; flexible COW/MOR modes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Paimon&lt;/strong&gt; - streaming-first with an LSM-like design for high-velocity updates and near-real-time reads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake&lt;/strong&gt; - a fresh, catalog-centric approach that uses a relational database for metadata (SQL all the way down).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ll start beginner-friendly, clarifying &lt;strong&gt;what&lt;/strong&gt; a table format is and &lt;strong&gt;why&lt;/strong&gt; it’s essential, then progressively dive into expert-level topics: &lt;strong&gt;metadata internals&lt;/strong&gt; (snapshots, logs, manifests, LSM levels), &lt;strong&gt;row-level change strategies&lt;/strong&gt; (COW, MOR, delete vectors), &lt;strong&gt;performance trade-offs&lt;/strong&gt;, &lt;strong&gt;ecosystem support&lt;/strong&gt; (Spark, Flink, Trino/Presto, DuckDB, warehouses), and &lt;strong&gt;adoption trends&lt;/strong&gt; you should factor into your roadmap.&lt;/p&gt;
&lt;p&gt;By the end, you’ll have a practical mental model to choose the right format for your workloads, whether you’re optimizing petabyte-scale analytics, enabling near-real-time CDC, or simplifying your metadata layer for developer velocity.&lt;/p&gt;
&lt;h2&gt;Why Open Table Formats Exist&lt;/h2&gt;
&lt;p&gt;Before diving into each format, it’s worth understanding &lt;em&gt;why&lt;/em&gt; open table formats became necessary in the first place.&lt;/p&gt;
&lt;p&gt;Traditional data lakes, built on raw files like CSV, JSON, or Parquet, were cheap and scalable but brittle. They had no concept of &lt;strong&gt;transactions&lt;/strong&gt;, which meant if two jobs wrote data at the same time, you could easily end up with partial or corrupted results. Schema evolution was painful: renaming or reordering columns could break queries, and updating or deleting even a single row often meant rewriting entire partitions.&lt;/p&gt;
&lt;p&gt;Meanwhile, enterprises still needed &lt;strong&gt;database-like features&lt;/strong&gt; (updates, deletes, versioning, auditing) on their data lakes. That tension set the stage for open table formats. These formats layer &lt;strong&gt;metadata and transaction protocols&lt;/strong&gt; on top of files to give the data lake the brains of a database while keeping its open, flexible nature.&lt;/p&gt;
&lt;p&gt;In practice, open table formats deliver several critical capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions:&lt;/strong&gt; Ensure reliability for concurrent reads and writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution:&lt;/strong&gt; Add, drop, or rename fields without breaking downstream consumers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Travel:&lt;/strong&gt; Query data as it existed at a specific point in time for auditing or recovery.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient Queries:&lt;/strong&gt; Push down filters and prune partitions/files using metadata rather than scanning everything.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row-Level Mutations:&lt;/strong&gt; Support upserts, merges, and deletes on immutable storage layers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Engine Interoperability:&lt;/strong&gt; Enable the same table to be queried by Spark, Flink, Trino, Presto, DuckDB, warehouses, and more.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, table formats solve the “wild west of files” problem, turning data lakes into &lt;strong&gt;lakehouses&lt;/strong&gt; that balance scalability with structure. The differences among Iceberg, Delta, Hudi, Paimon, and DuckLake lie in &lt;em&gt;how&lt;/em&gt; they achieve this and &lt;em&gt;what trade-offs&lt;/em&gt; they make to optimize for batch, streaming, or simplicity.&lt;/p&gt;
&lt;p&gt;Next, we’ll walk through the &lt;strong&gt;history and evolution&lt;/strong&gt; of each format to see how these ideas took shape.&lt;/p&gt;
&lt;h2&gt;The Evolution of Open Table Formats&lt;/h2&gt;
&lt;p&gt;The journey of open table formats reflects the challenges companies faced as data lakes scaled from terabytes to petabytes. Each format emerged to solve specific pain points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Hudi (2016)&lt;/strong&gt; – Created at Uber to solve &lt;em&gt;freshness&lt;/em&gt; and &lt;em&gt;incremental ingestion&lt;/em&gt;. Hudi pioneered row-level upserts and deletes on data lakes, enabling near real-time pipelines on Hadoop-sized datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delta Lake (2017–2018)&lt;/strong&gt; – Developed by Databricks to unify &lt;em&gt;batch and streaming&lt;/em&gt; in Spark. Its transaction log design (_delta_log) gave data lakes database-like commits and time-travel capabilities, making it a cornerstone of the “lakehouse” concept.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Iceberg (2018)&lt;/strong&gt; – Born at Netflix to overcome Hive’s scalability and schema evolution limitations. Its snapshot/manifest-based metadata model provided atomic commits, partition evolution, and reliable time-travel at massive scale, quickly becoming an industry favorite.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Paimon (2022)&lt;/strong&gt; – Emerging from Alibaba’s Flink ecosystem, Paimon was built &lt;em&gt;streaming-first&lt;/em&gt;. Its LSM-tree design optimized for high-throughput upserts and continuous compaction, positioning it as a bridge between real-time CDC ingestion and analytics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DuckLake (2025)&lt;/strong&gt; – The newest entrant, introduced by the DuckDB/MotherDuck team. Instead of managing JSON or Avro metadata files, DuckLake stores all table metadata in a relational database. This catalog-centric design aims to simplify consistency, enable multi-table transactions, and drastically speed up query planning.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These formats represent &lt;strong&gt;waves of innovation&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First wave (Hudi, Delta): bringing database-style transactions and row-level changes to tables on the data lake.&lt;/li&gt;
&lt;li&gt;Second wave (Iceberg): focusing on batch reliability, schema evolution, and interoperability.&lt;/li&gt;
&lt;li&gt;Third wave (Paimon, DuckLake): rethinking the architecture for real-time data and metadata simplicity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Next, we’ll dive into &lt;strong&gt;Apache Iceberg&lt;/strong&gt; in detail, its metadata structure, features, and why it has become the default choice for many modern lakehouse deployments.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg: The Batch-First Powerhouse&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
Apache Iceberg was born at Netflix in 2018 and donated to the Apache Software Foundation in 2019. Its mission was clear: fix the long-standing problems of Hive tables (unreliable schema changes, expensive directory scans, and lack of true atomicity). Iceberg introduced a clean-slate design that scaled to petabytes while guaranteeing &lt;strong&gt;ACID transactions&lt;/strong&gt;, &lt;strong&gt;schema evolution&lt;/strong&gt;, and &lt;strong&gt;time-travel queries&lt;/strong&gt;.
&lt;p&gt;&lt;strong&gt;Metadata Structure&lt;/strong&gt;&lt;br&gt;
Iceberg’s metadata model is built on a hierarchy of files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Table metadata file (JSON):&lt;/strong&gt; tracks schema versions, partition specs, snapshots, and properties.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots:&lt;/strong&gt; each commit creates a new snapshot, representing the table’s full state at that point in time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifest lists &amp;amp; manifests (Avro):&lt;/strong&gt; hierarchical indexes of data files, enabling partition pruning and column-level stats without scanning entire directories.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This design avoids reliance on directory listings, making planning queries over millions of files feasible.&lt;/p&gt;
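&lt;p&gt;A toy walk of that hierarchy (structures heavily simplified) shows how both planning and time travel fall out of it:&lt;/p&gt;

```python
# Toy walk of Iceberg's metadata hierarchy: table metadata file, then current
# snapshot, then manifest list, then manifests, then data files -- no storage
# directory listings anywhere. Structures and paths are simplified inventions.

table_metadata = {
    "current-snapshot-id": 2,
    "snapshots": {
        1: {"manifest-list": ["m1"]},
        2: {"manifest-list": ["m1", "m2"]},
    },
}
manifests = {
    "m1": ["s3://bkt/t/data/f1.parquet", "s3://bkt/t/data/f2.parquet"],
    "m2": ["s3://bkt/t/data/f3.parquet"],
}

def plan_files(meta, manifests, snapshot_id=None):
    """Resolve the data files visible in a snapshot (pass an id to time travel)."""
    sid = snapshot_id if snapshot_id is not None else meta["current-snapshot-id"]
    paths = []
    for m in meta["snapshots"][sid]["manifest-list"]:
        paths.extend(manifests[m])
    return paths

print(plan_files(table_metadata, manifests))                 # current state: 3 files
print(plan_files(table_metadata, manifests, snapshot_id=1))  # time travel: 2 files
```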
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution:&lt;/strong&gt; Add, drop, or rename columns without breaking queries, thanks to internal column IDs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Evolution:&lt;/strong&gt; Change partitioning strategies (e.g., switch from daily to hourly partitions) without rewriting historical data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Travel:&lt;/strong&gt; Query the table as of a specific snapshot ID or timestamp.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hidden Partitioning:&lt;/strong&gt; Abstracts partition logic from users while still enabling efficient pruning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimistic Concurrency:&lt;/strong&gt; Writers atomically commit new snapshots, with conflict detection to prevent corruption.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
Initially copy-on-write, Iceberg now also supports &lt;strong&gt;delete files&lt;/strong&gt; for merge-on-read semantics. Deletes can be tracked separately and applied at read time, reducing write amplification for frequent updates. Background compaction later consolidates these into optimized Parquet files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;br&gt;
Iceberg’s neutrality and technical strengths have driven broad adoption. It is supported in:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Engines:&lt;/strong&gt; Spark, Flink, Trino, Presto, Hive, Impala, DuckDB.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cloud platforms:&lt;/strong&gt; AWS Athena, AWS Glue, Snowflake, BigQuery, Dremio, and more.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalogs:&lt;/strong&gt; Hive Metastore, AWS Glue, Apache Nessie, Polaris.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By late 2024, Iceberg had become the &lt;strong&gt;de facto industry standard&lt;/strong&gt; for open table formats, with adoption by Netflix, Apple, LinkedIn, Adobe, and major cloud vendors. Its community-driven governance and rapid innovation ensure it continues to evolve; recent features like &lt;strong&gt;row-level delete vectors&lt;/strong&gt; and &lt;strong&gt;REST catalogs&lt;/strong&gt; are making it even more capable.&lt;/p&gt;
&lt;p&gt;Next, we’ll look at &lt;strong&gt;Delta Lake&lt;/strong&gt;, the transaction-log–driven format that became the backbone of Databricks’ lakehouse vision.&lt;/p&gt;
&lt;h2&gt;Delta Lake: The Transaction-Log Approach&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
Delta Lake was introduced by Databricks around 2017–2018 to address Spark’s biggest gap: reliable transactions on cloud object storage. Open-sourced in 2019 under the Linux Foundation, Delta Lake became the backbone of Databricks’ &lt;strong&gt;lakehouse&lt;/strong&gt; pitch, combining data warehouse reliability with the scalability of data lakes. Its design centered on a simple but powerful idea: use a &lt;strong&gt;transaction log&lt;/strong&gt; to coordinate all changes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata Structure&lt;/strong&gt;&lt;br&gt;
At the core of every Delta table is the &lt;code&gt;_delta_log&lt;/code&gt; directory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;JSON transaction files:&lt;/strong&gt; Each commit appends a JSON file describing added/removed data files, schema changes, and table properties.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Checkpoints (Parquet):&lt;/strong&gt; Periodic checkpoints compact the log for faster reads, storing the authoritative list of active files at a given version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioning:&lt;/strong&gt; Every commit is versioned sequentially, making time-travel queries straightforward (&lt;code&gt;VERSION AS OF&lt;/code&gt; or &lt;code&gt;TIMESTAMP AS OF&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This log-based design is simple and easy to reconstruct: replay JSON logs from the last checkpoint to reach the latest state.&lt;/p&gt;
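&lt;p&gt;A minimal sketch of that replay logic, using invented file names and simplified actions:&lt;/p&gt;

```python
# Sketch of _delta_log replay: start from the last checkpoint's file set, then
# apply each JSON commit's add/remove actions in order to reach the live state.
# File names and the action shape are simplified for illustration.

checkpoint_files = {"part-0001.parquet", "part-0002.parquet"}

commits = [  # simplified actions from the JSON commits after the checkpoint
    [{"action": "add", "path": "part-0003.parquet"}],
    [{"action": "remove", "path": "part-0001.parquet"},
     {"action": "add", "path": "part-0004.parquet"}],
]

def replay(start, commits):
    """Reconstruct the set of active data files at the latest version."""
    files = set(start)
    for commit in commits:
        for action in commit:
            if action["action"] == "add":
                files.add(action["path"])
            else:
                files.discard(action["path"])
    return files

print(sorted(replay(checkpoint_files, commits)))
# ['part-0002.parquet', 'part-0003.parquet', 'part-0004.parquet']
```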
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions:&lt;/strong&gt; Ensures consistent reads and writes, even under concurrent Spark jobs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Enforcement &amp;amp; Evolution:&lt;/strong&gt; Protects against incompatible writes while allowing schema growth.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time Travel:&lt;/strong&gt; Query historical versions for auditing or rollback.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified Batch &amp;amp; Streaming:&lt;/strong&gt; Spark Structured Streaming and batch jobs can read/write the same Delta table, reducing architectural complexity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance Optimizations:&lt;/strong&gt; Features like Z-order clustering, data skipping, and caching improve query speed (especially in Databricks’ runtime).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Change Data Feed (CDF):&lt;/strong&gt; Exposes row-level changes between versions, useful for downstream syncs and CDC pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
Delta primarily uses &lt;strong&gt;copy-on-write&lt;/strong&gt;: updates and deletes rewrite entire Parquet files while marking old ones as removed in the log. This guarantees atomicity but can be expensive at scale. To mitigate, Delta introduced &lt;strong&gt;deletion vectors&lt;/strong&gt; (in newer releases), which track row deletions without rewriting whole files, closer to merge-on-read semantics. Upserts are supported via SQL &lt;code&gt;MERGE INTO&lt;/code&gt;, commonly used for database change capture workloads.&lt;/p&gt;
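&lt;p&gt;The deletion-vector read path can be sketched in a few lines (a toy model, not Delta&apos;s actual encoding):&lt;/p&gt;

```python
# Toy deletion-vector read path: instead of rewriting a Parquet file to delete
# rows, the writer records deleted row positions; readers mask those rows out
# at scan time. Row values and positions here are invented.

file_rows = ["alice", "bob", "carol", "dave"]
deletion_vector = {1, 3}   # row positions marked deleted (bob, dave)

def read_with_dv(rows, dv):
    """Return the live rows of a file after applying its deletion vector."""
    return [row for pos, row in enumerate(rows) if pos not in dv]

print(read_with_dv(file_rows, deletion_vector))  # ['alice', 'carol']
```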
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;br&gt;
Delta Lake is strongest in the &lt;strong&gt;Spark ecosystem&lt;/strong&gt; and is the default format in Databricks. It’s also supported by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Engines:&lt;/strong&gt; Spark (native), Flink, Trino/Presto (via connectors).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clouds:&lt;/strong&gt; AWS EMR, Azure Synapse, and some GCP services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Libraries:&lt;/strong&gt; Delta Standalone (Java), Delta Rust, and integrations for Python beyond Spark.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While its openness has improved since Delta 2.0, much of its adoption remains tied to Databricks. Still, Delta Lake is one of the most widely used formats in production, powering pipelines at thousands of organizations.&lt;/p&gt;
&lt;p&gt;Next, we’ll explore &lt;strong&gt;Apache Hudi&lt;/strong&gt;, the pioneer of incremental processing and near-real-time data lake ingestion.&lt;/p&gt;
&lt;h2&gt;Apache Hudi: The Incremental Pioneer&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
Apache Hudi (short for &lt;em&gt;Hadoop Upserts Deletes and Incrementals&lt;/em&gt;) was created at Uber in 2016 to solve a pressing challenge: keeping Hive tables up to date with fresh, continuously changing data. Uber needed a way to ingest ride updates, user changes, and event streams into their Hadoop data lake without waiting hours for batch jobs. Open-sourced in 2017 and donated to Apache in 2019, Hudi became the first widely adopted table format to support &lt;strong&gt;row-level upserts and deletes&lt;/strong&gt; directly on data lakes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata Structure&lt;/strong&gt;&lt;br&gt;
Hudi organizes tables around a &lt;strong&gt;commit timeline&lt;/strong&gt; stored in a &lt;code&gt;.hoodie&lt;/code&gt; directory:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Commit files:&lt;/strong&gt; Metadata describing which data files were added/removed at each commit.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;COW vs MOR modes:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Copy-on-Write (COW):&lt;/em&gt; Updates replace entire Parquet files, similar to Iceberg/Delta.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Merge-on-Read (MOR):&lt;/em&gt; Updates land in small Avro &lt;strong&gt;delta log files&lt;/strong&gt;, merged with base Parquet files at read time.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Indexes:&lt;/strong&gt; Bloom filters or hash indexes help locate records by primary key, making upserts efficient.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This dual-mode design gives engineers control over the trade-off between &lt;strong&gt;write latency&lt;/strong&gt; and &lt;strong&gt;read latency&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Upserts &amp;amp; Deletes by Key:&lt;/strong&gt; Guarantees a single latest record per primary key, ideal for CDC ingestion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Incremental Pulls:&lt;/strong&gt; Query only the rows changed since a given commit, enabling efficient downstream pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compaction:&lt;/strong&gt; Background jobs merge log files into larger Parquet files for query efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Savepoints &amp;amp; Rollbacks:&lt;/strong&gt; Manage table states explicitly, ensuring recovery from bad data loads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible Indexing:&lt;/strong&gt; Choose partitioned, global, or custom indexes to balance performance with storage cost.&lt;/li&gt;
&lt;/ul&gt;
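&lt;p&gt;The incremental-pull idea can be sketched like this (hypothetical records; real Hudi exposes this through its commit timeline and query types):&lt;/p&gt;

```python
# Sketch of an incremental pull: a downstream job consumes only the records
# whose commit instant is later than its last processed checkpoint, instead of
# rescanning the whole table. Records and instants are hypothetical.

records = [
    {"key": "k1", "commit_time": "20250901101500", "fare": 10},
    {"key": "k2", "commit_time": "20250901102000", "fare": 20},
    {"key": "k1", "commit_time": "20250901110000", "fare": 12},
]

def incremental_pull(records, since):
    """Return only rows committed strictly after the given instant."""
    return [r for r in records if r["commit_time"] > since]

changed = incremental_pull(records, since="20250901102000")
print(changed)  # just the later k1 update -- no full-table rescan
```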
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
Hudi was designed for this problem. In COW mode, updates rewrite files. In MOR mode, updates are appended as &lt;strong&gt;log blocks&lt;/strong&gt;, making them queryable almost immediately. Readers can choose:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Snapshot mode&lt;/em&gt; (base + logs for freshest data).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Read-optimized mode&lt;/em&gt; (compacted base files for speed).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Deletes are handled similarly, either as soft deletes in logs or hard deletes during compaction.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;br&gt;
Hudi integrates tightly with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Engines:&lt;/strong&gt; Spark (native datasource), Flink (growing support), Hive, Trino/Presto.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clouds:&lt;/strong&gt; AWS EMR and AWS Glue have built-in Hudi support, making it popular on S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming:&lt;/strong&gt; Confluent Kafka, Debezium, and Flink CDC can stream directly into Hudi tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While Iceberg and Delta now dominate conversations, Hudi remains a strong choice for &lt;strong&gt;near real-time ingestion and CDC use cases&lt;/strong&gt;, particularly in AWS-centric stacks. Its flexibility (COW vs MOR) and incremental consumption features make it especially valuable for pipelines that need &lt;strong&gt;fast data freshness without sacrificing reliability&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Next, we’ll examine &lt;strong&gt;Apache Paimon&lt;/strong&gt;, the streaming-first format that extends Hudi’s incremental vision with an LSM-tree architecture.&lt;/p&gt;
&lt;h2&gt;Apache Paimon: Streaming-First by Design&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
Apache Paimon began life as &lt;strong&gt;Flink Table Store&lt;/strong&gt; at Alibaba in 2022, targeting the need for continuous, real-time data ingestion directly into data lakes. It entered the Apache Incubator in 2023 under the name &lt;em&gt;Paimon&lt;/em&gt;. Unlike Iceberg or Delta, which started with batch analytics and later added streaming features, Paimon was &lt;em&gt;streaming-first&lt;/em&gt;. Its mission: make data lakes act like a materialized view that is always up to date.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata &amp;amp; Architecture&lt;/strong&gt;&lt;br&gt;
Paimon uses a &lt;strong&gt;Log-Structured Merge-tree (LSM) design&lt;/strong&gt; inspired by database internals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MemTables and flushes:&lt;/strong&gt; Incoming data is written to in-memory buffers, then flushed to small immutable files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-level compaction:&lt;/strong&gt; Files are continuously merged into larger sorted files in the background.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots:&lt;/strong&gt; Each compaction or commit produces a new snapshot, allowing both &lt;em&gt;batch queries&lt;/em&gt; and &lt;em&gt;streaming reads&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Primary-key awareness:&lt;/strong&gt; Tables can enforce keys and apply merge rules (e.g., last-write-wins or aggregate merges).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This architecture makes &lt;strong&gt;frequent row-level changes cheap&lt;/strong&gt; (append-only writes) while deferring heavy merges to compaction tasks.&lt;/p&gt;
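&lt;p&gt;The write path can be sketched as a toy LSM table in Python, purely illustrative of the idea rather than Paimon’s implementation: upserts buffer in a memtable, flushes emit immutable sorted runs, and compaction merges runs with last-write-wins semantics per primary key.&lt;/p&gt;

```python
# Illustrative sketch only: a toy LSM write path in the spirit of Paimon's
# design, not Paimon's actual implementation.

class ToyLSMTable:
    def __init__(self, flush_size=2):
        self.memtable = {}      # in-memory buffer keyed by primary key
        self.runs = []          # flushed immutable sorted runs, oldest first
        self.flush_size = flush_size

    def upsert(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) == self.flush_size:
            self.runs.append(sorted(self.memtable.items()))  # flush a run
            self.memtable = {}

    def compact(self):
        """Merge all runs into one; newer runs overwrite older ones."""
        merged = {}
        for run in self.runs:   # oldest to newest, so updates win
            merged.update(dict(run))
        self.runs = [sorted(merged.items())]

    def read(self, key):
        """Snapshot read: memtable first, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:
                if k == key:
                    return v
        return None

t = ToyLSMTable()
t.upsert("a", 1)
t.upsert("b", 2)    # memtable full: triggers a flush
t.upsert("a", 10)   # newer value for "a" sits in the memtable
t.compact()
print(t.read("a"))  # 10
```

&lt;p&gt;Writes stay cheap because they only touch the buffer; the expensive merging happens asynchronously in compaction, exactly the trade-off described above.&lt;/p&gt;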
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Upserts &amp;amp; Deletes:&lt;/strong&gt; Native support for continuous CDC ingestion with efficient row-level operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Merge Engines:&lt;/strong&gt; Configurable rules for handling key collisions (e.g., overwrite, aggregate, or log-append).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dual Read Modes:&lt;/strong&gt; Query as a static snapshot (batch) or as a change stream (streaming).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming/Batch Unification:&lt;/strong&gt; The same table can power batch analytics and real-time dashboards.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deletion Vectors:&lt;/strong&gt; Efficiently tracks row deletions without rewriting base files.&lt;/li&gt;
&lt;/ul&gt;
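&lt;p&gt;The merge-engine idea can be shown with two toy policies (the function names below are hypothetical, not Paimon configuration keys): one keeps the newest value per key, the other aggregates values per key.&lt;/p&gt;

```python
# Illustrative sketch only: toy stand-ins for merge-engine policies that
# decide what happens when the same primary key arrives more than once.

def merge_last_write_wins(rows):
    """Keep only the newest value per key (the deduplicating policy)."""
    merged = {}
    for key, value in rows:
        merged[key] = value
    return merged

def merge_aggregate_sum(rows):
    """Sum values per key, in the spirit of an aggregation merge engine."""
    merged = {}
    for key, value in rows:
        merged[key] = merged.get(key, 0) + value
    return merged

events = [("page_a", 1), ("page_b", 1), ("page_a", 1)]
print(merge_last_write_wins(events))  # {'page_a': 1, 'page_b': 1}
print(merge_aggregate_sum(events))    # {'page_a': 2, 'page_b': 1}
```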
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
Unlike Iceberg (copy-on-write by default, with delete files for merge-on-read) or Delta (copy-on-write plus deletion vectors), Paimon is natively &lt;strong&gt;merge-on-read&lt;/strong&gt;. Updates and deletes are appended as small log segments, queryable immediately. Background compaction gradually merges them into optimized columnar files. This makes Paimon highly efficient for &lt;strong&gt;high-velocity workloads&lt;/strong&gt; like IoT streams, CDC pipelines, or real-time leaderboards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;br&gt;
Paimon integrates tightly with &lt;strong&gt;Apache Flink&lt;/strong&gt;, where it feels like a natural extension of Flink SQL. It also has growing support for Spark, Hive, Trino/Presto, and OLAP systems like StarRocks and Doris. Adoption is strongest among teams building &lt;strong&gt;streaming lakehouses&lt;/strong&gt;, particularly those already invested in Flink. While younger than Iceberg or Delta, Paimon is rapidly attracting attention as organizations push for sub-minute data freshness.&lt;/p&gt;
&lt;p&gt;Next, we’ll turn to &lt;strong&gt;DuckLake&lt;/strong&gt;, the newest entrant that rethinks table metadata management by moving it entirely into SQL databases.&lt;/p&gt;
&lt;h2&gt;DuckLake: Metadata Reimagined with SQL&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background &amp;amp; Origins&lt;/strong&gt;&lt;br&gt;
DuckLake is the newest table format, introduced in 2025 by the DuckDB and MotherDuck teams. Unlike earlier formats that manage metadata with JSON logs or Avro manifests, DuckLake flips the script: it stores &lt;strong&gt;all table metadata in a relational SQL database&lt;/strong&gt;. This approach is inspired by how cloud warehouses like Snowflake and BigQuery already manage metadata internally, but DuckLake makes it open and interoperable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata &amp;amp; Architecture&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SQL Catalog:&lt;/strong&gt; Metadata such as snapshots, schemas, file lists, and statistics are persisted as ordinary relational tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transactions:&lt;/strong&gt; Updates to metadata happen through standard SQL transactions, ensuring strong ACID guarantees without relying on object-store semantics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Table Transactions:&lt;/strong&gt; Because it’s database-backed, DuckLake supports atomic operations across multiple tables, something file-based formats struggle with.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Storage:&lt;/strong&gt; Data remains in Parquet files on cloud or local storage; DuckLake replaces only the metadata layer with SQL.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This design dramatically reduces the complexity of planning queries (no manifest scanning), makes commits faster, and enables features like &lt;strong&gt;cross-table consistency&lt;/strong&gt; (also possible in Apache Iceberg when using the Nessie catalog).&lt;/p&gt;
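&lt;p&gt;A minimal sketch using SQLite as a stand-in catalog illustrates the idea; the schema and table names below are hypothetical, not DuckLake’s actual layout. The point is that snapshots and file lists are ordinary relational rows, so a commit touching two tables is one ACID transaction and query planning is plain SQL.&lt;/p&gt;

```python
# Illustrative sketch only: SQLite as a stand-in for a SQL-backed catalog.
# The schema here is invented for illustration, not DuckLake's real layout.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, table_name TEXT);
    CREATE TABLE data_files (snapshot_id INTEGER, path TEXT, row_count INTEGER);
""")

# One transaction commits new snapshots for two tables atomically.
with db:
    db.execute("INSERT INTO snapshots VALUES (1, 'orders')")
    db.execute("INSERT INTO snapshots VALUES (2, 'customers')")
    db.execute("INSERT INTO data_files VALUES (1, 's3://lake/orders/f1.parquet', 100)")
    db.execute("INSERT INTO data_files VALUES (2, 's3://lake/customers/f2.parquet', 50)")

# Query planning is just SQL: list the files for the latest 'orders' snapshot.
files = db.execute("""
    SELECT path FROM data_files
    WHERE snapshot_id = (SELECT MAX(snapshot_id) FROM snapshots
                         WHERE table_name = 'orders')
""").fetchall()
print(files)  # [('s3://lake/orders/f1.parquet',)]
```

&lt;p&gt;No manifest files are listed or parsed; the file list for a snapshot is one indexed lookup in the catalog database.&lt;/p&gt;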
&lt;p&gt;&lt;strong&gt;Core Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SQL-Native Metadata:&lt;/strong&gt; Easy to query, debug, or extend using plain SQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast Commits &amp;amp; Planning:&lt;/strong&gt; Small updates don’t require writing multiple manifest files, just SQL inserts/updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-Table Atomicity:&lt;/strong&gt; Multi-table changes commit together, a strength few other formats offer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Familiar Deployment:&lt;/strong&gt; The catalog can run on DuckDB, PostgreSQL, or any transactional SQL database.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Row-Level Changes&lt;/strong&gt;&lt;br&gt;
DuckLake handles updates and deletes via &lt;strong&gt;copy-on-write&lt;/strong&gt; on Parquet files, but the metadata transaction is nearly instantaneous. Row-level changes are coordinated by the SQL catalog, avoiding the latency and eventual consistency pitfalls of cloud storage–based logs. In effect, DuckLake behaves like Iceberg for data files but with much faster commit cycles.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem &amp;amp; Adoption&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Primary Engine:&lt;/strong&gt; DuckDB, via a DuckLake extension.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Potential Integrations:&lt;/strong&gt; Any SQL-aware engine could adopt DuckLake, since the catalog is just relational tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt; Analytics sandboxes, developer-friendly data apps, and teams seeking simplicity without deploying heavy metadata services.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As of 2025, DuckLake is still young but has sparked excitement by simplifying lakehouse architecture. It’s best seen as a complement to more mature formats, with particular appeal to DuckDB users and teams tired of managing complex metadata stacks.&lt;/p&gt;
&lt;p&gt;Next, we’ll step back and &lt;strong&gt;compare all five formats side by side&lt;/strong&gt;, looking at metadata design, row-level update strategies, ecosystem support, and adoption trends.&lt;/p&gt;
&lt;h2&gt;Comparing the Open Table Formats&lt;/h2&gt;
&lt;p&gt;Now that we’ve walked through each format individually, let’s compare them across the dimensions that matter most to data engineers and architects.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1. Metadata Architecture&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Hierarchical &lt;em&gt;snapshots + manifests&lt;/em&gt;. Excellent for pruning large datasets but metadata can be complex.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Sequential &lt;em&gt;transaction log&lt;/em&gt; (&lt;code&gt;_delta_log&lt;/code&gt;). Simple and efficient for versioning, but logs can grow large without checkpoints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; &lt;em&gt;Commit timeline&lt;/em&gt; with optional delta logs. Flexible but more operational overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; &lt;em&gt;LSM-tree style&lt;/em&gt; compaction with snapshots. Streaming-friendly and highly write-efficient.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake:&lt;/strong&gt; Metadata in a &lt;em&gt;SQL database&lt;/em&gt;. Simplifies commits and query planning, enables multi-table transactions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;2. Row-Level Changes&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Copy-on-write by default, with &lt;em&gt;delete files&lt;/em&gt; for merge-on-read.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Copy-on-write, plus &lt;em&gt;deletion vectors&lt;/em&gt; in newer versions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Dual modes: COW for read-optimized, MOR for low-latency upserts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; Always merge-on-read via &lt;em&gt;LSM-tree segments&lt;/em&gt;, optimized for frequent updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake:&lt;/strong&gt; Copy-on-write, but with faster commit cycles thanks to SQL-backed metadata.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;3. Ecosystem Support&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Widest engine support (Spark, Flink, Trino, Presto, Hive, Snowflake, Athena, BigQuery, Dremio, DuckDB).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Deep Spark and Databricks integration; expanding connectors for Flink and Trino.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Strong in Spark, Hive, Presto, and AWS (Glue, EMR). Flink support growing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; Native to Flink; Spark and Trino integration improving; also ties to OLAP systems like Doris/StarRocks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake:&lt;/strong&gt; Early-stage, centered on DuckDB; potential for other SQL engines to adopt.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;4. Adoption Trends&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Emerging as the &lt;em&gt;industry standard&lt;/em&gt; for open table formats, with broad vendor alignment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Dominant within Databricks/Spark ecosystems; adoption tied to Databricks customers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Niche but strong in CDC and near real-time use cases; proven at scale in companies like Uber.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; Rising fast in the Flink/streaming community; positioned as the “streaming lakehouse” format.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake:&lt;/strong&gt; Newest entrant, appealing for simplicity and developer-friendliness; adoption still experimental.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Next, we’ll step back and examine &lt;strong&gt;industry trends&lt;/strong&gt; shaping the adoption of these formats and what they signal for the future of the lakehouse ecosystem.&lt;/p&gt;
&lt;h2&gt;Industry Trends in Table Format Adoption&lt;/h2&gt;
&lt;p&gt;The “table format wars” of the past few years are starting to settle into clear patterns of adoption. While no single format dominates every use case, the industry is coalescing around certain choices based on scale, latency, and ecosystem needs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Iceberg as the Default Standard&lt;/strong&gt;&lt;br&gt;
Iceberg has emerged as the most widely supported and vendor-neutral choice. Cloud platforms like AWS, Google, and Snowflake have all added native support, and query engines like Trino, Presto, Hive, and Flink integrate with it out of the box. Its Apache governance and cross-engine compatibility make it the safe long-term bet for enterprises standardizing on a single open format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Delta Lake in the Spark/Databricks World&lt;/strong&gt;&lt;br&gt;
Delta Lake remains the default in Spark- and Databricks-heavy shops. Its simplicity (transaction logs) and seamless batch/stream integration continue to attract teams already invested in Spark. While its ecosystem is narrower than Iceberg’s, Delta Lake’s deep integration with Databricks runtime and machine learning workflows ensures strong adoption in that ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hudi in CDC and Incremental Ingestion&lt;/strong&gt;&lt;br&gt;
Hudi carved out a niche in &lt;strong&gt;change data capture (CDC)&lt;/strong&gt; and &lt;strong&gt;near real-time ingestion&lt;/strong&gt;. Telecom, fintech, and e-commerce companies still rely on Hudi for incremental pipelines, especially on AWS where Glue and EMR make it easy to deploy. While Iceberg and Delta have added incremental features, Hudi’s head start and MOR tables keep it relevant for low-latency ingestion scenarios.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Paimon and the Rise of Streaming Lakehouses&lt;/strong&gt;&lt;br&gt;
As real-time analytics demand grows, Paimon is gaining momentum in the &lt;strong&gt;Flink community&lt;/strong&gt; and among companies building streaming-first pipelines. Its LSM-tree design positions it as the go-to choice for high-velocity data, IoT streams, and CDC-heavy architectures. Although young, its momentum signals a broader shift: the next wave of lakehouse innovation is about &lt;strong&gt;sub-minute freshness&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;DuckLake and Metadata Simplification&lt;/strong&gt;&lt;br&gt;
DuckLake reflects a newer trend: &lt;strong&gt;rethinking metadata management&lt;/strong&gt;. By moving metadata into SQL databases, it dramatically simplifies operations and enables cross-table transactions. Adoption is still experimental, but DuckLake has sparked interest among teams who want lakehouse features without managing complex catalogs or metastores. Its trajectory will likely influence how future formats handle metadata.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Convergence and Interoperability&lt;/strong&gt;&lt;br&gt;
One notable trend: features are converging. Iceberg now supports row-level deletes via delete files; Delta added deletion vectors; Hudi and Paimon both emphasize streaming upserts. Tooling is also evolving toward interoperability: catalog services like Apache Nessie and Apache Polaris aim to support multiple formats, and BI engines increasingly connect to all of them.&lt;/p&gt;
&lt;p&gt;In short:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg&lt;/strong&gt; is becoming the industry’s lingua franca.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta&lt;/strong&gt; thrives in Databricks-first stacks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi&lt;/strong&gt; holds ground in CDC and incremental ingestion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon&lt;/strong&gt; is rising with real-time streaming needs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake&lt;/strong&gt; challenges conventions with SQL-backed simplicity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Next, we’ll wrap up with &lt;strong&gt;guidance on how to choose the right format&lt;/strong&gt; based on your workloads, ecosystem, and data engineering priorities.&lt;/p&gt;
&lt;h2&gt;Choosing the Right Open Table Format&lt;/h2&gt;
&lt;p&gt;With five strong options on the table (Iceberg, Delta Lake, Hudi, Paimon, and DuckLake), the choice depends less on “which is best” and more on &lt;strong&gt;which aligns with your workloads, ecosystem, and priorities&lt;/strong&gt;. Here’s how to think about it:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When to Choose Apache Iceberg&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want the broadest &lt;strong&gt;engine and vendor support&lt;/strong&gt; (Spark, Flink, Trino, Presto, Hive, Dremio, Snowflake, BigQuery, etc.).&lt;/li&gt;
&lt;li&gt;Your workloads are &lt;strong&gt;batch-heavy&lt;/strong&gt; and prioritize consistent snapshots, schema evolution, and large-scale analytics.&lt;/li&gt;
&lt;li&gt;You want to standardize on the &lt;strong&gt;emerging industry default&lt;/strong&gt; with the widest community and neutral Apache governance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When to Choose Delta Lake&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your data stack is &lt;strong&gt;Databricks-first&lt;/strong&gt; or heavily Spark-centric.&lt;/li&gt;
&lt;li&gt;You need seamless &lt;strong&gt;batch + streaming unification&lt;/strong&gt; with Spark Structured Streaming.&lt;/li&gt;
&lt;li&gt;You value Databricks’ ecosystem of optimizations (e.g., Z-order, caching, machine learning integrations).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When to Choose Apache Hudi&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need &lt;strong&gt;frequent upserts and deletes&lt;/strong&gt; on data lakes.&lt;/li&gt;
&lt;li&gt;Your pipelines depend on &lt;strong&gt;incremental consumption&lt;/strong&gt; of data (only new/changed rows since the last commit).&lt;/li&gt;
&lt;li&gt;You want a proven option for &lt;strong&gt;CDC ingestion&lt;/strong&gt; and near real-time pipelines, especially on &lt;strong&gt;AWS Glue/EMR&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When to Choose Apache Paimon&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Your workloads are &lt;strong&gt;streaming-first&lt;/strong&gt;, with high-velocity CDC or IoT data.&lt;/li&gt;
&lt;li&gt;You want to unify &lt;strong&gt;real-time and batch&lt;/strong&gt; processing within the same table.&lt;/li&gt;
&lt;li&gt;You’re already invested in &lt;strong&gt;Apache Flink&lt;/strong&gt; and want a table format purpose-built for it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;When to Choose DuckLake&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You want &lt;strong&gt;simplicity&lt;/strong&gt; in metadata management (SQL instead of JSON/Avro manifests).&lt;/li&gt;
&lt;li&gt;You’re working in &lt;strong&gt;DuckDB/MotherDuck&lt;/strong&gt; environments or need lightweight lakehouse capabilities.&lt;/li&gt;
&lt;li&gt;You value &lt;strong&gt;fast commits, easy debugging, and multi-table atomicity&lt;/strong&gt;, even if the format is newer and less battle-tested.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Final Takeaway&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg&lt;/strong&gt;: the &lt;em&gt;universal standard&lt;/em&gt; for long-term interoperability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta&lt;/strong&gt;: the &lt;em&gt;Databricks/Spark-native&lt;/em&gt; option.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi&lt;/strong&gt;: the &lt;em&gt;incremental/CDC pioneer&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon&lt;/strong&gt;: the &lt;em&gt;streaming-first disruptor&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckLake&lt;/strong&gt;: the &lt;em&gt;metadata simplifier&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No matter which you choose, adopting an open table format is the key to turning your data lake into a true &lt;strong&gt;lakehouse&lt;/strong&gt;: reliable, flexible, and future-proof.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Open table formats are no longer niche; they’re the foundation of the modern data stack. Whether your challenge is batch analytics, real-time ingestion, or simplifying metadata, there’s a format designed to meet your needs. The smart path forward isn’t just picking one blindly, but aligning your choice with your &lt;strong&gt;data velocity, tooling ecosystem, and long-term governance strategy&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In practice, many organizations run more than one format side by side. The good news: as open standards mature, interoperability and ecosystem support are expanding, making it easier to evolve over time without locking yourself into a dead end.&lt;/p&gt;
&lt;p&gt;The lakehouse era is here, and open table formats are its backbone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Get Data Lakehouse Books:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog&quot;&gt;Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/tableformatblog-62P6t&quot;&gt;Apache Polaris: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Architecting an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/strong&gt;
&lt;strong&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Roll&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The 2025 &amp; 2026 Ultimate Guide to the Data Lakehouse and the Data Lakehouse Ecosystem</title><link>https://iceberglakehouse.com/posts/2025-09-2026-guide-to-data-lakehouses/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-09-2026-guide-to-data-lakehouses/</guid><description>
- [Join the Data Lakehouse Community](https://www.datalakehousehub.com)
- [Data Lakehouse Blog Listings](https://lakehouseblogs.com)

*Year-end 2025 ...</description><pubDate>Tue, 23 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;Join the Data Lakehouse Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lakehouseblogs.com&quot;&gt;Data Lakehouse Blog Listings&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Year-end 2025 reflections, looking ahead to 2026&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Over the past few years, data platforms have crossed a tipping point. Rigid, centralized warehouses proved great at trustworthy BI but struggled with diverse data and elastic scale. Open data lakes delivered low-cost storage and freedom of choice, yet lacked the transactional rigor and performance guarantees analytics teams rely on. In 2025, the data &lt;strong&gt;lakehouse&lt;/strong&gt; matured from an idea into an operating model: open table formats, transactional metadata, and multi-engine access over a single, governed body of data.&lt;/p&gt;
&lt;p&gt;This guide distills what changed, why it matters, and how to put it to work. We’ll start by clarifying &lt;em&gt;where warehouses shine and crack&lt;/em&gt;, &lt;em&gt;where lakes empower and swamp&lt;/em&gt;, and why older directory-based table designs (think classic Hive tables) hit scaling and consistency limits. From there, we’ll show how modern table formats (Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon) solved those limits by tracking &lt;strong&gt;files and snapshots&lt;/strong&gt; instead of folders, enabling ACID transactions, time travel, and intelligent pruning at petabyte scale.&lt;/p&gt;
&lt;p&gt;2025 also cemented a practical reference architecture. A successful lakehouse now looks less like a monolith and more like a &lt;strong&gt;layered system&lt;/strong&gt;: cloud object storage for durability and cost, an open table format for transactions and evolution, ingestion that blends batch and streaming, a catalog for governance and discoverability, and a flexible consumption layer that serves SQL, BI, notebooks, and AI agents with consistent semantics.&lt;/p&gt;
&lt;p&gt;Why now? Three forces converged this year:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Streaming-by-default&lt;/strong&gt; workloads turned “daily batch” into “continuous micro-batch,” demanding exactly-once commits and small-file management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI and agentic workflows&lt;/strong&gt; moved from proofs of concept to production, generating highly variable, ad-hoc queries that require low-latency acceleration without brittle hand-tuning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open interoperability&lt;/strong&gt; became table stakes: organizations want one source of truth read by many engines, not many copies of truth managed by many teams.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This guide is an accessible deep dive for Data Engineers and Data Architects. You’ll get a clear mental model of the formats, their metadata structures, and the operational playbook: compaction, snapshot expiration, partition evolution, and reflection/materialization strategies for speed at scale. We’ll also survey the ingestion and streaming ecosystem (connectors, CDC, stream processors), Python-native options for lakehouse workloads (Polars, DuckDB, DataFusion, Daft, Dask), and emerging edge patterns where inference runs close to the data.&lt;/p&gt;
&lt;p&gt;Finally, we’ll close with a curated reading list of books and long-form resources that stood out in 2025, along with pragmatic guidance on choosing components in 2026 without locking yourself in. If your mandate is to deliver trustworthy, performant, and AI-ready analytics on open data, this guide is your map.&lt;/p&gt;
&lt;h2&gt;The Challenges in Modern Data Architecture&lt;/h2&gt;
&lt;p&gt;The rise of the lakehouse didn’t happen in a vacuum. It emerged as a response to the very real challenges of &lt;em&gt;yesterday’s dominant architectures&lt;/em&gt;, data warehouses and data lakes. Understanding their strengths and weaknesses sets the stage for why the lakehouse model became inevitable.&lt;/p&gt;
&lt;h3&gt;Data Warehouses: Strength in Structure, Weakness in Flexibility&lt;/h3&gt;
&lt;p&gt;Data warehouses provided the first true enterprise-scale analytics platforms. They enforced &lt;strong&gt;schema-on-write&lt;/strong&gt;, ensuring data quality and making business intelligence consistent across the organization. For years, this was invaluable: clean, curated, trusted dashboards.&lt;/p&gt;
&lt;p&gt;But the cracks widened in the 2010s and 2020s:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rigid schemas:&lt;/strong&gt; Every change to a source system meant heavy ETL work to keep the warehouse schema in sync. New data types, JSON, images, sensor streams, didn’t fit neatly into tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High costs:&lt;/strong&gt; Warehouses couple compute and storage. Scaling for more data or users often meant overpaying for resources you didn’t fully use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency in data freshness:&lt;/strong&gt; The ETL pipelines that fed warehouses ran daily or hourly, leaving decision-makers working with stale data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limited AI/ML support:&lt;/strong&gt; Warehouses excel at structured SQL queries but aren’t designed to handle the diverse, unstructured, and large-scale data needed for machine learning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Warehouses solved consistency but at the price of agility.&lt;/p&gt;
&lt;h3&gt;Data Lakes: Flexibility Meets the “Data Swamp”&lt;/h3&gt;
&lt;p&gt;Enter data lakes. By shifting to &lt;strong&gt;schema-on-read&lt;/strong&gt;, organizations gained the freedom to store &lt;em&gt;anything&lt;/em&gt;: logs, media, documents, semi-structured JSON, raw database dumps. Storage costs plummeted thanks to cloud object stores like S3 and ADLS, and data scientists loved having raw, unmodeled data at their fingertips.&lt;/p&gt;
&lt;p&gt;But flexibility introduced new pain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data quality issues:&lt;/strong&gt; With no enforced schema, data lakes quickly devolved into “data swamps”: vast, uncurated collections of files that few trusted.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poor governance:&lt;/strong&gt; Security, lineage, and access controls were bolted on, often inconsistently. Teams struggled to know what data existed and whether it was safe to use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance bottlenecks:&lt;/strong&gt; Query engines like Hive, Spark, or Presto had to scan massive directories of files. Without transactional guarantees, concurrent writes could corrupt datasets or leave analysts with incomplete results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High operational overhead:&lt;/strong&gt; Managing partitions, small files, and manual compactions became part of daily operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Lakes solved agility but at the price of trust.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The gap was clear:&lt;/strong&gt; warehouses offered &lt;strong&gt;trust but no flexibility&lt;/strong&gt;, while lakes offered &lt;strong&gt;flexibility but no trust&lt;/strong&gt;. By the early 2020s, organizations wanted the best of both, structured reliability &lt;em&gt;and&lt;/em&gt; open flexibility, laying the foundation for the modern data lakehouse.&lt;/p&gt;
&lt;h2&gt;What Is Hive and the Challenges of Hive Tables&lt;/h2&gt;
&lt;p&gt;Before the lakehouse era, &lt;strong&gt;Apache Hive&lt;/strong&gt; was the workhorse that made large-scale data in Hadoop clusters queryable with SQL. Hive introduced the &lt;em&gt;Hive Metastore&lt;/em&gt;, which stored table definitions (schemas, partitions, and locations), enabling analysts to run SQL-like queries over files sitting in HDFS or cloud storage. It was one of the first major attempts to give a data-lake-like environment a relational feel.&lt;/p&gt;
&lt;p&gt;But Hive’s approach, tracking &lt;strong&gt;directories of files&lt;/strong&gt; as tables, brought structural limitations that became bottlenecks as datasets and expectations grew.&lt;/p&gt;
&lt;h3&gt;Directory-Centric Table Management&lt;/h3&gt;
&lt;p&gt;In Hive, each table maps to a folder, and each partition to a subfolder. Query engines scan these directories at runtime to discover files. While this worked when data volumes were modest, modern cloud object stores made directory scans painfully slow. Listing millions of files before executing a query often dominated total query time.&lt;/p&gt;
&lt;h3&gt;Lack of ACID Transactions&lt;/h3&gt;
&lt;p&gt;Hive tables were essentially &lt;strong&gt;append-only&lt;/strong&gt;. Without built-in transactions, concurrent writers risked corrupting tables, and readers could encounter partial data during an update. Later ACID extensions attempted to patch this with delta files and compaction, but these added complexity and overhead, and weren’t consistently supported across engines.&lt;/p&gt;
&lt;h3&gt;Painful Updates and Schema Evolution&lt;/h3&gt;
&lt;p&gt;Modifying data in Hive tables was inefficient:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Updates and deletes&lt;/strong&gt; required rewriting entire partitions or entire tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema changes&lt;/strong&gt; (like renaming a column) often broke downstream jobs or forced costly rewrites.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result: rigid datasets that were expensive to maintain and slow to evolve with business needs.&lt;/p&gt;
&lt;h3&gt;The Small Files Problem&lt;/h3&gt;
&lt;p&gt;Hive ingestion pipelines, especially those running frequently, created floods of small files. Query performance degraded because engines had to open and read from thousands of tiny files. Without built-in small-file management, engineers had to implement periodic compaction jobs to maintain performance.&lt;/p&gt;
&lt;p&gt;Hive was a critical stepping stone: it proved the value of SQL on big data and inspired the metadata-driven approach all lakehouse formats now follow. But its reliance on directory-based tracking and limited support for transactional, evolving workloads ultimately constrained its ability to power the next generation of data platforms.&lt;/p&gt;
&lt;h2&gt;The Innovation of Tracking Tables by Tracking Files vs. Tracking Directories&lt;/h2&gt;
&lt;p&gt;The turning point from Hive-style tables to modern lakehouse formats came with a deceptively simple idea:&lt;br&gt;
&lt;strong&gt;stop tracking directories of files, and start tracking individual files in metadata.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;Why Directories Fall Short&lt;/h3&gt;
&lt;p&gt;Directory-based tracking (as in Hive) meant that the engine had to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;List every file in a partition directory before running a query.&lt;/li&gt;
&lt;li&gt;Infer table state from the file system at query time.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This created major problems in cloud storage, where operations like &lt;code&gt;LIST&lt;/code&gt; are slow and expensive. It also made concurrency hard: two jobs writing to the same folder could overwrite each other’s files without the catalog knowing until it was too late.&lt;/p&gt;
&lt;h3&gt;File-Level Tracking&lt;/h3&gt;
&lt;p&gt;Modern table formats introduced &lt;strong&gt;file-level manifests&lt;/strong&gt;: structured metadata that explicitly records every file that belongs to a table, along with statistics about its contents. Instead of scanning folders, engines read this compact metadata to know exactly which files to use.&lt;/p&gt;
&lt;p&gt;Benefits include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster planning:&lt;/strong&gt; Queries skip expensive directory listings, instead reading a few manifest files that describe thousands of data files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Atomic commits:&lt;/strong&gt; Updates create a new manifest (or snapshot) in a single operation. Readers either see the old version or the new one, never a half-written state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema evolution:&lt;/strong&gt; Metadata can track multiple schema versions, allowing columns to be added, renamed, or dropped without rewriting entire datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition independence:&lt;/strong&gt; Partitioning is recorded in metadata, not folder structures, enabling &lt;em&gt;hidden partitions&lt;/em&gt; and even partition evolution over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-grained deletes and upserts:&lt;/strong&gt; Since every file is individually tracked, formats can support row-level operations by marking old files as deleted and adding new ones.&lt;/li&gt;
&lt;/ul&gt;
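&lt;p&gt;To make file-level planning concrete, here is a minimal Python sketch. The file names and the &lt;code&gt;min_ts&lt;/code&gt;/&lt;code&gt;max_ts&lt;/code&gt; column statistic are invented for illustration; the point is that the engine reads compact manifest entries and prunes files by their statistics, with no directory listing at all:&lt;/p&gt;

```python
# Toy manifest: each entry records a data file plus per-column statistics.
# File names and the min_ts/max_ts stat are invented for illustration.
manifest = [
    {"file": "data/f1.parquet", "min_ts": 100, "max_ts": 199},
    {"file": "data/f2.parquet", "min_ts": 200, "max_ts": 299},
    {"file": "data/f3.parquet", "min_ts": 300, "max_ts": 399},
]

def plan_scan(manifest, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps the query range."""
    return [entry["file"] for entry in manifest
            if entry["max_ts"] >= lo and entry["min_ts"] <= hi]

print(plan_scan(manifest, 250, 320))
# ['data/f2.parquet', 'data/f3.parquet']
```

&lt;p&gt;Real manifests carry more than this sketch shows (null counts, partition values, file sizes), but the pruning idea is the same: planning reads a little metadata instead of listing a lot of storage.&lt;/p&gt;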
&lt;h3&gt;Snapshots and Time Travel&lt;/h3&gt;
&lt;p&gt;By treating metadata itself as a versioned object, table formats unlocked &lt;strong&gt;time travel&lt;/strong&gt;: the ability to query data as it existed at any point in time. Each snapshot references a specific set of files, creating a complete, immutable view of the table at that moment.&lt;/p&gt;
&lt;p&gt;This shift, from directories to files, from implicit state to explicit metadata, transformed raw data lakes into reliable, database-like systems. It’s the foundation that made the lakehouse architecture possible and paved the way for the new generation of table formats.&lt;/p&gt;
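&lt;p&gt;The snapshot mechanics can be sketched in a few lines of Python. This is a toy model, not any real format’s API: each commit appends an immutable snapshot and then swaps a single “current” pointer, which is what makes reads atomic and time travel cheap:&lt;/p&gt;

```python
class Table:
    """Toy snapshot-based table; illustrative only, not a real format's API."""

    def __init__(self):
        self.snapshots = []      # full history, never mutated after commit
        self.current = None      # index of the live snapshot

    def commit(self, files, ts):
        # A commit writes a complete, immutable file list...
        self.snapshots.append({"ts": ts, "files": tuple(files)})
        # ...then atomically swaps one pointer; readers see old or new, never a mix.
        self.current = len(self.snapshots) - 1

    def scan(self):
        return self.snapshots[self.current]["files"]

    def scan_as_of(self, ts):
        # Time travel: latest snapshot committed at or before `ts`.
        return [s for s in self.snapshots if s["ts"] <= ts][-1]["files"]

t = Table()
t.commit(["f1.parquet"], ts=10)
t.commit(["f1.parquet", "f2.parquet"], ts=20)
print(t.scan())          # ('f1.parquet', 'f2.parquet')
print(t.scan_as_of(15))  # ('f1.parquet',)
```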
&lt;h2&gt;The New Generation of Data Lake Tables: Iceberg, Delta, Hudi, and Paimon&lt;/h2&gt;
&lt;p&gt;With file-level tracking as the breakthrough, several open-source projects emerged to redefine how data lakes operate. These &lt;strong&gt;table formats&lt;/strong&gt; provide the transactional, metadata-rich foundation that transforms a raw data lake into a full-fledged lakehouse. Each project shares core principles, ACID transactions, schema evolution, and time travel, but emphasizes different strengths.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Born at Netflix and now a top-level Apache project, &lt;strong&gt;Iceberg&lt;/strong&gt; is designed for &lt;strong&gt;engine-agnostic interoperability&lt;/strong&gt;. Its hierarchical metadata structure (table metadata → manifest lists → manifest files) allows scaling to billions of files while still enabling fast query planning.&lt;br&gt;
Key features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hidden partitioning and partition evolution.&lt;/li&gt;
&lt;li&gt;Broad engine support (Spark, Flink, Trino, Presto, Dremio, and more).&lt;/li&gt;
&lt;li&gt;Strong focus on openness through the &lt;strong&gt;REST catalog API&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Rich schema evolution, including column renames and type promotions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Iceberg has become a de facto standard for enterprises seeking an open, future-proof lakehouse.&lt;/p&gt;
&lt;h3&gt;Delta Lake&lt;/h3&gt;
&lt;p&gt;Originally created by Databricks, &lt;strong&gt;Delta Lake&lt;/strong&gt; popularized the concept of a transactional log for data lakes. It uses an &lt;strong&gt;append-only transaction log&lt;/strong&gt; (&lt;code&gt;_delta_log&lt;/code&gt;) with JSON entries and Parquet checkpoints to track file state.&lt;br&gt;
Key features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ACID transactions tightly integrated with Apache Spark.&lt;/li&gt;
&lt;li&gt;Time travel and schema evolution.&lt;/li&gt;
&lt;li&gt;Optimizations like &lt;strong&gt;Z-Ordering&lt;/strong&gt; for clustering.&lt;/li&gt;
&lt;li&gt;Deep integration with the Databricks ecosystem, though community adoption beyond Spark is growing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Delta remains particularly strong for teams standardized on Databricks or Spark-centric workflows.&lt;/p&gt;
&lt;h3&gt;Apache Hudi&lt;/h3&gt;
&lt;p&gt;Developed at Uber, &lt;strong&gt;Hudi&lt;/strong&gt; was one of the earliest attempts to bring database-like capabilities to data lakes. It excels at &lt;strong&gt;incremental processing&lt;/strong&gt; and &lt;strong&gt;change data capture (CDC)&lt;/strong&gt;.&lt;br&gt;
Key features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Two storage modes: &lt;strong&gt;Copy-on-Write (CoW)&lt;/strong&gt; for read-optimized workloads, and &lt;strong&gt;Merge-on-Read (MoR)&lt;/strong&gt; for write-heavy, near-real-time use cases.&lt;/li&gt;
&lt;li&gt;Native upserts and deletes.&lt;/li&gt;
&lt;li&gt;Built-in indexing for record-level operations.&lt;/li&gt;
&lt;li&gt;Tight integrations with Spark, Flink, and Hive.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hudi is especially attractive for pipelines that demand frequent updates and streaming ingestion.&lt;/p&gt;
&lt;h3&gt;Apache Paimon&lt;/h3&gt;
&lt;p&gt;A newer entrant, &lt;strong&gt;Paimon&lt;/strong&gt; (formerly Flink Table Store) emphasizes &lt;strong&gt;streaming-first lakehouse design&lt;/strong&gt;. It uses an &lt;strong&gt;LSM-tree style file organization&lt;/strong&gt; to unify batch and stream processing.&lt;br&gt;
Key features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Native CDC and incremental queries.&lt;/li&gt;
&lt;li&gt;Deep integration with Apache Flink.&lt;/li&gt;
&lt;li&gt;Snapshot isolation with continuous compaction.&lt;/li&gt;
&lt;li&gt;Growing ecosystem to support Spark, Hive, and beyond.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Paimon fills a niche where &lt;strong&gt;real-time data ingestion and analytics converge&lt;/strong&gt;, making it compelling for event-driven architectures.&lt;/p&gt;
&lt;p&gt;Together, these formats represent the evolution from &lt;strong&gt;directory-based tables&lt;/strong&gt; to &lt;strong&gt;transactional, metadata-driven lakehouse systems&lt;/strong&gt;. Each brings a unique philosophy: Iceberg for openness, Delta for Spark-native simplicity, Hudi for streaming updates, and Paimon for unified batch-stream processing. Understanding their trade-offs is critical when designing a modern data platform.&lt;/p&gt;
&lt;h2&gt;Fundamental Architecture of the Data Lakehouse&lt;/h2&gt;
&lt;p&gt;At its core, the &lt;strong&gt;data lakehouse&lt;/strong&gt; is not a single product but an architectural pattern. It blends the scalability and openness of data lakes with the transactional reliability and governance of data warehouses. By 2025, a consensus emerged: a lakehouse succeeds when it clearly defines &lt;strong&gt;layers&lt;/strong&gt;, each with its own role but working together as a cohesive whole.&lt;/p&gt;
&lt;h3&gt;Storage as the Foundation&lt;/h3&gt;
&lt;p&gt;The lakehouse begins with &lt;strong&gt;cloud object storage&lt;/strong&gt; (e.g., Amazon S3, Azure Data Lake Storage, Google Cloud Storage). This layer offers low-cost, durable, infinitely scalable storage for all file types. Unlike warehouses, it decouples compute from storage: multiple engines can read from the same data without duplicating it.&lt;/p&gt;
&lt;h3&gt;Metadata and Table Formats&lt;/h3&gt;
&lt;p&gt;On top of storage sits a &lt;strong&gt;table format&lt;/strong&gt;, the metadata layer that turns a set of files into a logical table. Formats like Iceberg, Delta, Hudi, and Paimon bring:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID transactions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema enforcement and evolution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition pruning and statistics&lt;/strong&gt; for efficient queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time travel&lt;/strong&gt; through snapshot-based metadata&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This layer is what transforms a “data swamp” into structured, queryable datasets.&lt;/p&gt;
&lt;h3&gt;Catalog and Governance&lt;/h3&gt;
&lt;p&gt;A &lt;strong&gt;catalog&lt;/strong&gt; connects metadata to the outside world. It tracks what tables exist and their locations, while enforcing access policies and governance rules. Think of it as the bridge between storage and consumption. Examples include Hive Metastore, AWS Glue, Unity Catalog, Dremio Catalog, and open-source options like Nessie or Apache Polaris.&lt;/p&gt;
&lt;h3&gt;Compute and Federation&lt;/h3&gt;
&lt;p&gt;Query engines like &lt;strong&gt;Dremio, Trino, Spark, and Flink&lt;/strong&gt; sit on top, accessing tables via the catalog. These engines provide federation, joining and querying data from multiple systems, and execute transformations, BI queries, or machine learning pipelines. The lakehouse architecture allows multiple engines to share the same data without conflict.&lt;/p&gt;
&lt;h3&gt;Consumption and Semantics&lt;/h3&gt;
&lt;p&gt;Finally, end users connect through &lt;strong&gt;BI dashboards, notebooks, or AI systems&lt;/strong&gt;. A semantic layer often sits here, defining consistent metrics and business concepts across tools. This ensures a “single version of truth” for everyone consuming data.&lt;/p&gt;
&lt;p&gt;This layered design (storage, table format, catalog, compute, and consumption) has become the reference architecture for the modern data lakehouse. It solves the warehouse vs. lake tradeoff by delivering &lt;strong&gt;flexibility, trust, and performance in one unified stack&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Metadata Structures Across Modern Table Formats&lt;/h2&gt;
&lt;p&gt;While all lakehouse table formats share the principle of tracking files rather than directories, each one implements its own &lt;strong&gt;metadata architecture&lt;/strong&gt;. Understanding these differences is crucial for choosing the right format and for operating them at scale.&lt;/p&gt;
&lt;h3&gt;Apache Iceberg&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snapshots:&lt;/strong&gt; Every commit creates a new snapshot that references a set of files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Manifests:&lt;/strong&gt; Each snapshot points to &lt;em&gt;manifest lists&lt;/em&gt;, which then point to &lt;em&gt;manifest files&lt;/em&gt;. These manifest files contain the actual list of data files, along with stats like min/max values for columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table Metadata File:&lt;/strong&gt; A JSON/Avro file storing schema versions, partition specs, snapshot history, and pointers to the current snapshot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Hierarchical design scales to billions of files, supports hidden partitioning, and makes time travel lightweight.&lt;/li&gt;
&lt;/ul&gt;
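&lt;p&gt;The hierarchy is easiest to see as nested pointers. The sketch below mocks it with in-memory dictionaries (all snapshot IDs, manifest names, and paths are invented; this mirrors the shape of Iceberg metadata, not its actual schema). Resolving a query walks table metadata → manifest list → manifest files to arrive at the exact data-file set:&lt;/p&gt;

```python
# Invented IDs and paths; mirrors the *shape* of Iceberg metadata, not its schema.
table_metadata = {
    "current-snapshot-id": 2,
    "snapshots": {1: {"manifest-list": "ml-1"}, 2: {"manifest-list": "ml-2"}},
}
manifest_lists = {"ml-1": ["m-a"], "ml-2": ["m-a", "m-b"]}
manifests = {
    "m-a": ["s3://bucket/t/data/00000.parquet"],
    "m-b": ["s3://bucket/t/data/00001.parquet"],
}

def data_files(meta):
    """Walk: table metadata -> manifest list -> manifest files -> data files."""
    snapshot = meta["snapshots"][meta["current-snapshot-id"]]
    files = []
    for manifest in manifest_lists[snapshot["manifest-list"]]:
        files.extend(manifests[manifest])
    return files

print(data_files(table_metadata))
# ['s3://bucket/t/data/00000.parquet', 's3://bucket/t/data/00001.parquet']
```

&lt;p&gt;Because older snapshots keep their own manifest-list pointers, time travel is just a walk starting from a different snapshot ID.&lt;/p&gt;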
&lt;h3&gt;Delta Lake&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Transaction Log:&lt;/strong&gt; All operations are recorded in an append-only &lt;code&gt;_delta_log&lt;/code&gt; directory as JSON files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Checkpoints:&lt;/strong&gt; Periodically, the log is compacted into Parquet checkpoint files for faster reads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table State:&lt;/strong&gt; Current table state is reconstructed by combining the latest checkpoint with newer JSON entries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Simple, linear log model tightly integrated with Spark; efficient for workloads within the Databricks ecosystem.&lt;/li&gt;
&lt;/ul&gt;
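&lt;p&gt;The log-replay model can be sketched as follows (file names are invented, and real Delta actions carry far more detail, paths, statistics, and transaction info): the current state is the latest checkpoint plus every newer add/remove entry, applied in order:&lt;/p&gt;

```python
# File names are invented; real Delta actions carry paths, stats, and txn info.
checkpoint = {"files": {"part-0001.parquet", "part-0002.parquet"}}
log_entries = [
    {"add": "part-0003.parquet"},     # a newer commit adds a file
    {"remove": "part-0001.parquet"},  # a later commit logically deletes one
]

def current_state(checkpoint, entries):
    """Replay newer log entries on top of the checkpointed file set."""
    files = set(checkpoint["files"])
    for action in entries:
        if "add" in action:
            files.add(action["add"])
        if "remove" in action:
            files.discard(action["remove"])
    return files

print(sorted(current_state(checkpoint, log_entries)))
# ['part-0002.parquet', 'part-0003.parquet']
```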
&lt;h3&gt;Apache Hudi&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Timeline:&lt;/strong&gt; A series of commit, deltacommit, and compaction files in a &lt;code&gt;.hoodie&lt;/code&gt; directory describe changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage Modes:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Copy-on-Write (CoW):&lt;/em&gt; rewrites files on update.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Merge-on-Read (MoR):&lt;/em&gt; writes delta logs and later compacts them with base files.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Indexing:&lt;/strong&gt; Optional record-level indexes accelerate upserts and deletes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Optimized for streaming ingestion and CDC use cases, with incremental pull queries.&lt;/li&gt;
&lt;/ul&gt;
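&lt;p&gt;The Merge-on-Read idea reduces to merging a columnar base file with a newer row-level change log at read time, keyed by record. A simplified sketch (record keys and payloads are invented):&lt;/p&gt;

```python
# Simplified Merge-on-Read: a base file plus a row-level delta log,
# merged per record key at read time. Keys and payloads are invented.
base = {"u1": {"name": "Ada"}, "u2": {"name": "Alan"}}
delta_log = [
    ("upsert", "u2", {"name": "Alan T."}),
    ("delete", "u1", None),
    ("upsert", "u3", {"name": "Grace"}),
]

def read_merged(base, log):
    """Readers see base records with newer log entries applied on top."""
    view = dict(base)
    for op, key, payload in log:
        if op == "upsert":
            view[key] = payload
        elif op == "delete":
            view.pop(key, None)
    return view

print(read_merged(base, delta_log))
```

&lt;p&gt;Copy-on-Write skips the read-side merge entirely by rewriting the base file at update time; the trade-off is cheaper reads for more expensive writes.&lt;/p&gt;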
&lt;h3&gt;Apache Paimon&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;LSM-Tree Inspired:&lt;/strong&gt; Uses log segments and compaction levels, optimized for high-frequency updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots:&lt;/strong&gt; Metadata tracks current file sets and supports branching for consistent queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Changelog Streams:&lt;/strong&gt; Natively emits row-level changes for downstream streaming consumers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Built for unified batch and streaming, with strong Flink integration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In summary:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg emphasizes &lt;strong&gt;scalability and cross-engine interoperability&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Delta Lake focuses on &lt;strong&gt;simplicity and Spark-native performance&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Hudi delivers &lt;strong&gt;real-time upserts and incremental views&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Paimon pioneers &lt;strong&gt;streaming-first design with changelogs&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These metadata designs reflect each project’s philosophy, and they form the backbone of how the modern lakehouse balances flexibility, consistency, and speed.&lt;/p&gt;
&lt;h2&gt;Implementing a Lakehouse: The Five Core Layers&lt;/h2&gt;
&lt;p&gt;Designing a modern lakehouse isn’t about choosing a single tool; it’s about assembling the right components across &lt;strong&gt;five architectural layers&lt;/strong&gt;. Each layer has its own responsibilities, and together they create a system that is scalable, governed, and usable for analytics and AI.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;Deep Dive into the 5 layers is the core of the book &amp;quot;Architecting an Apache Iceberg Lakehouse&amp;quot;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;1. Storage Layer&lt;/h3&gt;
&lt;p&gt;This is the foundation: &lt;strong&gt;low-cost, durable storage&lt;/strong&gt; capable of holding structured, semi-structured, and unstructured data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Common choices: Amazon S3, Azure Data Lake Storage, Google Cloud Storage, or on-prem HDFS/MinIO.&lt;/li&gt;
&lt;li&gt;Data is stored in open formats such as Parquet, ORC, or Avro.&lt;/li&gt;
&lt;li&gt;Separation of storage from compute allows multiple engines to share the same data without duplication.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Table Format Layer&lt;/h3&gt;
&lt;p&gt;Here, the &lt;strong&gt;metadata format&lt;/strong&gt; gives structure and reliability to raw files.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Options: Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon.&lt;/li&gt;
&lt;li&gt;Capabilities include ACID transactions, schema evolution, partition pruning, and time travel.&lt;/li&gt;
&lt;li&gt;This layer transforms the lake from a “data swamp” into a transactional system of record.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Ingestion Layer&lt;/h3&gt;
&lt;p&gt;The ingestion layer handles &lt;strong&gt;data movement into the lakehouse&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Batch ingestion:&lt;/strong&gt; Tools like Fivetran, Airbyte, Estuary, Hevo or custom ETL jobs land data periodically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streaming ingestion:&lt;/strong&gt; Systems like Confluent, Aiven, StreamNative, RisingWave, Kafka, Redpanda, Pulsar, or Flink push events into table formats in near real-time.&lt;/li&gt;
&lt;li&gt;Goal: balance freshness, cost, and reliability while avoiding problems like excessive small files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Catalog &amp;amp; Governance Layer&lt;/h3&gt;
&lt;p&gt;The catalog is the &lt;strong&gt;central registry&lt;/strong&gt; of your tables, schemas, and access rules.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Examples: Hive Metastore, AWS Glue, Unity Catalog, Dremio Catalog, open-source catalogs like Nessie or Apache Polaris.&lt;/li&gt;
&lt;li&gt;Responsibilities: discovery, schema validation, access control, lineage, and auditability.&lt;/li&gt;
&lt;li&gt;Acts as the bridge between storage and compute, ensuring data is both secure and discoverable.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Federation &amp;amp; Consumption Layer&lt;/h3&gt;
&lt;p&gt;At the top, query engines and semantic layers make data consumable.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query federation engines&lt;/strong&gt; like Dremio or Trino can join lakehouse tables with other sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumption tools&lt;/strong&gt; include BI platforms (Tableau, Power BI, Looker), notebooks, and AI agents.&lt;/li&gt;
&lt;li&gt;A semantic layer ensures consistency by defining metrics and business terms across all tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt; these five layers form the blueprint of every successful lakehouse. They separate concerns (storage, metadata, movement, governance, and consumption) while enabling interoperability. The result is a unified platform that scales with data growth, adapts to new workloads, and keeps analytics both flexible and trustworthy.&lt;/p&gt;
&lt;h2&gt;Lakehouse Ingestion&lt;/h2&gt;
&lt;p&gt;Once the foundational layers are in place, the next challenge is &lt;strong&gt;getting data into the lakehouse&lt;/strong&gt; efficiently and reliably. Ingestion strategies determine not only data freshness but also table health, file organization, and downstream usability.&lt;/p&gt;
&lt;h3&gt;Batch Ingestion&lt;/h3&gt;
&lt;p&gt;Batch remains the most common entry point:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ETL/ELT Services:&lt;/strong&gt; Tools like Fivetran and Airbyte extract data from SaaS applications, relational databases, and APIs, then land it in cloud object storage. Many now write directly into open table formats (Iceberg, Delta, Hudi) rather than dumping raw CSVs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom Jobs:&lt;/strong&gt; Python, Spark, or dbt pipelines often transform and load data on a schedule, whether nightly, hourly, or in micro-batches.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advantages:&lt;/strong&gt; Predictable loads, simpler monitoring, and often easier cost control.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Challenges:&lt;/strong&gt; Data freshness is limited by schedule, and frequent batches can generate lots of small files if not managed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Streaming Ingestion&lt;/h3&gt;
&lt;p&gt;Real-time data is no longer a luxury; it’s an expectation in 2026:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Event Streams:&lt;/strong&gt; Platforms like Apache Kafka, Redpanda, Aiven, Confluent, RisingWave, StreamNative, and Apache Pulsar capture streams of events (e.g., clickstream, IoT data) and push them into the lakehouse using connectors or stream processors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CDC Pipelines:&lt;/strong&gt; Change data capture tools (Debezium, Estuary Flow) replicate updates from operational databases into Iceberg or Delta tables with low latency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stream Processing Engines:&lt;/strong&gt; Apache Flink and Spark Structured Streaming can apply transformations inline, then commit results directly to lakehouse tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Small File Management&lt;/h3&gt;
&lt;p&gt;One critical concern in ingestion is avoiding a &lt;strong&gt;small files problem&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each tiny file adds overhead to query planning.&lt;/li&gt;
&lt;li&gt;Solutions include writer-side batching, file-size thresholds, and downstream compaction jobs.&lt;/li&gt;
&lt;li&gt;Modern ingestion platforms often integrate with the table format’s APIs to commit larger, optimized files.&lt;/li&gt;
&lt;/ul&gt;
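&lt;p&gt;A compaction planner is, at heart, a bin-packing pass over file sizes. Here is a simplified greedy sketch (names and sizes are invented, and the 128&amp;nbsp;MB target is just an example) that groups small files into roughly target-sized rewrite groups:&lt;/p&gt;

```python
def plan_compaction(file_sizes, target_mb=128):
    """Greedily group files into compaction groups of roughly target_mb.

    file_sizes: dict of file name -> size in MB (names and sizes illustrative).
    """
    groups, current, total = [], [], 0
    # Largest-first keeps big files from fragmenting the remaining groups.
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and total + size > target_mb:
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        groups.append(current)
    return groups

print(plan_compaction({"a": 100, "b": 60, "c": 40, "d": 30}))
# [['a'], ['b', 'c'], ['d']]
```

&lt;p&gt;Each group would then be rewritten as a single larger file, replacing its members in the table’s metadata in one commit.&lt;/p&gt;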
&lt;h3&gt;Reliability and Governance&lt;/h3&gt;
&lt;p&gt;Ingestion isn’t just about moving bytes; it’s about ensuring &lt;strong&gt;trustworthy pipelines&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Idempotency:&lt;/strong&gt; Re-runs shouldn’t create duplicates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Drift Handling:&lt;/strong&gt; New source columns should be gracefully added to lakehouse tables with metadata updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring:&lt;/strong&gt; Data observability platforms (Monte Carlo, Bigeye) can alert when loads fail or data volumes deviate unexpectedly.&lt;/li&gt;
&lt;/ul&gt;
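&lt;p&gt;Idempotency, in particular, is usually achieved by merging on a stable record key, so replaying the same batch changes nothing. A minimal sketch (the row shape and the &lt;code&gt;id&lt;/code&gt; key are hypothetical):&lt;/p&gt;

```python
# Hypothetical row shape: each record carries a stable "id" key.
def merge_batch(table, batch, key="id"):
    """Upsert a batch by key; replaying the same batch is a no-op."""
    by_key = {row[key]: row for row in table}
    for row in batch:
        by_key[row[key]] = row   # replace the existing row or insert a new one
    return list(by_key.values())

table = [{"id": 1, "v": "a"}]
batch = [{"id": 1, "v": "a2"}, {"id": 2, "v": "b"}]
once = merge_batch(table, batch)
twice = merge_batch(once, batch)   # re-run of the same batch
print(once == twice)               # True: the load is idempotent
```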
&lt;p&gt;In short, ingestion is where &lt;strong&gt;data quality meets data freshness&lt;/strong&gt;. A strong strategy combines &lt;strong&gt;batch tools for breadth&lt;/strong&gt; (ingesting from many SaaS and DB sources) with &lt;strong&gt;streaming pipelines for depth&lt;/strong&gt; (real-time operational data), all while keeping file sizes healthy and metadata consistent.&lt;/p&gt;
&lt;h2&gt;Lakehouse Streaming&lt;/h2&gt;
&lt;p&gt;If 2025 was the year of “batch meets real-time,” then 2026 is the year of &lt;strong&gt;streaming-first lakehouses&lt;/strong&gt;. Instead of treating streaming as an afterthought, the modern lakehouse expects ingestion, processing, and query serving to happen continuously. This shift is powered by both table format features (incremental commits, changelogs) and by the streaming ecosystem maturing around open lakehouse standards.&lt;/p&gt;
&lt;h3&gt;Confluent&lt;/h3&gt;
&lt;p&gt;As the commercial steward of Apache Kafka, &lt;strong&gt;Confluent&lt;/strong&gt; has led in making streams and tables converge. Their &lt;strong&gt;Tableflow and Stream Designer&lt;/strong&gt; products now write directly to Iceberg and Delta Lake, providing exactly-once guarantees and seamless CDC ingestion. This reduces the need for custom Flink or Spark jobs: Kafka topics become queryable lakehouse tables in real time.&lt;/p&gt;
&lt;h3&gt;Aiven&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Aiven&lt;/strong&gt;, a managed open-source data platform provider, has expanded its Kafka, Flink, and Postgres services with native &lt;strong&gt;Iceberg integrations&lt;/strong&gt;. Their goal: give teams a turnkey way to capture events, run stream transformations, and land results directly into a governed lakehouse, without stitching together multiple vendors.&lt;/p&gt;
&lt;h3&gt;Redpanda&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Redpanda&lt;/strong&gt; brings Kafka-API-compatible streaming with higher throughput and lower latency, and in 2025 introduced &lt;strong&gt;Iceberg Topics&lt;/strong&gt;. With this feature, every topic can materialize into an Iceberg table automatically, combining log storage with table metadata. This means developers can treat the same data as both a stream and a table, depending on the workload.&lt;/p&gt;
&lt;h3&gt;StreamNative&lt;/h3&gt;
&lt;p&gt;Built around Apache Pulsar, &lt;strong&gt;StreamNative&lt;/strong&gt; pushes the lakehouse deeper into event-driven architectures. Pulsar’s tiered storage, combined with integrations for Iceberg and Delta, means historical message backlogs can be instantly queryable as tables. Their work on unifying messaging and lakehouse storage blurs the boundary between stream broker and data platform.&lt;/p&gt;
&lt;h3&gt;RisingWave&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;RisingWave&lt;/strong&gt; focuses on &lt;strong&gt;streaming databases&lt;/strong&gt;: continuously maintaining materialized views over streams. Its integration with Iceberg allows those real-time views to be published directly into the lakehouse, governed alongside batch data. This bridges operational analytics (e.g., monitoring metrics in near real time) with historical analytics in the same architecture.&lt;/p&gt;
&lt;h3&gt;Other Notables&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Materialize:&lt;/strong&gt; A streaming database that outputs real-time materialized views, often targeting data lakes and warehouses as sinks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ksqlDB:&lt;/strong&gt; Kafka-native SQL for defining streaming transformations, which can also materialize tables into downstream lakehouse storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Flink:&lt;/strong&gt; Still the backbone of many custom streaming-to-lakehouse pipelines, powering advanced transformations before committing results to Iceberg, Hudi, or Delta.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; Streaming is no longer bolted onto the lakehouse; it is &lt;strong&gt;embedded&lt;/strong&gt;. Whether through Kafka, Redpanda, Pulsar, Flink, or streaming databases like RisingWave and Materialize, streams now flow directly into transactional tables. The result is a lakehouse where batch and real-time are not two separate worlds but a single, unified system delivering always-fresh data.&lt;/p&gt;
&lt;h2&gt;Lakehouse Catalogs: Architecture, Compatibility &amp;amp; When to Use Which&lt;/h2&gt;
&lt;p&gt;A &lt;strong&gt;lakehouse catalog&lt;/strong&gt; is the control plane for your open tables: it tracks metadata locations and permissions, and exposes standard APIs to every engine. Below is a concise, practitioner-focused map of today’s major options and how they fit into a multi-engine, multi-cloud lakehouse.&lt;/p&gt;
&lt;h3&gt;Apache Polaris (Incubating)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An open-source, fully featured &lt;strong&gt;Apache Iceberg REST&lt;/strong&gt; catalog designed for vendor-neutral, multi-engine interoperability. Backed by multiple vendors and born from a cross-industry push to standardize the Iceberg catalog layer.&lt;br&gt;
&lt;strong&gt;Where it shines:&lt;/strong&gt; Teams standardizing on &lt;strong&gt;Iceberg&lt;/strong&gt; who want a portable, community-governed catalog that Spark, Flink, Trino, Dremio, StarRocks/Doris can all use via the Iceberg REST API.&lt;br&gt;
&lt;strong&gt;Notable:&lt;/strong&gt; Open governance and REST-by-default avoid lock-in and simplify multi-engine access. It can also federate other catalogs, with support for federating other table sources coming soon.&lt;/p&gt;
&lt;h3&gt;Apache Gravitino (Incubating)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A &lt;strong&gt;geo-distributed, federated metadata lake&lt;/strong&gt; that manages metadata &lt;strong&gt;in place&lt;/strong&gt; across heterogeneous sources (file stores, RDBMS, streams) and exposes a unified view to engines like Spark/Trino/Flink.&lt;br&gt;
&lt;strong&gt;Where it shines:&lt;/strong&gt; Hybrid/multi-cloud estates with multiple catalogs and sources that need one governance and discovery layer without migrations.&lt;br&gt;
&lt;strong&gt;Notable:&lt;/strong&gt; “Catalog of catalogs” approach; can present Iceberg/Hive/Paimon/Hudi catalogs under one umbrella.&lt;/p&gt;
&lt;h3&gt;AWS Glue Data Catalog (+ Lake Formation)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; AWS’s managed Hive-compatible catalog with &lt;strong&gt;first-party governance&lt;/strong&gt; (Lake Formation) and native support for &lt;strong&gt;Iceberg/Delta/Hudi&lt;/strong&gt; tables in S3, consumed by Athena, EMR, Redshift Spectrum, and Glue jobs.&lt;br&gt;
&lt;strong&gt;Where it shines:&lt;/strong&gt; All-in AWS lakehouses needing centralized metadata and fine-grained access control enforced across AWS analytics services.&lt;br&gt;
&lt;strong&gt;Notable:&lt;/strong&gt; Managed, integrated, and convenient, but cloud-specific by design.&lt;/p&gt;
&lt;h3&gt;Microsoft OneLake Catalog (Fabric)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; The &lt;strong&gt;central catalog&lt;/strong&gt; for Microsoft Fabric’s “OneLake”, a tenant-wide, Delta-native lake with unified discovery (“Explore”) and governance (“Govern”) experiences.&lt;br&gt;
&lt;strong&gt;Where it shines:&lt;/strong&gt; Fabric-centric stacks that want a single catalog for Spark, SQL, Power BI, and Real-Time Analytics over &lt;strong&gt;Delta&lt;/strong&gt; tables in ADLS/OneLake.&lt;br&gt;
&lt;strong&gt;Notable:&lt;/strong&gt; Deeply integrated SaaS experience; shortcuts/mirroring help connect external sources, but it’s Azure/Fabric-scoped.&lt;/p&gt;
&lt;h3&gt;Google BigLake (Metastore + Iceberg)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; Google’s open lakehouse layer: &lt;strong&gt;BigLake Metastore&lt;/strong&gt; catalogs &lt;strong&gt;Iceberg&lt;/strong&gt; tables on GCS; BigQuery reads them natively while Spark/Flink and other engines use the &lt;strong&gt;Iceberg REST&lt;/strong&gt; interface.&lt;br&gt;
&lt;strong&gt;Where it shines:&lt;/strong&gt; GCP stacks wanting warehouse-grade operations (BigQuery) over &lt;strong&gt;open Iceberg tables&lt;/strong&gt; stored in customer buckets with multi-engine access.&lt;br&gt;
&lt;strong&gt;Notable:&lt;/strong&gt; Managed table maintenance and unified governance via Dataplex/BigLake; Iceberg-first approach.&lt;/p&gt;
&lt;h3&gt;Project Nessie&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A &lt;strong&gt;Git-like, transactional catalog&lt;/strong&gt; for data lakes that adds &lt;strong&gt;branches, tags, time-travel, and cross-table commits&lt;/strong&gt; on top of Iceberg.&lt;br&gt;
&lt;strong&gt;Where it shines:&lt;/strong&gt; Teams needing dev/test isolation, reproducibility, or multi-table atomic commits in an &lt;strong&gt;Iceberg&lt;/strong&gt; lakehouse.&lt;br&gt;
&lt;strong&gt;Notable:&lt;/strong&gt; Works with Spark, Flink, Trino, Dremio; deploy anywhere (K8s/containers). Complements standard catalogs with versioning semantics.&lt;/p&gt;
&lt;h3&gt;Unity Catalog (Open Source)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; An &lt;strong&gt;open-sourced&lt;/strong&gt; universal catalog for data &amp;amp; AI assets with multi-format (Delta, &lt;strong&gt;Iceberg&lt;/strong&gt; via REST/UniForm, files) and multi-engine ambitions; compatible with &lt;strong&gt;Hive Metastore API&lt;/strong&gt; and &lt;strong&gt;Iceberg REST&lt;/strong&gt;.&lt;br&gt;
&lt;strong&gt;Where it shines:&lt;/strong&gt; Enterprises seeking &lt;strong&gt;broad governance&lt;/strong&gt; (tables, files, functions, ML models) and consistent policies across engines/clouds.
&lt;strong&gt;Notable:&lt;/strong&gt; Recent updates added external engine read GA and write preview for Iceberg via REST, expanding interoperability.&lt;/p&gt;
&lt;h3&gt;Lakekeeper&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A lightweight, &lt;strong&gt;Rust-based Apache Iceberg REST catalog&lt;/strong&gt; focused on speed, security (OIDC/OPA), and simplicity; Apache-licensed.
&lt;strong&gt;Where it shines:&lt;/strong&gt; Teams wanting a small, fast &lt;strong&gt;Iceberg&lt;/strong&gt; catalog they can self-host, integrate with Trino/Spark, and plug into modern authz.
&lt;strong&gt;Notable:&lt;/strong&gt; Ecosystem-first design; good fit for DIY open lakehouses and CI/CD-style deployments.&lt;/p&gt;
&lt;h3&gt;Quick Guide: Picking the Right Catalog&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open, vendor-neutral Iceberg core:&lt;/strong&gt; &lt;em&gt;Polaris&lt;/em&gt; (add &lt;em&gt;Nessie&lt;/em&gt; if you need Git-style branching &amp;amp; multi-table commits).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Federate many sources/catalogs across regions/clouds:&lt;/strong&gt; &lt;em&gt;Gravitino&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deep cloud integration:&lt;/strong&gt; &lt;em&gt;Glue/Lake Formation&lt;/em&gt; (AWS), &lt;em&gt;OneLake Catalog&lt;/em&gt; (Azure/Fabric), &lt;em&gt;BigLake Metastore&lt;/em&gt; (GCP) or Dremio Catalog (Managed Polaris Service from Dremio).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broad data &amp;amp; AI governance (tables + files + models):&lt;/strong&gt; &lt;em&gt;Unity Catalog (OSS)&lt;/em&gt;; growing multi-engine support including Iceberg REST.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; For multi-engine &lt;strong&gt;Iceberg&lt;/strong&gt; lakehouses, a common pattern is: &lt;strong&gt;Polaris as the primary REST catalog&lt;/strong&gt; for engines, with &lt;strong&gt;Nessie&lt;/strong&gt; layered in when you need branches/isolated environments. Cloud-native teams may still register those tables in their cloud catalogs for service-level features (e.g., Athena/BigQuery/Power BI), but keep the &lt;strong&gt;source of truth open&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Lakehouse Optimization&lt;/h2&gt;
&lt;p&gt;Once a lakehouse is in production, the focus shifts from building to &lt;strong&gt;sustaining performance and efficiency at scale&lt;/strong&gt;. Without ongoing optimization, query times creep up, storage costs balloon, and data reliability weakens. The key is to manage both &lt;strong&gt;physical data layout&lt;/strong&gt; and &lt;strong&gt;metadata growth&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Compaction and Small File Management&lt;/h3&gt;
&lt;p&gt;Frequent batch loads and streaming pipelines often generate thousands of small Parquet or ORC files.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Query engines spend more time opening files than scanning data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Table formats support compaction actions, rewriting many small files into fewer large ones (hundreds of MBs to 1GB).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Examples:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg’s &lt;code&gt;rewriteDataFiles&lt;/code&gt; action merges small files efficiently (Dremio also has an &lt;code&gt;OPTIMIZE&lt;/code&gt; command for Iceberg tables).&lt;/li&gt;
&lt;li&gt;Delta Lake offers the &lt;code&gt;OPTIMIZE&lt;/code&gt; command (with Z-Ordering for clustering).&lt;/li&gt;
&lt;li&gt;Hudi provides asynchronous background compaction for MoR tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
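&lt;p&gt;At its core, a compaction planner is a bin-packing step: group small files into batches near the target output size, then rewrite each batch as one file. The sketch below is illustrative only (hypothetical file names and sizes, a 512&amp;nbsp;MB target); real actions like Iceberg&apos;s &lt;code&gt;rewriteDataFiles&lt;/code&gt; add filtering, sorting, and parallel rewrites on top of this idea.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def plan_compaction(files, target_bytes=512 * 1024 * 1024):
    # Greedily pack (name, size) pairs into bins of at most target_bytes,
    # largest files first; each bin becomes one rewritten output file.
    bins, current, current_size = [], [], 0
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size &gt; target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Four small files (sizes in MB for readability, target of 512):
# plan_compaction([(&apos;a&apos;, 400), (&apos;b&apos;, 300), (&apos;c&apos;, 200), (&apos;d&apos;, 100)], target_bytes=512)
# returns [[&apos;a&apos;], [&apos;b&apos;, &apos;c&apos;], [&apos;d&apos;]]
&lt;/code&gt;&lt;/pre&gt;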
&lt;h3&gt;Snapshot Expiration and Metadata Cleanup&lt;/h3&gt;
&lt;p&gt;Modern formats keep snapshots for time travel, but unchecked, these create &lt;strong&gt;metadata bloat&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; &lt;code&gt;expireSnapshots&lt;/code&gt; safely removes old snapshots and associated data files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; &lt;code&gt;VACUUM&lt;/code&gt; cleans up unreferenced files after a retention period.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Timeline service supports configurable retention of commits and delta logs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Regular cleanup keeps both storage and query planning efficient.&lt;/p&gt;
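&lt;p&gt;The retention rule behind these commands can be sketched in a few lines. This is a simplified model, not any engine&apos;s actual implementation: snapshot timestamps are plain day numbers, and the defaults are illustrative.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def snapshots_to_expire(snapshots, now, max_age_days=7, min_to_keep=1):
    # Return snapshot ids safe to expire: older than the retention window,
    # while always retaining the min_to_keep most recent snapshots.
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)  # newest first
    keep = {sid for sid, _ in ordered[:min_to_keep]}
    keep |= {sid for sid, ts in ordered if ts &gt;= now - max_age_days}
    return [sid for sid, ts in ordered if sid not in keep]

# Snapshots taken on days 0, 5, and 9; on day 10 with a 7-day window,
# only the day-0 snapshot is expired:
# snapshots_to_expire([(1, 0), (2, 5), (3, 9)], now=10) returns [1]
&lt;/code&gt;&lt;/pre&gt;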
&lt;h3&gt;Partitioning and Clustering&lt;/h3&gt;
&lt;p&gt;Good partition design reduces data scanned per query.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Hidden partitions abstract complexity away from end users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta:&lt;/strong&gt; Z-Ordering clusters data across dimensions for multidimensional pruning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Can cluster records within files to optimize MoR query performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Partition evolution (changing the partition strategy over time without breaking old data) is now supported by most formats and prevents partition-layout rigidity.&lt;/p&gt;
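&lt;p&gt;Hidden partitioning works because partition values are &lt;em&gt;derived&lt;/em&gt; from columns by declared transforms, so users filter on raw columns and the engine maps those filters onto partitions. A minimal Python sketch of the idea (the bucket hash here is a stand-in; the Iceberg spec mandates a 32-bit Murmur3 hash):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from datetime import date

def day_transform(d):
    # Days since the Unix epoch, like Iceberg&apos;s day() transform
    return (d - date(1970, 1, 1)).days

def bucket_transform(value, n_buckets):
    # Simplified; the Iceberg spec specifies a Murmur3-based hash
    return hash(value) % n_buckets

# A row&apos;s partition tuple is computed, never supplied by the writer:
row = {&apos;event_ts&apos;: date(2026, 3, 8), &apos;user_id&apos;: 42}
partition = (day_transform(row[&apos;event_ts&apos;]),
             bucket_transform(row[&apos;user_id&apos;], 16))
&lt;/code&gt;&lt;/pre&gt;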
&lt;h3&gt;Query Acceleration&lt;/h3&gt;
&lt;p&gt;Beyond storage optimization, &lt;strong&gt;query acceleration&lt;/strong&gt; techniques deliver speed at scale.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflections and materialized views&lt;/strong&gt; in platforms like Dremio provide always-fresh, cache-like performance boosts without manual tuning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column stats and bloom filters&lt;/strong&gt; stored in metadata allow engines to skip files entirely when filters exclude them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vectorized execution and Arrow-based memory models&lt;/strong&gt; reduce CPU costs across query engines.&lt;/li&gt;
&lt;/ul&gt;
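&lt;p&gt;File skipping with column stats reduces to a range-overlap check: a file can be pruned whenever its min/max range for a column cannot satisfy the filter. A toy sketch (the file names and stats below are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def files_to_scan(file_stats, column, lower, upper):
    # Keep files whose [min, max] for the column overlaps [lower, upper]
    return [path for path, stats in file_stats.items()
            if stats[column][1] &gt;= lower and stats[column][0] &lt;= upper]

stats = {
    &apos;f1.parquet&apos;: {&apos;order_id&apos;: (1, 100)},
    &apos;f2.parquet&apos;: {&apos;order_id&apos;: (101, 200)},
    &apos;f3.parquet&apos;: {&apos;order_id&apos;: (201, 300)},
}
# WHERE order_id BETWEEN 150 AND 250 only needs to open f2 and f3:
# files_to_scan(stats, &apos;order_id&apos;, 150, 250) returns [&apos;f2.parquet&apos;, &apos;f3.parquet&apos;]
&lt;/code&gt;&lt;/pre&gt;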
&lt;h3&gt;Format-Specific Optimizations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Manifest merging, hidden partitioning, and metadata caching.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta:&lt;/strong&gt; Frequent checkpoints and data-skipping indexes for file pruning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Incremental queries for consuming only new changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Paimon:&lt;/strong&gt; Continuous compaction to reconcile streaming write amplification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; Lakehouse optimization is not a one-off task; it&apos;s an ongoing discipline. By managing file sizes, pruning metadata, evolving partitions, and using acceleration features, organizations keep performance predictable and costs controlled, even as data volumes and workloads scale into 2026.&lt;/p&gt;
&lt;h2&gt;The Intelligent Data Lakehouse Built for Agentic AI with Dremio&lt;/h2&gt;
&lt;p&gt;By the end of 2025, the conversation about data platforms shifted from “how do we manage data?” to “how do we make data &lt;strong&gt;intelligent and AI-ready&lt;/strong&gt;?” This is where the &lt;strong&gt;intelligent lakehouse&lt;/strong&gt; comes in, and where Dremio stands out as the reference implementation.&lt;/p&gt;
&lt;h3&gt;From Static Analytics to Agentic Workloads&lt;/h3&gt;
&lt;p&gt;Traditional BI queries are predictable: weekly reports, dashboards, and KPIs. AI-driven workloads are not. &lt;strong&gt;Agentic AI systems&lt;/strong&gt; (large language models and autonomous agents) generate dynamic, ad-hoc queries that span datasets in unpredictable ways. This requires:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Consistent low-latency responses.&lt;/li&gt;
&lt;li&gt;A platform that can optimize itself without human intervention.&lt;/li&gt;
&lt;li&gt;Seamless integration between structured data, semantic meaning, and AI agents.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Dremio as the Intelligent Lakehouse&lt;/h3&gt;
&lt;p&gt;Dremio is more than a query engine; it’s a &lt;strong&gt;self-optimizing lakehouse platform&lt;/strong&gt; built natively on Apache Iceberg and Arrow. Key capabilities include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflections:&lt;/strong&gt; Always-fresh materializations that accelerate queries automatically. Unlike traditional materialized views, reflections are invisible to end users, the optimizer decides when to use them, making acceleration adaptive to changing workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic Layer:&lt;/strong&gt; A unified place to define datasets, metrics, and business concepts. This ensures that whether it’s an analyst writing SQL or an AI agent generating queries, results remain consistent and governed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI-Ready APIs:&lt;/strong&gt; Through Arrow Flight and REST endpoints, Dremio streams data directly into Python, notebooks, or AI frameworks with zero-copy efficiency. This bridges the gap between analytics and machine learning pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open Standards:&lt;/strong&gt; By embracing Iceberg, Polaris (for catalogs), and Arrow, Dremio ensures interoperability, your AI agents or external engines can interact with the same governed data without lock-in.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MCP Server:&lt;/strong&gt; Together, these capabilities let agentic AI applications connect to Dremio through Dremio&apos;s MCP server, enabling agentic analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why This Matters for Agentic AI&lt;/h3&gt;
&lt;p&gt;AI agents thrive on &lt;strong&gt;autonomy and adaptability&lt;/strong&gt;. They need a platform that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Handles &lt;strong&gt;ever-changing queries&lt;/strong&gt; without brittle pre-optimizations.&lt;/li&gt;
&lt;li&gt;Keeps acceleration aligned with shifting patterns (autonomous reflections).&lt;/li&gt;
&lt;li&gt;Provides &lt;strong&gt;governed access&lt;/strong&gt; so that AI doesn’t hallucinate unauthorized or inconsistent definitions of metrics.&lt;/li&gt;
&lt;li&gt;Scales seamlessly from small exploratory prompts to massive training-data extractions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In short:&lt;/strong&gt; Dremio delivers the intelligent lakehouse, a platform that not only stores and serves data but actively &lt;strong&gt;adapts to how humans and AI consume it&lt;/strong&gt;. As agentic AI moves from hype to everyday practice in 2026, this intelligence layer will be the key to transforming raw data into reliable, actionable, and AI-ready insights.&lt;/p&gt;
&lt;h2&gt;Python for the Lakehouse&lt;/h2&gt;
&lt;p&gt;Python has become the lingua franca of modern data engineering and data science, and the lakehouse ecosystem is no exception. By 2026, a rich set of Python-first tools and frameworks have emerged that make it easier to ingest, process, analyze, and serve data directly from open table formats like Apache Iceberg, Delta, and Hudi. These tools not only enable lightweight experimentation but also power production-grade pipelines that rival traditional big data stacks.&lt;/p&gt;
&lt;h3&gt;DuckDB&lt;/h3&gt;
&lt;p&gt;Often described as the “SQLite for analytics,” &lt;strong&gt;DuckDB&lt;/strong&gt; is an in-process analytical database that excels at local workloads:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Direct Parquet &amp;amp; Iceberg Reads:&lt;/strong&gt; DuckDB can query Parquet files and integrate with Iceberg catalogs, making it a natural fit for small-to-medium lakehouse use cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speed:&lt;/strong&gt; Its vectorized execution engine makes it extremely fast for analytical queries on a single machine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python Integration:&lt;/strong&gt; Native bindings allow seamless use within notebooks or Python apps.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;DuckDB has become the go-to for prototyping, ad hoc exploration, and embedding analytics directly into applications.&lt;/p&gt;
&lt;h3&gt;Dask&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Dask&lt;/strong&gt; is a parallel computing framework for Python that scales workflows from laptops to clusters.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Flexible API:&lt;/strong&gt; Works with familiar NumPy, pandas, and scikit-learn APIs while distributing workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lakehouse Integration:&lt;/strong&gt; Reads and writes Parquet, and combined with Iceberg connectors, it enables scalable transformations on lakehouse data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ecosystem Fit:&lt;/strong&gt; Useful for machine learning preprocessing and large-scale data transformations where Spark might be overkill.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dask democratizes distributed compute for teams already invested in Python.&lt;/p&gt;
&lt;h3&gt;Daft&lt;/h3&gt;
&lt;p&gt;A newer entrant, &lt;strong&gt;Daft&lt;/strong&gt; positions itself as a distributed data processing engine optimized for AI and ML workloads.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Arrow-Native:&lt;/strong&gt; Built on Apache Arrow for fast columnar in-memory processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible Backends:&lt;/strong&gt; Runs locally or on clusters, supporting both CPUs and GPUs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lakehouse Ready:&lt;/strong&gt; Reads directly from Parquet and Iceberg sources, enabling high-performance pipelines that integrate analytics and ML training.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Daft is gaining traction for teams that want a modern, Pythonic alternative to Spark for big data and AI-centric workflows.&lt;/p&gt;
&lt;h3&gt;Bauplan&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Bauplan Labs&lt;/strong&gt; brings a &lt;em&gt;serverless, Python-first lakehouse&lt;/em&gt; approach.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pipeline-as-Code:&lt;/strong&gt; Data pipelines are written in Python and executed in a serverless runtime that scales automatically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control for Data:&lt;/strong&gt; Bauplan integrates Iceberg tables with Git-like branching via catalogs like Nessie, making schema and data versioning first-class features.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developer Experience:&lt;/strong&gt; With Arrow under the hood, Bauplan emphasizes reproducibility, modular pipelines, and minimal infrastructure overhead.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Bauplan is designed for teams that want the power of the lakehouse without the complexity of managing heavy infrastructure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In practice:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DuckDB&lt;/strong&gt; is the Swiss Army knife for local analytics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dask&lt;/strong&gt; scales familiar Python workflows across clusters.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Daft&lt;/strong&gt; brings Arrow-native distributed compute optimized for AI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bauplan&lt;/strong&gt; simplifies pipeline execution with a serverless lakehouse model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Together with libraries like Polars, Ibis, and SQLFrame, these tools give Python developers an end-to-end toolkit for building, maintaining, and consuming modern data lakehouses.&lt;/p&gt;
&lt;h2&gt;Graphs in the Data Lakehouse with PuppyGraph&lt;/h2&gt;
&lt;p&gt;While the data lakehouse excels at tabular and relational analytics, many real-world problems are &lt;strong&gt;graph-shaped&lt;/strong&gt;: fraud rings, identity networks, supply chains, lineage tracking, and recommendation systems. Traditionally, these problems required loading data into a &lt;strong&gt;specialized graph database&lt;/strong&gt;, an extra layer of ETL and storage that added cost and complexity. &lt;strong&gt;PuppyGraph&lt;/strong&gt; changes this equation by bringing graph analytics &lt;strong&gt;directly into the lakehouse.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;What is PuppyGraph?&lt;/h3&gt;
&lt;p&gt;PuppyGraph is a &lt;strong&gt;cloud-native graph engine&lt;/strong&gt; designed to run on top of existing data in your lakehouse. Instead of requiring a proprietary graph database, PuppyGraph lets you &lt;strong&gt;query your Iceberg, Delta, Hudi, or Hive tables as a graph&lt;/strong&gt;. It connects directly to open table formats, relational databases, and warehouses, automatically sharding and scaling queries without duplicating data. This means you can turn your existing datasets into a &lt;strong&gt;graph model in minutes&lt;/strong&gt;, with no ETL.&lt;/p&gt;
&lt;h3&gt;Integration with the Lakehouse&lt;/h3&gt;
&lt;p&gt;PuppyGraph integrates seamlessly with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; (including REST catalogs like Tabular or Polaris)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake and Apache Hudi&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hive Metastore and AWS Glue&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databases and Warehouses&lt;/strong&gt; such as PostgreSQL, MySQL, Redshift, BigQuery, and DuckDB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each source is treated as a &lt;em&gt;catalog&lt;/em&gt;. PuppyGraph lets you define a &lt;strong&gt;graph schema&lt;/strong&gt; across one or many catalogs, effectively federating multiple data sources into a single graph. For example, you can link customer nodes in PostgreSQL with transaction edges in Iceberg, &lt;strong&gt;all without moving the data.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;Querying Graphs at Scale&lt;/h3&gt;
&lt;p&gt;Because PuppyGraph queries &lt;strong&gt;directly against Parquet-backed tables&lt;/strong&gt;, you can run multi-hop traversals and graph algorithms over your lakehouse data. It supports popular graph query languages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gremlin (Apache TinkerPop)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;openCypher&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This ensures compatibility with existing graph tooling and reduces the learning curve. Performance is optimized for &lt;strong&gt;large, complex traversals&lt;/strong&gt;: PuppyGraph has demonstrated &lt;strong&gt;6-hop traversals over hundreds of millions of edges in under a second&lt;/strong&gt;. Cached mode allows even faster repeated queries, often surpassing the performance of traditional graph databases.&lt;/p&gt;
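&lt;p&gt;Conceptually, a multi-hop traversal over an edge table is a breadth-first search. The sketch below uses toy data and plain Python (it is not PuppyGraph&apos;s execution model) to show the shape of such a query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from collections import deque

def k_hop_neighbors(edges, start, k):
    # All vertices reachable from `start` in at most k hops (directed BFS)
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

# Toy transaction edges: a pays b, b pays c, c pays d.
# k_hop_neighbors([(&apos;a&apos;, &apos;b&apos;), (&apos;b&apos;, &apos;c&apos;), (&apos;c&apos;, &apos;d&apos;)], &apos;a&apos;, 2)
# returns {&apos;b&apos;, &apos;c&apos;}
&lt;/code&gt;&lt;/pre&gt;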
&lt;h3&gt;Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fraud Detection:&lt;/strong&gt; Traverse transaction graphs in real time to uncover hidden fraud rings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cybersecurity:&lt;/strong&gt; Model logins, access patterns, and network flows as a graph to detect threats.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supply Chain Optimization:&lt;/strong&gt; Connect suppliers, shipments, and logistics into a graph for bottleneck analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customer 360:&lt;/strong&gt; Combine relational and behavioral data into a graph to better understand customer journeys.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graph + AI:&lt;/strong&gt; PuppyGraph supports &lt;strong&gt;Graph RAG (Retrieval Augmented Generation)&lt;/strong&gt;, enabling LLMs and agents to query structured relationships for better context and reasoning.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why It Matters&lt;/h3&gt;
&lt;p&gt;By &lt;strong&gt;plugging directly into the lakehouse&lt;/strong&gt;, PuppyGraph removes the wall between tabular and graph analytics. Data engineers and architects can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Avoid &lt;strong&gt;data duplication and ETL pipelines&lt;/strong&gt; into separate graph stores.&lt;/li&gt;
&lt;li&gt;Keep governance and security consistent via existing catalogs.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;SQL, BI, and graph queries side-by-side&lt;/strong&gt; on the same data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence, PuppyGraph makes the lakehouse not just the foundation for relational analytics and AI, but also a &lt;strong&gt;native home for graph workloads&lt;/strong&gt;, all with the same open formats and scalable storage.&lt;/p&gt;
&lt;h2&gt;Edge Inference for the Lakehouse: Spice AI&lt;/h2&gt;
&lt;p&gt;As organizations embrace AI-driven applications, the &lt;strong&gt;edge&lt;/strong&gt; has become a critical deployment target. Instead of sending all data to centralized clusters, inference can increasingly happen &lt;strong&gt;close to where data is generated&lt;/strong&gt;: IoT devices, factories, mobile applications, or regional data centers. The lakehouse, traditionally viewed as a central hub, is now extending outward. Platforms like &lt;strong&gt;Spice AI&lt;/strong&gt; make this possible.&lt;/p&gt;
&lt;h3&gt;Why Edge Inference Matters&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Latency:&lt;/strong&gt; Inference needs to happen in milliseconds, not seconds. Shipping every query to the cloud adds unacceptable delays for use cases like predictive maintenance or fraud detection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Processing locally reduces bandwidth and cloud compute costs, especially when dealing with high-volume sensor or event data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resilience:&lt;/strong&gt; Edge inference continues to function even with intermittent network connectivity, syncing back to the lakehouse when available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privacy &amp;amp; Compliance:&lt;/strong&gt; Processing data locally helps meet regulatory requirements by minimizing the movement of sensitive information.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Spice AI at the Edge&lt;/h3&gt;
&lt;p&gt;Spice AI positions itself as an &lt;strong&gt;operational data lakehouse&lt;/strong&gt; tailored for real-time and AI workloads. At the edge, this means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Federated Querying with DataFusion:&lt;/strong&gt; Spice uses the Rust-based DataFusion engine (part of the Arrow ecosystem) to execute high-performance queries locally. This allows lightweight nodes to join, filter, and aggregate data directly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector + Relational Search:&lt;/strong&gt; Spice combines vector search (for embeddings) with SQL-style queries. This means an edge application can run both semantic AI lookups and structured analytics in one step.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lightweight Runtimes:&lt;/strong&gt; Spice can run in containers or edge environments, consuming a small footprint while still supporting open table formats like Iceberg, Delta, and Hudi.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid Sync:&lt;/strong&gt; Results and inferences can be materialized locally, then synchronized back to the central lakehouse when connectivity is restored, ensuring global consistency without sacrificing local responsiveness.&lt;/li&gt;
&lt;/ul&gt;
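&lt;p&gt;The hybrid-sync pattern boils down to a local buffer that always accepts writes and flushes to the central lakehouse when the link is up. A minimal sketch, where the &lt;code&gt;upload&lt;/code&gt; callable is a hypothetical stand-in for the real sync mechanism:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;class EdgeBuffer:
    def __init__(self, upload):
        self.pending = []
        self.upload = upload  # callable that ships a batch to the lakehouse

    def record(self, event):
        self.pending.append(event)  # always succeeds, even offline

    def sync(self, online):
        # Flush buffered events only when connectivity is available
        if online and self.pending:
            self.upload(self.pending)
            self.pending = []

shipped = []
buf = EdgeBuffer(shipped.extend)
buf.record({&apos;sensor&apos;: 7, &apos;anomaly&apos;: True})
buf.sync(online=False)   # offline: the event stays buffered locally
buf.sync(online=True)    # back online: the event lands centrally
&lt;/code&gt;&lt;/pre&gt;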
&lt;h3&gt;Example Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Manufacturing IoT:&lt;/strong&gt; Edge devices monitor sensor streams, detect anomalies with on-device inference, and sync flagged events to the lakehouse for broader analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retail:&lt;/strong&gt; In-store applications recommend products in real time based on customer behavior while syncing aggregated insights centrally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Telecom/5G:&lt;/strong&gt; Local edge inference supports real-time network optimization while global models are trained and governed in the lakehouse.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;In summary:&lt;/strong&gt; Edge inference extends the reach of the lakehouse from the cloud to the edge, enabling AI applications to be both &lt;strong&gt;real-time and governed&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;DuckLake: Simplifying Lakehouse Metadata with SQL&lt;/h2&gt;
&lt;p&gt;While formats like Iceberg, Delta, and Hudi advanced the lakehouse by bringing ACID transactions to data lakes, they also introduced operational complexity: JSON manifests, Avro metadata files, separate catalog services, and eventual consistency challenges. &lt;strong&gt;DuckLake&lt;/strong&gt; takes a fresh approach by asking a simple question: &lt;em&gt;what if the entire metadata layer was just stored in a relational database?&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;What is DuckLake?&lt;/h3&gt;
&lt;p&gt;DuckLake is a new open table format developed by the DuckDB team. Its core idea is to &lt;strong&gt;move all catalog and table metadata into a SQL database&lt;/strong&gt;, while keeping table data as Parquet files in object storage or local filesystems. This means no manifest lists, no Hive Metastore, and no extra catalog API services—just SQL tables that track schemas, snapshots, and file pointers.&lt;/p&gt;
&lt;h3&gt;Architecture&lt;/h3&gt;
&lt;p&gt;DuckLake splits the lakehouse into two layers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Catalog Database:&lt;/strong&gt; Any ACID-compliant database (DuckDB, SQLite, Postgres, MySQL, or even MotherDuck) stores all metadata—schemas, table versions, statistics, and transactions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Layer:&lt;/strong&gt; Standard Parquet files (and optional delete files) stored in directories or S3 buckets hold the actual table data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This design yields &lt;strong&gt;fast commits&lt;/strong&gt; (a single SQL transaction to update metadata), &lt;strong&gt;strong consistency&lt;/strong&gt; (no reliance on eventually consistent file stores), and &lt;strong&gt;simpler operations&lt;/strong&gt; (just back up or replicate the metadata DB). It also enables advanced features like &lt;strong&gt;multi-table transactions&lt;/strong&gt;, &lt;strong&gt;time travel&lt;/strong&gt;, and &lt;strong&gt;transactional schema changes&lt;/strong&gt; without a complex stack.&lt;/p&gt;
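&lt;p&gt;The design is easy to demonstrate with nothing but Python&apos;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The schema below is illustrative, not DuckLake&apos;s actual catalog schema, but it shows the core property: a commit is a single ACID transaction over ordinary SQL tables.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import sqlite3

con = sqlite3.connect(&apos;:memory:&apos;)
con.executescript(
    &apos;CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, committed_at TEXT);&apos;
    &apos;CREATE TABLE data_files (snapshot_id INTEGER, path TEXT);&apos;)

# One atomic commit: a new snapshot plus the Parquet files it references
with con:
    con.execute(&quot;INSERT INTO snapshots VALUES (1, &apos;2026-03-08&apos;)&quot;)
    con.executemany(&apos;INSERT INTO data_files VALUES (1, ?)&apos;,
                    [(&apos;s3://bucket/part-0.parquet&apos;,),
                     (&apos;s3://bucket/part-1.parquet&apos;,)])

# Readers resolve a snapshot to its file list with plain SQL
files = [row[0] for row in con.execute(
    &apos;SELECT path FROM data_files WHERE snapshot_id = 1&apos;)]
&lt;/code&gt;&lt;/pre&gt;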
&lt;h3&gt;Integration with DuckDB&lt;/h3&gt;
&lt;p&gt;DuckLake ships as a DuckDB extension. Once installed, users can:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSTALL ducklake;
LOAD ducklake;
ATTACH &apos;ducklake:mycatalog.ducklake&apos; AS lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From there, you can create tables, insert, update, delete, and query with full ACID guarantees. Multiple DuckDB instances can share the same DuckLake if the catalog is in a multi-user database like Postgres, effectively making DuckDB “multiplayer” with a shared lakehouse.&lt;/p&gt;
&lt;h3&gt;Interoperability&lt;/h3&gt;
&lt;p&gt;DuckLake is its own format, but it’s designed to interoperate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Iceberg:&lt;/strong&gt; Parquet and delete files are compatible, and DuckLake can import Iceberg metadata directly, even preserving snapshot history.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delta:&lt;/strong&gt; DuckDB continues to support Delta Lake separately; data can be copied between Delta and DuckLake when needed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hudi:&lt;/strong&gt; Not natively supported yet, but Hudi’s Parquet files can be queried as plain Parquet.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This makes DuckLake a flexible companion in mixed-format environments and a potential bridge for migrating or experimenting.&lt;/p&gt;
&lt;h3&gt;Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Local &amp;amp; Embedded Lakehouses:&lt;/strong&gt; Run a mini data warehouse on your laptop with DuckDB + DuckLake, no heavy services required.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Small Team Data Warehouses:&lt;/strong&gt; Share a DuckLake catalog in Postgres for concurrent analytics across a team.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streaming &amp;amp; CDC:&lt;/strong&gt; Handle high-frequency small writes efficiently without metadata file bloat.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CI/CD Pipelines:&lt;/strong&gt; Spin up ephemeral lakehouses in tests with time travel and rollback for validation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Limitations &amp;amp; Roadmap&lt;/h3&gt;
&lt;p&gt;DuckLake is still young (v0.3 as of late 2025). At present:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Ecosystem support is centered on DuckDB/MotherDuck.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;No built-in fine-grained governance; relies on the underlying DB’s permissions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;No branching/merge semantics like Project Nessie, though time travel is supported.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Books on the Data Lakehouse and Open Table Formats&lt;/h2&gt;
&lt;p&gt;For data engineers and architects who want to go beyond blogs and documentation, books provide the depth and structured learning needed to master the lakehouse paradigm. Between 2023 and early 2026, O’Reilly, Manning, and Packt have released (or announced) a range of titles that cover the architecture, theory, and practice of the data lakehouse, including the major open table formats, Apache Iceberg, Delta Lake, and Apache Hudi.&lt;/p&gt;
&lt;h3&gt;O’Reilly Media&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/&quot;&gt;&lt;strong&gt;Apache Iceberg: The Definitive Guide – Data Lakehouse Functionality, Performance, and Scalability on the Data Lake&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Tomer Shiran, Jason Hughes, and Alex Merced&lt;/em&gt; (Jun 2024)&lt;br&gt;
Comprehensive deep dive into Apache Iceberg’s architecture, metadata model, features like partition evolution and time travel, and integrations across engines such as Spark, Flink, Trino, and Dremio.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/apache-polaris-the/9798341608139/&quot;&gt;&lt;strong&gt;Apache Polaris: The Definitive Guide – Enriching Apache Iceberg Lakehouse with a robust open-source catalog&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Alex Merced and Andrew Madson&lt;/em&gt; (Jun 2024)&lt;br&gt;
Comprehensive guide to Apache Polaris (incubating), the open source catalog designed for Apache Iceberg, the data lakehouse industry standard. Walks through the intricacies of Iceberg lakehouses and the pivotal role catalogs play in them.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/delta-lake-up/9781098139711/&quot;&gt;&lt;strong&gt;Delta Lake: Up and Running – Modern Data Lakehouse Architectures with Delta Lake&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Bennie Haelen and Dan Davis&lt;/em&gt; (Oct 2023)&lt;br&gt;
Introductory and practical guide to Delta Lake, covering ACID transactions, schema enforcement, time travel, and how to build reliable data pipelines that unify batch and streaming.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/delta-lake-the/9781098151010/&quot;&gt;&lt;strong&gt;Delta Lake: The Definitive Guide – Modern Data Lakehouse Architectures with Data Lakes&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu&lt;/em&gt; (Dec 2024)&lt;br&gt;
Written by core Delta Lake contributors, this book explores Delta’s transaction log, medallion architecture, deletion vectors, and advanced optimization strategies for enterprise-scale workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/practical-lakehouse-architecture/9781098156145/&quot;&gt;&lt;strong&gt;Practical Lakehouse Architecture: Designing and Implementing Modern Data Platforms at Scale&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Gaurav Ashok Thalpati&lt;/em&gt; (Aug 2024)&lt;br&gt;
A broad architectural guide to designing, implementing, and migrating to lakehouse platforms. Covers design layers, governance, catalogs, and security with a practical step-by-step framework.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.oreilly.com/library/view/apache-hudi-the/9781098173821/&quot;&gt;&lt;strong&gt;Apache Hudi: The Definitive Guide – Building Robust, Open, and High-Performance Lakehouses&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Shiyan Xu, Prashant Wason, Sudha Saktheeswaran, and Rebecca Bilbro&lt;/em&gt; (Forthcoming Dec 2025)&lt;br&gt;
Focuses on Hudi’s approach to incremental processing, upserts/deletes, clustering, and indexing. Demonstrates how to run production-ready lakehouses with streaming data ingestion.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Manning Publications&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0&quot;&gt;&lt;strong&gt;Architecting an Apache Iceberg Lakehouse&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Alex Merced&lt;/em&gt; (MEAP, 2025 – Forthcoming 2026)&lt;br&gt;
A hands-on, architecture-first guide to designing scalable Iceberg-based lakehouses. Covers all five layers (storage, table formats, ingestion, catalog, consumption) with exercises and real-world design trade-offs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Packt Publishing&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.packtpub.com/en-us/product/building-modern-data-applications-using-databricks-lakehouse-9781804617205&quot;&gt;&lt;strong&gt;Building Modern Data Applications Using Databricks Lakehouse&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Will Girten&lt;/em&gt; (Oct 2024)&lt;br&gt;
Practical guide to deploying end-to-end pipelines on Databricks Lakehouse using Delta Lake and Unity Catalog, including batch and streaming workflows, governance, and CI/CD.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.packtpub.com/en-us/product/engineering-lakehouses-with-open-table-formats-9781836207221&quot;&gt;&lt;strong&gt;Engineering Lakehouses with Open Table Formats&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Dipankar Mazumdar and Vinoth Govindarajan&lt;/em&gt; (Dec 2025)&lt;br&gt;
Covers Iceberg, Hudi, and Delta Lake together, focusing on how to choose between them, optimize tables, and build interoperable, vendor-agnostic architectures. Includes hands-on examples with Spark, Flink, and Trino.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.packtpub.com/en-us/product/data-engineering-with-databricks-cookbook-9781803246147&quot;&gt;&lt;strong&gt;Data Engineering with Databricks Cookbook&lt;/strong&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Pulkit Chadha&lt;/em&gt; (May 2024)&lt;br&gt;
Recipe-based approach to building data pipelines on Databricks, with step-by-step instructions for managing Delta Lake tables, handling streaming ingestion, orchestrating workflows, and applying Unity Catalog governance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Takeaway&lt;/h3&gt;
&lt;p&gt;Whether you’re looking for a &lt;strong&gt;deep dive into a specific format&lt;/strong&gt; (Iceberg, Delta, or Hudi) or a &lt;strong&gt;broader perspective on lakehouse architecture&lt;/strong&gt;, these titles form the essential reading list for data engineers and architects in 2026. They not only document the current state of the technology but also provide practical frameworks and best practices to implement reliable, scalable, and open lakehouses.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The journey from &lt;strong&gt;data warehouses&lt;/strong&gt; to &lt;strong&gt;data lakes&lt;/strong&gt; and finally to the &lt;strong&gt;data lakehouse&lt;/strong&gt; reflects one constant: organizations need a platform that balances &lt;strong&gt;trust, flexibility, and performance&lt;/strong&gt;. Warehouses gave us governance but lacked agility. Lakes gave us scale and freedom but sacrificed reliability. The lakehouse unites these worlds by layering open table formats, catalogs, and intelligent query engines on top of low-cost object storage.&lt;/p&gt;
&lt;p&gt;By 2025, this model matured from a promise into a proven architecture. With formats like &lt;strong&gt;Apache Iceberg, Delta Lake, Hudi, and Paimon&lt;/strong&gt;, data teams now have open standards for transactional data at scale. Streaming-first ingestion, autonomous optimization, and catalog-driven governance have become baseline requirements. Looking ahead to 2026, the lakehouse is no longer just a central repository; it extends outward to power &lt;strong&gt;real-time analytics, agentic AI, and even edge inference&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For data engineers and architects, the message is clear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adopt &lt;strong&gt;open table formats&lt;/strong&gt; to avoid lock-in and ensure interoperability.&lt;/li&gt;
&lt;li&gt;Embrace a &lt;strong&gt;layered architecture&lt;/strong&gt; that separates storage, metadata, ingestion, catalog, and consumption.&lt;/li&gt;
&lt;li&gt;Optimize continuously, through compaction, snapshot expiration, and acceleration features, so performance scales with data.&lt;/li&gt;
&lt;li&gt;Prepare for the future where &lt;strong&gt;AI workloads are not occasional but constant&lt;/strong&gt;, demanding a platform that is both intelligent and adaptive.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The lakehouse has become the backbone of modern data platforms. As you step into 2026, building on this foundation isn’t just a best practice; it’s the path to delivering data that is truly &lt;strong&gt;trusted, governed, and AI-ready&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Composable Analytics with Agents -  Leveraging Virtual Datasets and the Semantic Layer</title><link>https://iceberglakehouse.com/posts/2025-09-composable-analytics-with-agents/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-09-composable-analytics-with-agents/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Wed, 17 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=semantic_layer&amp;amp;utm_content=alexmerced&amp;amp;utm_term=semantic_layer&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=semantic_layer&amp;amp;utm_content=alexmerced&amp;amp;utm_term=semantic_layer&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=semantic_layer&amp;amp;utm_content=alexmerced&amp;amp;utm_term=semantic_layer&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0?utm_source=merced&amp;amp;utm_medium=affiliate&amp;amp;utm_campaign=book_merced&amp;amp;a_aid=merced&amp;amp;a_bid=7eac4151&quot;&gt;Purchase &amp;quot;Architecting an Apache Iceberg Lakehouse&amp;quot;&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The promise of AI in analytics isn’t just faster answers, it’s &lt;strong&gt;smarter, more flexible insights&lt;/strong&gt;. For that to happen, AI agents need not only access to data but also the ability to compose, extend, and recombine datasets on the fly. This is where Dremio’s &lt;strong&gt;semantic layer&lt;/strong&gt; and &lt;strong&gt;virtual datasets&lt;/strong&gt; come into play, providing the foundation for what AtScale calls &lt;em&gt;composable analytics&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;The Challenge: Static Models in a Dynamic World&lt;/h2&gt;
&lt;p&gt;Traditional analytics models are rigid. Business intelligence teams define metrics in dashboards or cubes, and changing them often requires IT involvement. This creates bottlenecks when business needs evolve, leaving AI agents with limited flexibility to adjust their workflows.&lt;/p&gt;
&lt;p&gt;For agentic AI, which thrives on &lt;strong&gt;iterative reasoning and adaptive workflows&lt;/strong&gt;, rigid models are a barrier.&lt;/p&gt;
&lt;h2&gt;Virtual Datasets: Building Blocks for Composable Analytics&lt;/h2&gt;
&lt;p&gt;Dremio addresses this challenge with &lt;strong&gt;virtual datasets (VDSs)&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No physical copies&lt;/strong&gt;: VDSs are views defined in the semantic layer, not duplicated data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Composable&lt;/strong&gt;: VDSs can be combined, extended, or refined into new virtual models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Governed&lt;/strong&gt;: Every dataset inherits security and lineage from the semantic layer.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agents interacting through Dremio’s MCP server can query these VDSs directly, creating new analytic combinations without breaking governance or requiring new pipelines.&lt;/p&gt;
&lt;h2&gt;Agents + MCP: Extending Models on Demand&lt;/h2&gt;
&lt;p&gt;With MCP exposing tools like &lt;em&gt;Run SQL Query&lt;/em&gt; and &lt;em&gt;Run Semantic Search&lt;/em&gt;, agents can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Discover governed VDSs in &lt;strong&gt;plain business language&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Combine datasets to answer multi-dimensional questions.&lt;/li&gt;
&lt;li&gt;Extend existing models with new calculations or filters.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, an agent could take a “Customer Revenue” VDS and extend it with a churn prediction metric, producing a new analytic model for marketing, all governed by Dremio’s semantic layer.&lt;/p&gt;
&lt;h2&gt;Composable Analytics Meets Composable Modeling&lt;/h2&gt;
&lt;p&gt;The AtScale community describes &lt;em&gt;composable analytics&lt;/em&gt; as the ability to assemble insights from modular building blocks. Dremio’s semantic layer aligns perfectly with this vision:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reusability&lt;/strong&gt;: Metrics and datasets defined once can be reused everywhere.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-functional consistency&lt;/strong&gt;: Finance, marketing, and operations share the same definitions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent empowerment&lt;/strong&gt;: AI systems don’t just query data — they can compose new insights dynamically.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This brings composability from the human analyst’s world into the AI agent’s world.&lt;/p&gt;
&lt;h2&gt;Real-World Benefits&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster iteration&lt;/strong&gt;: Agents adapt models to new questions without waiting for IT.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Democratized insights&lt;/strong&gt;: Business teams get answers in language they understand, grounded in governed metrics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-functional alignment&lt;/strong&gt;: Everyone — human or agent — works from the same semantic foundation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is analytics that are not only AI-ready but also &lt;strong&gt;flexible, governed, and consistent across the enterprise&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Composable analytics is the future of data-driven decision-making. By leveraging &lt;strong&gt;virtual datasets&lt;/strong&gt; and the &lt;strong&gt;semantic layer&lt;/strong&gt;, Dremio makes it possible for both humans and AI agents to build and extend insights in real time.&lt;/p&gt;
&lt;p&gt;With MCP providing the bridge and the semantic layer ensuring governance, enterprises can embrace a world where &lt;strong&gt;analytics are adaptive, modular, and truly agentic&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Endgame — Building an Autonomous Optimization Pipeline for Apache Iceberg</title><link>https://iceberglakehouse.com/posts/iceberg-autonomous-optimization-pipeline/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-autonomous-optimization-pipeline/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 16 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;The Endgame — Building an Autonomous Optimization Pipeline for Apache Iceberg&lt;/h1&gt;
&lt;p&gt;Over the past nine posts, we’ve walked through the strategies, techniques, and tools you can use to keep your Apache Iceberg tables optimized for performance, cost, and reliability. Now, it’s time to put it all together.&lt;/p&gt;
&lt;p&gt;In this final post of the series, we’ll explore how to build an &lt;strong&gt;autonomous optimization pipeline&lt;/strong&gt;—a system that intelligently monitors your Iceberg tables and triggers the right actions automatically, without manual intervention.&lt;/p&gt;
&lt;h2&gt;What Does Autonomous Optimization Look Like?&lt;/h2&gt;
&lt;p&gt;An autonomous pipeline for Iceberg optimization should:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Continuously monitor table metadata&lt;/li&gt;
&lt;li&gt;Detect symptoms of degradation (e.g., small files, bloated manifests)&lt;/li&gt;
&lt;li&gt;Dynamically trigger the right optimization actions&lt;/li&gt;
&lt;li&gt;Recover gracefully from failure&lt;/li&gt;
&lt;li&gt;Integrate seamlessly with ingestion and query operations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This makes your lakehouse &lt;strong&gt;self-healing&lt;/strong&gt;, scalable, and easier to maintain—especially across many datasets.&lt;/p&gt;
&lt;h2&gt;Core Components of the Pipeline&lt;/h2&gt;
&lt;h3&gt;1. &lt;strong&gt;Metadata Intelligence Layer&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Leverage Iceberg’s built-in metadata tables to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Analyze file sizes and counts&lt;/li&gt;
&lt;li&gt;Track snapshot growth&lt;/li&gt;
&lt;li&gt;Monitor partition health&lt;/li&gt;
&lt;li&gt;Flag layout drift (e.g., outdated sort orders or clustering)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example diagnostic query:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT partition, COUNT(*) AS file_count, AVG(file_size_in_bytes) AS avg_file_size
FROM my_table.files
GROUP BY partition
HAVING COUNT(*) &amp;gt; 20 AND AVG(file_size_in_bytes) &amp;lt; 128000000;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This layer becomes the decision-maker for whether compaction or cleanup is needed.&lt;/p&gt;
&lt;h3&gt;2. Orchestration Layer&lt;/h3&gt;
&lt;p&gt;Use a scheduling tool like Airflow, Dagster, or dbt Cloud to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run diagnostic checks on a schedule&lt;/li&gt;
&lt;li&gt;Execute Spark/Flink optimization jobs conditionally&lt;/li&gt;
&lt;li&gt;Log and track outcomes&lt;/li&gt;
&lt;li&gt;Handle retries and alerting&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A sample DAG might include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;check_small_files&lt;/code&gt; task&lt;/li&gt;
&lt;li&gt;&lt;code&gt;trigger_compaction&lt;/code&gt; task&lt;/li&gt;
&lt;li&gt;&lt;code&gt;expire_snapshots&lt;/code&gt; task&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rewrite_manifests&lt;/code&gt; task&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each can be run only if certain thresholds are met.&lt;/p&gt;
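&lt;p&gt;The threshold-gated decision logic can be sketched in plain Python, independent of any particular scheduler. The task names, thresholds, and stats fields below are illustrative, not a fixed API:&lt;/p&gt;

```python
# Sketch of threshold-gated task selection. In practice the stats would
# come from Iceberg metadata tables; the names and limits here are
# illustrative defaults only.

def plan_optimization_tasks(stats, max_files=20, min_avg_bytes=128_000_000,
                            max_snapshots=100, max_manifests=50):
    """Return which optimization actions a table currently needs."""
    tasks = []
    # Many small files in a partition: schedule compaction.
    if stats["file_count"] > max_files and min_avg_bytes > stats["avg_file_size"]:
        tasks.append("trigger_compaction")
    # Snapshot history has grown past the retention budget.
    if stats["snapshot_count"] > max_snapshots:
        tasks.append("expire_snapshots")
    # Planning slows down when manifests proliferate.
    if stats["manifest_count"] > max_manifests:
        tasks.append("rewrite_manifests")
    return tasks

stats = {"file_count": 45, "avg_file_size": 8_000_000,
         "snapshot_count": 240, "manifest_count": 12}
print(plan_optimization_tasks(stats))  # ['trigger_compaction', 'expire_snapshots']
```

&lt;p&gt;Each task then maps to a conditional step in your orchestrator of choice.&lt;/p&gt;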
&lt;h3&gt;3. Execution Layer&lt;/h3&gt;
&lt;p&gt;Trigger physical optimizations using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Spark actions (RewriteDataFiles, ExpireSnapshots, RewriteManifests)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Flink background jobs (especially for streaming pipelines)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dremio OPTIMIZE and VACUUM&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All actions should be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Scoped to affected partitions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tuned for parallelism&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Capable of partial progress&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Observability and Logging&lt;/h3&gt;
&lt;p&gt;Feed metrics into dashboards and alerts using tools like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Prometheus + Grafana&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Datadog&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;CloudWatch&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Track:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Number of files compacted&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Snapshots expired&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Runtime per job&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Failed vs succeeded partitions&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This allows you to adjust thresholds and tuning parameters over time.&lt;/p&gt;
&lt;h3&gt;5. Storage Cleanup (GC)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;After snapshots are expired, unreferenced files need to be deleted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ensure cleanup happens after expiration jobs, not in parallel.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
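&lt;p&gt;The required ordering can be sketched as follows; both callables are stand-ins for engine actions such as snapshot expiration and orphan-file removal, and the result shapes are illustrative:&lt;/p&gt;

```python
# Sketch of cleanup ordering: orphan-file deletion must run only after
# snapshot expiration has committed, never concurrently with it.

def run_storage_cleanup(expire_snapshots, remove_orphans):
    """Expire snapshots first; only then is orphan deletion safe."""
    result = expire_snapshots()
    if not result.get("committed"):
        # Skip GC: files may still be referenced by live snapshots.
        return {"expired": result, "orphans": None}
    return {"expired": result, "orphans": remove_orphans()}

out = run_storage_cleanup(
    lambda: {"committed": True, "snapshots_removed": 12},
    lambda: {"files_deleted": 340},
)
print(out["orphans"])  # {'files_deleted': 340}
```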
&lt;h2&gt;Benefits of an Autonomous Pipeline&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Consistent Performance:&lt;/strong&gt; Tables stay fast without manual tuning&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Operational Efficiency:&lt;/strong&gt; No more ad hoc optimization jobs&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Works across 10 tables or 10,000 tables&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Governance-Ready:&lt;/strong&gt; All changes are tracked, repeatable, and policy-driven&lt;/p&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Iceberg&apos;s flexibility and rich metadata layer make it uniquely suited to autonomous data management. By combining:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Real-time metadata insight&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Targeted optimization strategies&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Smart orchestration&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Catalog-aware execution&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can build a lakehouse that optimizes itself—freeing your data team to focus on innovation, not maintenance.&lt;/p&gt;
&lt;h2&gt;Where to Go from Here&lt;/h2&gt;
&lt;p&gt;If you’ve followed this series from the beginning, you now have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A deep understanding of how Iceberg tables degrade&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tools to address compaction, clustering, and metadata bloat&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The blueprint for a modern, self-tuning optimization pipeline&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks for reading—and keep building faster, cleaner, and smarter Iceberg lakehouses.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Managing Large-Scale Optimizations — Parallelism, Checkpointing, and Fail Recovery</title><link>https://iceberglakehouse.com/posts/iceberg-large-scale-optimization/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-large-scale-optimization/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 09 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Managing Large-Scale Optimizations — Parallelism, Checkpointing, and Fail Recovery&lt;/h1&gt;
&lt;p&gt;When working with Apache Iceberg at scale, optimization jobs can become heavy and time-consuming. Rewriting thousands of files, scanning massive partitions, and coordinating metadata updates requires careful execution planning—especially in environments with limited compute or strict SLAs.&lt;/p&gt;
&lt;p&gt;In this post, we’ll look at strategies for making compaction and metadata cleanup operations &lt;strong&gt;scalable, resilient, and efficient&lt;/strong&gt;, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tuning parallelism&lt;/li&gt;
&lt;li&gt;Using partition pruning&lt;/li&gt;
&lt;li&gt;Applying checkpointing for long-running jobs&lt;/li&gt;
&lt;li&gt;Handling failures safely and automatically&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why Scaling Optimization Matters&lt;/h2&gt;
&lt;p&gt;As your Iceberg tables grow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File counts increase&lt;/li&gt;
&lt;li&gt;Partition cardinality rises&lt;/li&gt;
&lt;li&gt;Manifest files balloon&lt;/li&gt;
&lt;li&gt;Compaction jobs touch terabytes of data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without scaling strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Jobs may fail due to timeouts or memory errors&lt;/li&gt;
&lt;li&gt;Optimization may lag behind ingestion&lt;/li&gt;
&lt;li&gt;Query performance continues to degrade despite efforts&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;1. Leveraging Partition Pruning&lt;/h2&gt;
&lt;p&gt;Partition pruning ensures that only the parts of the table that need compaction are touched.&lt;/p&gt;
&lt;p&gt;Use metadata tables to target only problem areas:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT partition
FROM my_table.files
GROUP BY partition
HAVING COUNT(*) &amp;gt; 20 AND AVG(file_size_in_bytes) &amp;lt; 100000000;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then pass this list to a compaction job to limit the scope of the rewrite.&lt;/p&gt;
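&lt;p&gt;One hypothetical way to scope the rewrite is to turn that partition list into a predicate; the column name and values below are illustrative:&lt;/p&gt;

```python
# Sketch: turn a list of problem partitions (from the metadata query
# above) into a predicate that limits a compaction job's scope.

def partition_filter(column, values):
    """Build a predicate that scopes compaction to the given partitions."""
    quoted = ", ".join(repr(v) for v in values)
    return "{} IN ({})".format(column, quoted)

problem_partitions = ["2025-09-01", "2025-09-02"]
print(partition_filter("event_date", problem_partitions))
# event_date IN ('2025-09-01', '2025-09-02')
```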
&lt;h2&gt;2. Tuning Parallelism in Spark or Flink&lt;/h2&gt;
&lt;p&gt;Large optimization jobs should run with enough parallel tasks to distribute I/O and computation.&lt;/p&gt;
&lt;p&gt;In Spark:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt; to increase default parallelism.&lt;/li&gt;
&lt;li&gt;Tune executor memory and cores to handle larger partitions.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;.option(&amp;quot;partial-progress.enabled&amp;quot;, &amp;quot;true&amp;quot;)&lt;/code&gt; for better resilience in Iceberg actions.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;spark.conf.set(&amp;quot;spark.sql.shuffle.partitions&amp;quot;, &amp;quot;200&amp;quot;)

Actions.forTable(spark, table)
  .rewriteDataFiles()
  .option(&amp;quot;min-input-files&amp;quot;, &amp;quot;5&amp;quot;)
  .option(&amp;quot;partial-progress.enabled&amp;quot;, &amp;quot;true&amp;quot;)
  .execute()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In Flink:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use fine-grained task managers&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Enable incremental compaction and checkpointing&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;3. Incremental and Windowed Compaction&lt;/h2&gt;
&lt;p&gt;Don’t try to compact the entire table at once. Instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Group partitions into batches&lt;/li&gt;
&lt;li&gt;Use rolling windows (e.g., compact N partitions per hour)&lt;/li&gt;
&lt;li&gt;Resume from the last successfully compacted partition on failure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can build this logic into orchestration tools like Airflow or Dagster.&lt;/p&gt;
&lt;h2&gt;4. Checkpointing and Partial Progress&lt;/h2&gt;
&lt;p&gt;Iceberg supports partial progress mode in Spark:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;.option(&amp;quot;partial-progress.enabled&amp;quot;, &amp;quot;true&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows successfully compacted partitions to commit, even if others fail—making retries cheaper and safer.&lt;/p&gt;
&lt;p&gt;In Flink, this is handled more granularly via stateful streaming checkpointing.&lt;/p&gt;
&lt;h2&gt;5. Retry and Failover Strategies&lt;/h2&gt;
&lt;p&gt;Wrap compaction logic in robust retry mechanisms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use exponential backoff&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Separate retries by partition&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Alert on repeated failures for human intervention&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, in Airflow:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from datetime import timedelta

from airflow.operators.python import PythonOperator

PythonOperator(
    task_id=&amp;quot;compact_partition&amp;quot;,
    python_callable=run_compaction,  # your partition-scoped compaction function
    retries=3,
    retry_delay=timedelta(minutes=5),
)
&lt;/code&gt;&lt;/pre&gt;
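&lt;p&gt;Outside of Airflow, the same per-partition exponential backoff can be sketched directly; &lt;code&gt;compact&lt;/code&gt; is injected here so the control flow can be shown on its own:&lt;/p&gt;

```python
# Sketch of per-partition retries with exponential backoff. A real
# pipeline would call the engine's rewrite action; here it is injected.
import time

def compact_with_retries(partition, compact, attempts=3, base_delay=1.0,
                         sleep=time.sleep):
    """Retry a partition-level compaction with exponential backoff."""
    for attempt in range(attempts):
        try:
            return compact(partition)
        except Exception:
            if attempt + 1 == attempts:
                raise  # exhausted: surface for alerting / human review
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

calls = []

def flaky(partition):
    # Fails twice, then succeeds, to exercise the retry path.
    calls.append(partition)
    if len(calls) == 3:
        return "ok"
    raise RuntimeError("transient failure")

print(compact_with_retries("day=2025-09-03", flaky, sleep=lambda s: None))  # ok
```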
&lt;p&gt;Also consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Writing logs to object storage for audit&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Emitting metrics to Prometheus/Grafana for observability&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;6. Monitoring Job Health&lt;/h2&gt;
&lt;p&gt;Track:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Job duration&lt;/li&gt;
&lt;li&gt;Files rewritten vs. skipped&lt;/li&gt;
&lt;li&gt;Failed partitions&lt;/li&gt;
&lt;li&gt;Number of manifests reduced&lt;/li&gt;
&lt;li&gt;Snapshot size pre- and post-job&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These metrics help tune parameters and detect regressions over time.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Scaling Iceberg optimization jobs requires thoughtful execution planning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use metadata to limit scope&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tune parallelism to avoid resource waste&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use partial progress and checkpointing to survive failure&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Automate retries and monitor outcomes&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the final post of this series, we’ll bring it all together—showing how to build a fully autonomous optimization pipeline using orchestration, metadata triggers, and smart defaults.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Unlocking the Power of Agentic AI with Apache Iceberg and Dremio</title><link>https://iceberglakehouse.com/posts/2025-09-agentic-ai-dremio-apache-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-09-agentic-ai-dremio-apache-iceberg/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 05 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hubs.la/Q03GfY4f0?utm_source=merced&amp;amp;utm_medium=affiliate&amp;amp;utm_campaign=book_merced&amp;amp;a_aid=merced&amp;amp;a_bid=7eac4151&quot;&gt;Purchase &amp;quot;Architecting an Apache Iceberg Lakehouse&amp;quot; (50% Off with Code MLMerced)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agentic AI is quickly moving from the whiteboard to production. These agents aren’t just smarter chatbots—they&apos;re intelligent systems that reason, learn, and act with autonomy. They summarize research, manage operations, and even coordinate complex workflows. But while models have become more capable, they still hit a wall without the right data infrastructure.&lt;/p&gt;
&lt;p&gt;That wall? It&apos;s not just about storage—it&apos;s about access, performance, and context.&lt;/p&gt;
&lt;p&gt;Many organizations building AI agents find themselves struggling with data silos, unpredictable performance, and a lack of clarity around what the data actually means. The result? Agents that stall, generate shallow results, or make the wrong decisions altogether.&lt;/p&gt;
&lt;p&gt;To unlock the full potential of Agentic AI, we need to rethink how our data platforms are designed. This is where Apache Iceberg and Dremio come in. Together, they provide a modern, open lakehouse architecture that solves the three core bottlenecks to AI success:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Frictionless access to enterprise data (without data wrangling or replication)&lt;/li&gt;
&lt;li&gt;Autonomous, high-performance query acceleration (built for dynamic workloads)&lt;/li&gt;
&lt;li&gt;A semantic layer that gives agents the context they need to understand and act&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post, we’ll break down each of these challenges—and show how Iceberg and Dremio together build the intelligent data backbone your AI agents need to thrive.&lt;/p&gt;
&lt;h2&gt;The 3 Bottlenecks Blocking Agentic AI from Delivering Real Impact&lt;/h2&gt;
&lt;p&gt;As promising as Agentic AI is, most organizations hit the same three roadblocks on the path to real-world success. These aren&apos;t just technical hurdles—they&apos;re architectural challenges that undermine the speed, accuracy, and reliability of intelligent agents.&lt;/p&gt;
&lt;p&gt;Let’s break them down:&lt;/p&gt;
&lt;h3&gt;1. Access to Data: Silos, Bottlenecks, and Delays&lt;/h3&gt;
&lt;p&gt;AI agents need a holistic view of your enterprise to operate effectively—marketing data, operational logs, customer records, product telemetry, and more. But in most environments, that data is scattered across:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Cloud storage systems&lt;/li&gt;
&lt;li&gt;Operational databases&lt;/li&gt;
&lt;li&gt;SaaS platforms&lt;/li&gt;
&lt;li&gt;Departmental data warehouses&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these systems may have different governance rules, inconsistent formats, or delayed ETL pipelines. Worse, getting access often requires waiting on central data teams or replicating data manually. This slows down experimentation and limits what your agents can “see.”&lt;/p&gt;
&lt;h3&gt;2. Performant Access: When Every Millisecond Counts&lt;/h3&gt;
&lt;p&gt;Even when agents can access data, they still need it fast. AI workflows—especially agentic ones—are dynamic and unpredictable. One minute it’s a lookup query; the next it’s a multi-join aggregation across several sources. Traditional performance tuning—manual partitioning, index maintenance, and query tuning—can’t keep up.&lt;/p&gt;
&lt;p&gt;Agents can’t wait minutes for answers. They need sub-second response times to chain actions together effectively. Without autonomous performance management, latency becomes a dealbreaker.&lt;/p&gt;
&lt;h3&gt;3. Semantic Meaning: Knowing What the Data &lt;em&gt;Actually&lt;/em&gt; Means&lt;/h3&gt;
&lt;p&gt;Access and speed are critical—but so is &lt;strong&gt;understanding&lt;/strong&gt;. AI agents need context to interpret data correctly. What does &lt;code&gt;customer_type = 2&lt;/code&gt; actually mean? Is “margin” defined the same way in marketing and finance? Without a shared semantic layer, agents operate on guesswork.&lt;/p&gt;
&lt;p&gt;This is where many AI initiatives fail quietly. Outputs look correct on the surface but are misaligned with how the business actually thinks about its data.&lt;/p&gt;
&lt;p&gt;Solving these challenges requires more than patchwork fixes. It demands a new kind of data architecture—one that is open, intelligent, and built for automation. And that’s where Apache Iceberg and Dremio make all the difference.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg: The Open Foundation for AI-Ready Data&lt;/h2&gt;
&lt;p&gt;When it comes to building a scalable, AI-optimized data platform, Apache Iceberg is the backbone that holds it all together. It’s not just another table format—it’s the evolution of how data is organized, versioned, and accessed in modern analytics and AI environments.&lt;/p&gt;
&lt;p&gt;Think of Iceberg like the index in a giant filing cabinet. It doesn’t just store your data—it brings order, consistency, and flexibility to your data lake, making it feel like a fully featured data warehouse without giving up the openness of object storage.&lt;/p&gt;
&lt;h3&gt;Why Apache Iceberg Matters for Agentic AI&lt;/h3&gt;
&lt;p&gt;Agentic AI requires access to data that is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistent&lt;/strong&gt;: So the same query always returns the same answer.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evolvable&lt;/strong&gt;: So schema changes don’t break downstream pipelines.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Portable&lt;/strong&gt;: So any tool—Spark, Flink, Dremio, or even your AI agents—can access it without vendor lock-in.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg delivers all of this with features like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema evolution&lt;/strong&gt;: Add, drop, rename columns without rewriting data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time travel&lt;/strong&gt;: Query data “as of” any point in time, ideal for audits or AI state comparisons.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hidden partitioning&lt;/strong&gt;: Optimize performance without complicating your SQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ACID transactions&lt;/strong&gt;: Ensure atomic, consistent updates in multi-writer environments.&lt;/li&gt;
&lt;/ul&gt;
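&lt;p&gt;As a reference point, here is how a couple of these features look in Spark SQL against an Iceberg table (a sketch; the &lt;code&gt;sales&lt;/code&gt; table, timestamp, and snapshot ID are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Schema evolution: add a column without rewriting any data
ALTER TABLE sales ADD COLUMNS (region STRING);

-- Time travel: query the table as of a past timestamp
SELECT * FROM sales TIMESTAMP AS OF &apos;2025-01-01 00:00:00&apos;;

-- Time travel: query a specific snapshot by ID
SELECT * FROM sales VERSION AS OF 1234567890123456789;
&lt;/code&gt;&lt;/pre&gt;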
&lt;p&gt;By standardizing on Iceberg, your organization can avoid the “tool wars” between departments. Everyone works from the same data foundation, using the tools they prefer—whether it’s SQL notebooks, BI dashboards, or LLM-powered agents.&lt;/p&gt;
&lt;h3&gt;The Lakehouse Advantage&lt;/h3&gt;
&lt;p&gt;Iceberg unlocks the full potential of the &lt;strong&gt;lakehouse&lt;/strong&gt; model: combining the flexibility of a data lake with the performance and structure of a data warehouse. This modular approach means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Teams aren’t forced to centralize around one compute engine.&lt;/li&gt;
&lt;li&gt;You avoid redundant data copies and ETL pipelines.&lt;/li&gt;
&lt;li&gt;AI agents can query directly from the lakehouse with open standards.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, Apache Iceberg makes your data open, unified, and production-grade—everything your intelligent agents need to act with confidence.&lt;/p&gt;
&lt;h2&gt;Dremio: The Intelligent Data Interface for Agentic AI&lt;/h2&gt;
&lt;p&gt;Apache Iceberg gives you the open foundation—but Dremio turns that foundation into an intelligent, AI-ready platform. Think of Dremio as the &lt;strong&gt;control plane&lt;/strong&gt; that gives both humans and AI agents seamless access to the data they need, with speed, security, and semantic understanding built in.&lt;/p&gt;
&lt;p&gt;Let’s explore how Dremio removes the remaining friction across access, performance, and context.&lt;/p&gt;
&lt;h3&gt;Unified Access Across All Data (Federation + Simplified Governance)&lt;/h3&gt;
&lt;p&gt;Even in the best-case scenario, not all your data will live in Iceberg tables. You still have data in relational databases, SaaS tools, cloud data warehouses, and more.&lt;/p&gt;
&lt;p&gt;This is where Dremio’s &lt;strong&gt;Zero-ETL Federation&lt;/strong&gt; shines. Dremio connects directly to all your sources—whether it’s Amazon S3, PostgreSQL, Salesforce, or MongoDB—and lets you query them &lt;strong&gt;in place&lt;/strong&gt;, without copying data or building fragile pipelines.&lt;/p&gt;
&lt;p&gt;Benefits for Agentic AI:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Agents can query the full landscape of enterprise data through a &lt;strong&gt;single interface&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Centralized access control means fewer credentials to manage or expose.&lt;/li&gt;
&lt;li&gt;Real-time insights from operational systems without waiting on ingestion jobs.&lt;/li&gt;
&lt;/ul&gt;
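&lt;p&gt;To make the federation idea concrete, a single query can join an Iceberg table with an operational database in place, with no pipeline in between. This is an illustrative sketch; the source names, schemas, and columns are hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hypothetical federated join: Iceberg orders + PostgreSQL customers
SELECT c.customer_id, c.segment, SUM(o.amount) AS total_spend
FROM postgres_crm.public.customers AS c
JOIN lakehouse.sales.orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment;
&lt;/code&gt;&lt;/pre&gt;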
&lt;h3&gt;Autonomous Performance for Unpredictable Workloads&lt;/h3&gt;
&lt;p&gt;Agentic AI is dynamic by nature—queries change based on real-time decisions. You can&apos;t rely on hand-tuned optimizations or static dashboards.&lt;/p&gt;
&lt;p&gt;Dremio solves this with &lt;strong&gt;autonomous performance management&lt;/strong&gt;, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automatic Iceberg Table Optimization&lt;/strong&gt;: Dremio continuously compacts small files, sorts data, and maintains metadata health to reduce I/O and boost query speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reflections&lt;/strong&gt;: Dremio’s version of intelligent materialized views. They’re automatically created, updated incrementally, and substituted at query time—so your agents get faster results without changing their SQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-layered Caching&lt;/strong&gt;: From query plans to result sets to object storage blocks, Dremio caches intelligently to accelerate repeat workloads and reduce cloud costs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This means whether your AI agent is summarizing a dashboard or crunching through user logs, it gets fast, consistent results—without human intervention.&lt;/p&gt;
&lt;h3&gt;Built-in Semantic Layer for Shared Understanding&lt;/h3&gt;
&lt;p&gt;To generate meaningful insights, agents need to understand not just what data &lt;em&gt;is&lt;/em&gt;, but what it &lt;em&gt;means&lt;/em&gt;. Dremio provides a native &lt;strong&gt;semantic layer&lt;/strong&gt; that bridges that gap.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Semantic Search&lt;/strong&gt;: Agents and users can discover datasets using natural language.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Modeling&lt;/strong&gt;: Define reusable business logic, KPIs, and metrics as views.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auto-Generated Wikis&lt;/strong&gt;: Every dataset can include human-readable descriptions—great for onboarding both analysts and AI systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fine-Grained Access Control&lt;/strong&gt;: Row- and column-level security ensures agents see only what they’re authorized to.&lt;/li&gt;
&lt;/ul&gt;
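&lt;p&gt;For example, a definition like “active customer” can be captured once as a view in the semantic layer, so every agent and analyst inherits the same logic. This is a sketch; the names and the 90-day rule are illustrative, and date-function syntax varies by engine:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Encode the business definition of &quot;active customer&quot; once, as a view
CREATE VIEW business.active_customers AS
SELECT customer_id, segment, last_order_date
FROM lakehouse.sales.customers
WHERE last_order_date &amp;gt;= DATE_SUB(CURRENT_DATE, 90);
&lt;/code&gt;&lt;/pre&gt;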
&lt;p&gt;And with Dremio’s &lt;strong&gt;MCP server&lt;/strong&gt;, your agents can programmatically explore metadata, access semantic context, and generate more accurate queries.&lt;/p&gt;
&lt;p&gt;Dremio doesn’t just connect to your data—it understands it, optimizes it, and makes it consumable by anyone (or anything) that needs it. For Agentic AI, this is the difference between guesswork and precision.&lt;/p&gt;
&lt;h2&gt;Closing the Loop: Iceberg + Dremio = AI-Optimized Lakehouse&lt;/h2&gt;
&lt;p&gt;When you bring Apache Iceberg and Dremio together, you don’t just get a modern data stack—you get a foundation built for the realities of Agentic AI.&lt;/p&gt;
&lt;p&gt;Let’s recap how these technologies align to eliminate the core bottlenecks we explored earlier:&lt;/p&gt;
&lt;h3&gt;✅ Unlocking Access&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; standardizes how data is stored, making it accessible across tools and teams.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; federates access to all your data sources—cloud, on-prem, SaaS, and more—without the overhead of ETL or manual integration.&lt;/li&gt;
&lt;li&gt;AI agents can now query the full enterprise landscape through a single interface, using a single set of credentials, securely and efficiently.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;✅ Delivering Performance Autonomously&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg enables high-performance table management (partitioning, file pruning, metadata tracking).&lt;/li&gt;
&lt;li&gt;Dremio automates this further—handling compaction, caching, and query acceleration behind the scenes.&lt;/li&gt;
&lt;li&gt;Reflections, smart caching, and autonomous query optimization ensure agents get sub-second responses, no matter how complex or spontaneous the query.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;✅ Embedding Context Through Semantics&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Iceberg brings structure to your data lake, but Dremio gives it &lt;strong&gt;meaning&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Through Dremio’s built-in semantic layer and MCP server, your AI agents can interpret, navigate, and reason about data the way your business does.&lt;/li&gt;
&lt;li&gt;Whether it’s knowing what “active customer” means or filtering by business unit, Dremio gives your agents the vocabulary to deliver trusted outcomes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is a truly intelligent lakehouse—open, unified, performant, and semantically rich. One that doesn’t just serve humans, but empowers agents to act, adapt, and deliver real business value.&lt;/p&gt;
&lt;p&gt;If Agentic AI is your destination, Apache Iceberg and Dremio are the road and the vehicle that will take you there.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=agentic-ai&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Get hands-on with Dremio and Apache Iceberg today&lt;/a&gt; and start building the intelligent data foundation your AI agents need to thrive.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Hidden Pitfalls — Compaction and Partition Evolution in Apache Iceberg</title><link>https://iceberglakehouse.com/posts/iceberg-partition-evolution-compaction/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-partition-evolution-compaction/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 02 Sep 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Hidden Pitfalls — Compaction and Partition Evolution in Apache Iceberg&lt;/h1&gt;
&lt;p&gt;Apache Iceberg offers &lt;strong&gt;partition evolution&lt;/strong&gt;, allowing you to change how your data is partitioned over time without rewriting historical files. This is a major advantage over legacy file formats, but it also introduces new challenges—especially when it comes to &lt;strong&gt;compaction and query optimization&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore how partition evolution can impact compaction, metadata management, and query performance—and how to avoid the most common pitfalls.&lt;/p&gt;
&lt;h2&gt;What Is Partition Evolution?&lt;/h2&gt;
&lt;p&gt;Partition evolution allows you to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add new partition fields&lt;/li&gt;
&lt;li&gt;Drop old partition fields&lt;/li&gt;
&lt;li&gt;Change partition transforms (e.g., from &lt;code&gt;day(ts)&lt;/code&gt; to &lt;code&gt;hour(ts)&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Unlike traditional systems that enforce a single static layout, Iceberg lets you evolve the partitioning strategy without rewriting or invalidating historical data.&lt;/p&gt;
&lt;h3&gt;Example:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Original partitioning
ALTER TABLE sales ADD PARTITION FIELD day(order_date);

-- Later evolve to hourly
ALTER TABLE sales DROP PARTITION FIELD day(order_date);
ALTER TABLE sales ADD PARTITION FIELD hour(order_date);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each data file respects the partition spec that was active at the time it was written, so historical files remain valid after the spec changes.&lt;/p&gt;
&lt;h2&gt;The Pitfall: Compaction Across Partition Specs&lt;/h2&gt;
&lt;p&gt;When compaction jobs span files written under different partition specs, several challenges arise:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;File layout inconsistency:&lt;/strong&gt; Compaction may combine files that don’t share a common layout, resulting in mixed partition values that reduce query pruning efficiency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reduced predicate pushdown:&lt;/strong&gt; Query engines rely on partition columns for efficient pruning. If files are mixed across specs, pruning may be incomplete, increasing scan cost.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compaction failures or misbehavior:&lt;/strong&gt; Some engines may fail to rewrite files, or rewrite them incorrectly, when specs conflict, especially in older versions of Iceberg libraries or poorly configured environments.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Best Practices to Manage Partition Evolution Safely&lt;/h2&gt;
&lt;h3&gt;1. Compact Within Partition Spec Versions&lt;/h3&gt;
&lt;p&gt;Query the files metadata table to identify which files belong to which spec:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT spec_id, COUNT(*) AS file_count
FROM my_table.files
GROUP BY spec_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run compaction per spec_id to preserve consistency and avoid mixing files.&lt;/p&gt;
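&lt;p&gt;In Spark, for instance, the &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure can drive such a compaction pass; a &lt;code&gt;where&lt;/code&gt; predicate on the relevant partition columns keeps each run scoped to one spec&apos;s layout. This is a sketch; the catalog name, table name, and option values are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Compact small files toward a 128 MB target (Spark procedure)
CALL my_catalog.system.rewrite_data_files(
  table =&amp;gt; &apos;db.my_table&apos;,
  options =&amp;gt; map(&apos;min-input-files&apos;, &apos;5&apos;,
                 &apos;target-file-size-bytes&apos;, &apos;134217728&apos;)
);
&lt;/code&gt;&lt;/pre&gt;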
&lt;h3&gt;2. Track and Align Sorting and Clustering&lt;/h3&gt;
&lt;p&gt;When evolving partitions, ensure that sort orders are also updated. Mismatched sort and partition strategies can undermine clustering efforts.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT spec_id, sort_order_id, COUNT(*) AS file_count
FROM my_table.files 
GROUP BY spec_id, sort_order_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Repartition Carefully and Gradually&lt;/h3&gt;
&lt;p&gt;Avoid abrupt changes like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Switching from coarse to fine partitioning (e.g., day to minute)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dropping too many partition fields at once&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These changes can lead to over-fragmentation and more small files unless paired with compaction and sort order realignment.&lt;/p&gt;
&lt;h3&gt;4. Use Metadata Tables to Guide Evolution&lt;/h3&gt;
&lt;p&gt;Before evolving a partition spec:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Inspect query patterns (e.g., WHERE clauses)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Evaluate partition sizes and access frequencies&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use tools like Dremio’s catalog lineage and query analyzer if available&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Communicate Changes Across Teams&lt;/h3&gt;
&lt;p&gt;If your tables are used across multiple teams or tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Document changes to partitioning logic&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Include schema and partition spec history in data documentation&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Coordinate compaction jobs after major partition changes&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Partition evolution is one of Iceberg’s superpowers—but like all powerful features, it must be used wisely. To avoid performance and optimization issues:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Don’t mix files with different partition specs in compaction jobs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Update sort orders and clustering with partition changes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitor partition usage and access patterns continuously&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next post, we’ll move from structural design to execution tuning—exploring how to scale compaction operations efficiently using parallelism, checkpointing, and fault tolerance.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Using Iceberg Metadata Tables to Determine When Compaction Is Needed</title><link>https://iceberglakehouse.com/posts/iceberg-metadata-triggered-compaction/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-metadata-triggered-compaction/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 26 Aug 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Using Iceberg Metadata Tables to Determine When Compaction Is Needed&lt;/h1&gt;
&lt;p&gt;Scheduling compaction at fixed intervals is better than not optimizing at all—but it can still lead to unnecessary compute spend or delayed maintenance. A smarter approach is to &lt;strong&gt;dynamically trigger compaction&lt;/strong&gt; based on &lt;strong&gt;real-time metadata signals&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Apache Iceberg makes this possible with its powerful system of &lt;strong&gt;metadata tables&lt;/strong&gt;, which expose granular details about files, snapshots, and manifests.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll explore how to query these tables to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Detect small files&lt;/li&gt;
&lt;li&gt;Identify bloated partitions&lt;/li&gt;
&lt;li&gt;Spot manifest inefficiencies&lt;/li&gt;
&lt;li&gt;Automate event-driven compaction workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Are Iceberg Metadata Tables?&lt;/h2&gt;
&lt;p&gt;Every Iceberg table automatically maintains a set of virtual tables that expose its internals. The most relevant for optimization include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;files&lt;/code&gt; – List of all data files in the table, including size, partition, and metrics&lt;/li&gt;
&lt;li&gt;&lt;code&gt;manifests&lt;/code&gt; – List of manifest files and the data files they reference&lt;/li&gt;
&lt;li&gt;&lt;code&gt;snapshots&lt;/code&gt; – History of table changes and snapshot metadata&lt;/li&gt;
&lt;li&gt;&lt;code&gt;history&lt;/code&gt; – Timeline of snapshot commits and their lineage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tables can be queried like any other SQL table, making it easy to introspect your table’s health.&lt;/p&gt;
&lt;h2&gt;1. Detecting Small Files with the &lt;code&gt;files&lt;/code&gt; Table&lt;/h2&gt;
&lt;p&gt;To identify partitions suffering from small file syndrome:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  partition,
  COUNT(*) AS file_count,
  AVG(file_size_in_bytes) AS avg_size_bytes
FROM my_table.files
GROUP BY partition
HAVING COUNT(*) &amp;gt; 10 AND AVG(file_size_in_bytes) &amp;lt; 134217728; -- 128 MB
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can use this to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Trigger compaction on specific partitions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitor trends in file size distribution over time&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;2. Finding Fragmented or Stale Manifests&lt;/h2&gt;
&lt;p&gt;Bloated metadata can come from too many or inefficient manifest files. Use the manifests table to explore:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  COUNT(*) AS manifest_count,
  AVG(added_data_files_count) AS avg_files_per_manifest
FROM my_table.manifests;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Low averages can indicate fragmented manifests that are good candidates for rewriting.&lt;/p&gt;
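&lt;p&gt;When that is the case, Spark&apos;s &lt;code&gt;rewrite_manifests&lt;/code&gt; procedure can consolidate them (the catalog and table names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Consolidate fragmented manifest files (Spark procedure)
CALL my_catalog.system.rewrite_manifests(&apos;db.my_table&apos;);
&lt;/code&gt;&lt;/pre&gt;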
&lt;h2&gt;3. Tracking Snapshot Volume and Velocity&lt;/h2&gt;
&lt;p&gt;To see if snapshots are accumulating too fast (and increasing metadata overhead):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  COUNT(*) AS snapshot_count,
  MIN(committed_at) AS first_snapshot,
  MAX(committed_at) AS latest_snapshot
FROM my_table.snapshots;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also inspect how many files each snapshot adds or removes to identify noisy patterns from ingestion jobs.&lt;/p&gt;
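&lt;p&gt;For example, the &lt;code&gt;summary&lt;/code&gt; map on each snapshot records per-commit file counts. A sketch in Spark SQL (summary key names may vary by Iceberg version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  snapshot_id,
  operation,
  summary[&apos;added-data-files&apos;]   AS added_files,
  summary[&apos;deleted-data-files&apos;] AS deleted_files
FROM my_table.snapshots
ORDER BY committed_at DESC;
&lt;/code&gt;&lt;/pre&gt;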
&lt;h2&gt;4. Building a Health Score&lt;/h2&gt;
&lt;p&gt;By combining file count, file size, manifest count, and snapshot frequency, you can compute a &amp;quot;table health score&amp;quot;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Example: High file count + small average size = poor health
WITH file_stats AS (
  SELECT COUNT(*) AS total_files, AVG(file_size_in_bytes) AS avg_file_size
  FROM my_table.files
),
manifest_stats AS (
  SELECT COUNT(*) AS total_manifests
  FROM my_table.manifests
)
SELECT
  total_files,
  avg_file_size,
  total_manifests,
  CASE
    WHEN avg_file_size &amp;lt; 67108864 AND total_files &amp;gt; 1000 THEN &apos;Needs compaction&apos;
    ELSE &apos;Healthy&apos;
  END AS status
FROM file_stats, manifest_stats;
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;5. Triggering Compaction Automatically&lt;/h2&gt;
&lt;p&gt;Once you identify problematic patterns, you can wire up your orchestration layer to act:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use Airflow, Dagster, or dbt Cloud to run SQL-based checks&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When thresholds are breached, trigger Spark/Flink compaction jobs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Track results and update monitoring dashboards&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This ensures you optimize only when needed, keeping costs and latency low.&lt;/p&gt;
&lt;h2&gt;Benefits of Metadata-Driven Optimization&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Precision: Only touch affected partitions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Efficiency: Avoid unnecessary compute jobs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Responsiveness: React to real-time ingestion patterns&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Governance: Create audit trails for all compaction decisions&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Apache Iceberg gives you visibility and control over your tables through metadata tables. By tapping into this metadata:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;You avoid blind scheduling of compaction&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You build smarter, more efficient optimization workflows&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You reduce both query latency and operational cost&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next post, we’ll dive into partition evolution and layout pitfalls, and how to avoid undermining your compaction and clustering strategies when schemas or partitions change.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Designing the Ideal Cadence for Compaction and Snapshot Expiration</title><link>https://iceberglakehouse.com/posts/iceberg-optimization-cadence/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-optimization-cadence/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 19 Aug 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Designing the Ideal Cadence for Compaction and Snapshot Expiration&lt;/h1&gt;
&lt;p&gt;In previous posts, we explored how compaction and snapshot expiration keep Apache Iceberg tables performant and lean. But these actions aren’t one-and-done—they need to be &lt;strong&gt;scheduled strategically&lt;/strong&gt; to balance compute cost, data freshness, and operational safety.&lt;/p&gt;
&lt;p&gt;In this post, we’ll look at how to design a &lt;strong&gt;cadence&lt;/strong&gt; for compaction and snapshot expiration based on your workload patterns, data criticality, and infrastructure constraints.&lt;/p&gt;
&lt;h2&gt;Why Cadence Matters&lt;/h2&gt;
&lt;p&gt;Without a thoughtful schedule:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Over-optimization&lt;/strong&gt; can waste compute and create unnecessary load&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Under-optimization&lt;/strong&gt; leads to performance degradation and metadata bloat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poor coordination&lt;/strong&gt; can cause clashes with ingestion or query jobs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You need a cadence that fits your data’s lifecycle and your platform’s SLAs.&lt;/p&gt;
&lt;h2&gt;Key Factors to Consider&lt;/h2&gt;
&lt;h3&gt;1. &lt;strong&gt;Ingestion Rate and Pattern&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Streaming data?&lt;/strong&gt; Expect high file churn. Compact frequently (hourly or near-real-time).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batch jobs?&lt;/strong&gt; Compact after each large load or on a daily schedule.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid?&lt;/strong&gt; Monitor ingestion metrics and trigger compaction based on thresholds.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Query Frequency and Latency Expectations&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High query volume tables&lt;/strong&gt; benefit from more frequent compaction to improve scan performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low-usage tables&lt;/strong&gt; can tolerate less frequent optimization.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Storage Costs and File System Limits&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Cloud storage costs can balloon with small files and lingering unreferenced data.&lt;/li&gt;
&lt;li&gt;File system metadata limits may also be a concern at massive scale.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Retention and Governance Requirements&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Snapshots may need to be retained longer for audit or rollback policies.&lt;/li&gt;
&lt;li&gt;Balance expiration with compliance needs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Suggested Cadence Models&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Compaction Cadence&lt;/th&gt;
&lt;th&gt;Snapshot Expiration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;High-volume streaming pipeline&lt;/td&gt;
&lt;td&gt;Hourly or event-based&lt;/td&gt;
&lt;td&gt;Daily, keep 1–3 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily batch ingestion&lt;/td&gt;
&lt;td&gt;Post-batch or nightly&lt;/td&gt;
&lt;td&gt;Weekly, keep 7–14 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low-latency analytics&lt;/td&gt;
&lt;td&gt;Hourly&lt;/td&gt;
&lt;td&gt;Daily, keep 3–5 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regulatory or audited data&lt;/td&gt;
&lt;td&gt;Weekly or on-demand&lt;/td&gt;
&lt;td&gt;Monthly, retain 30–90 days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Use metadata queries (e.g., from &lt;code&gt;files&lt;/code&gt;, &lt;code&gt;manifests&lt;/code&gt;, &lt;code&gt;snapshots&lt;/code&gt;) to drive dynamic policies.&lt;/p&gt;
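&lt;p&gt;As a minimal sketch of such a dynamic policy, the helper below decides whether a partition is due for compaction from its &lt;code&gt;files&lt;/code&gt; stats (the function name and thresholds are illustrative, not part of Iceberg):&lt;/p&gt;

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # common Iceberg write target

def needs_compaction(file_sizes, max_files=100, small_ratio=0.5):
    """Decide whether to compact a partition.

    file_sizes: file_size_in_bytes values for one partition, e.g. pulled
    from `SELECT file_size_in_bytes FROM my_table.files`.
    Compact when the partition has too many files, or when more than
    half of them are under 50% of the target size.
    """
    if not file_sizes:
        return False
    small = sum(1 for s in file_sizes if s < TARGET_FILE_BYTES // 2)
    return len(file_sizes) > max_files or small / len(file_sizes) > small_ratio
```

A scheduler can evaluate this per partition each cycle and compact only the partitions that trip a threshold.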
&lt;h2&gt;Automating the Schedule&lt;/h2&gt;
&lt;p&gt;You can use orchestration tools like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Airflow / Dagster / Prefect&lt;/strong&gt;: Schedule and monitor compaction and expiration tasks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dbt Cloud&lt;/strong&gt;: Use post-run hooks or scheduled jobs to optimize models backed by Iceberg&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flink / Spark Streaming&lt;/strong&gt;: Trigger compaction inline or via micro-batch jobs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tip: Tag critical jobs with priorities and isolate them from ingestion workloads where needed.&lt;/p&gt;
&lt;h2&gt;Coordinating Between Compaction and Expiration&lt;/h2&gt;
&lt;p&gt;Ideally:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Compact first&lt;/strong&gt;, then &lt;strong&gt;expire snapshots&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;This ensures snapshots written by compaction are retained at least temporarily&lt;/li&gt;
&lt;li&gt;Avoid expiring snapshots too soon after compaction to prevent data loss&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example Workflow:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Run metadata scan to detect small file bloat&lt;/li&gt;
&lt;li&gt;Trigger compaction on affected partitions&lt;/li&gt;
&lt;li&gt;Delay snapshot expiration by a few hours&lt;/li&gt;
&lt;li&gt;Run snapshot expiration with a safety buffer&lt;/li&gt;
&lt;/ol&gt;
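&lt;p&gt;The workflow above can be sketched in plain Python; the three callables are stand-ins for your engine&apos;s actual scan, compaction, and expiration jobs:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

def maintenance_cycle(scan_small_file_partitions, compact, expire_snapshots,
                      safety_buffer=timedelta(hours=6)):
    """Compact first, then expire snapshots behind a safety buffer.

    scan_small_file_partitions() -> partitions with small-file bloat
    compact(partition)           -> rewrites that partition's files
    expire_snapshots(older_than) -> expires snapshots committed before the cutoff
    """
    for partition in scan_small_file_partitions():
        compact(partition)
    # Only expire snapshots older than the buffer, so the snapshots
    # just written by compaction cannot be removed.
    cutoff = datetime.now(timezone.utc) - safety_buffer
    expire_snapshots(older_than=cutoff)
    return cutoff
```

The ordering matters: because expiration runs after compaction and only targets snapshots older than the buffer, a compaction snapshot can never be expired by the same cycle that produced it.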
&lt;h2&gt;Monitoring and Adjusting Over Time&lt;/h2&gt;
&lt;p&gt;Cadence isn’t static—adjust based on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Changing ingestion rates&lt;/li&gt;
&lt;li&gt;New query patterns&lt;/li&gt;
&lt;li&gt;Storage trends&lt;/li&gt;
&lt;li&gt;Platform feedback (slow queries, GC delays, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Use logs, metadata tables, and query performance dashboards to guide adjustments.&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;An effective compaction and snapshot expiration cadence keeps your Iceberg tables fast, lean, and cost-effective. Your schedule should:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Match your workload patterns&lt;/li&gt;
&lt;li&gt;Respect operational and governance needs&lt;/li&gt;
&lt;li&gt;Be flexible and monitorable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next post, we’ll look at how to use &lt;strong&gt;Iceberg’s metadata tables&lt;/strong&gt; to dynamically determine &lt;em&gt;when&lt;/em&gt; optimization is needed—so you can make it event-driven instead of fixed-schedule.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Avoiding Metadata Bloat with Snapshot Expiration and Rewriting Manifests</title><link>https://iceberglakehouse.com/posts/iceberg-metadata-bloat-cleanup/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-metadata-bloat-cleanup/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 12 Aug 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Avoiding Metadata Bloat with Snapshot Expiration and Rewriting Manifests&lt;/h1&gt;
&lt;p&gt;As your Apache Iceberg tables evolve—through continuous writes, schema changes, and compaction jobs—they generate a growing amount of &lt;strong&gt;metadata&lt;/strong&gt;. While metadata is a powerful feature in Iceberg, enabling time travel and auditability, &lt;strong&gt;unchecked metadata growth&lt;/strong&gt; can lead to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slower planning and query times&lt;/li&gt;
&lt;li&gt;Increased storage costs&lt;/li&gt;
&lt;li&gt;Longer table commit and rollback operations&lt;/li&gt;
&lt;li&gt;Excessive memory usage during scans&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this post, we’ll explore how to &lt;strong&gt;expire old snapshots&lt;/strong&gt; and &lt;strong&gt;rewrite manifests&lt;/strong&gt; to keep your Iceberg tables lean, responsive, and cost-efficient.&lt;/p&gt;
&lt;h2&gt;What Causes Metadata Bloat?&lt;/h2&gt;
&lt;p&gt;Iceberg tracks table state through a series of &lt;strong&gt;snapshots&lt;/strong&gt;. Each snapshot references a set of &lt;strong&gt;manifest lists&lt;/strong&gt;, which in turn reference &lt;strong&gt;manifest files&lt;/strong&gt; describing individual data files.&lt;/p&gt;
&lt;p&gt;Bloat occurs when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Snapshots accumulate and are not expired&lt;/li&gt;
&lt;li&gt;Manifests are duplicated across snapshots&lt;/li&gt;
&lt;li&gt;Files are replaced by compaction but older snapshots still reference them&lt;/li&gt;
&lt;li&gt;Streaming ingestion creates frequent small commits, generating excessive metadata&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Expiring Snapshots&lt;/h2&gt;
&lt;p&gt;You can safely remove older snapshots using Iceberg’s built-in expiration functionality. This deletes metadata for snapshots that are no longer needed for time travel, rollback, or audit purposes.&lt;/p&gt;
&lt;h3&gt;Example in Spark:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;import java.util.concurrent.TimeUnit
import org.apache.iceberg.actions.Actions

Actions.forTable(spark, table)
  .expireSnapshots()
  .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)) // keep 7 days
  .retainLast(2) // keep last 2 snapshots no matter what
  .execute();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This keeps recent snapshots while cleaning up older ones, freeing up metadata and unreferenced data files (if garbage collection is also enabled).&lt;/p&gt;
&lt;h3&gt;Guidelines:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Retain at least a few recent snapshots for rollback safety&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use a time-based and count-based retention policy&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Coordinate expiration with your data governance policies&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Rewriting Manifests&lt;/h2&gt;
&lt;p&gt;Over time, manifest files can become inefficient:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Many may reference the same files across snapshots&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Some may contain only a few files due to small writes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Their layout may be suboptimal for query planning&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can rewrite manifests to consolidate and reorganize them for improved performance.&lt;/p&gt;
&lt;h3&gt;Example in Spark:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;Actions.forTable(spark, table)
  .rewriteManifests()
  .execute();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This reduces metadata file count, organizes manifests by partition and sort order, and can improve query planning times.&lt;/p&gt;
&lt;h2&gt;When Should You Perform Metadata Cleanup?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;After large ingestion spikes (e.g., backfills)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Following streaming workloads with high commit frequency&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Post compaction or schema evolution&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On a scheduled basis (e.g., daily or weekly)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Bonus: Use Metadata Tables to Inspect Bloat&lt;/h2&gt;
&lt;p&gt;Iceberg’s metadata tables help you inspect how much bloat has built up.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT snapshot_id,
       summary[&apos;added-data-files&apos;] AS added_data_files,
       summary[&apos;total-data-files&apos;] AS total_data_files
FROM my_table.snapshots
ORDER BY committed_at DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT COUNT(*) FROM my_table.manifests;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These insights can help you determine when cleanup is needed.&lt;/p&gt;
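&lt;p&gt;One way to turn those counts into an automated signal (the function name and thresholds below are illustrative, not part of Iceberg):&lt;/p&gt;

```python
def cleanup_actions(snapshot_count, manifest_count,
                    max_snapshots=50, max_manifests=500):
    """Map metadata-table counts to the maintenance actions they suggest.

    snapshot_count: e.g. SELECT COUNT(*) FROM my_table.snapshots
    manifest_count: e.g. SELECT COUNT(*) FROM my_table.manifests
    """
    actions = []
    if snapshot_count > max_snapshots:
        actions.append("expire_snapshots")
    if manifest_count > max_manifests:
        actions.append("rewrite_manifests")
    return actions
```

An orchestration job can run these two counts on a schedule and only launch the cleanup actions that are actually needed.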
&lt;h2&gt;Tradeoffs and Cautions&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Snapshot expiration is irreversible: Make sure you don’t need the old snapshots for recovery or audit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Manifest rewrites are safe but can be compute-intensive on large tables—schedule wisely.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Storage GC may require coordination with your catalog to clean up unreferenced files.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Metadata is a powerful part of Iceberg’s architecture, but without routine maintenance, it can weigh down your table performance. By:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Expiring stale snapshots&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Rewriting bloated manifests&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Monitoring metadata tables regularly&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;you ensure that your Iceberg tables remain agile, scalable, and ready for production workloads.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore how to design the ideal cadence for compaction and snapshot expiration so your optimizations are timely and cost-effective.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Smarter Data Layout — Sorting and Clustering Iceberg Tables</title><link>https://iceberglakehouse.com/posts/iceberg-clustering-sorting-zorder/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-clustering-sorting-zorder/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 05 Aug 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Smarter Data Layout — Sorting and Clustering Iceberg Tables&lt;/h1&gt;
&lt;p&gt;So far in this series, we&apos;ve focused on optimizing file sizes to reduce metadata and scan overhead. But &lt;strong&gt;how data is laid out within those files&lt;/strong&gt; can be just as important as the size of the files themselves.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll explore &lt;strong&gt;clustering techniques in Apache Iceberg&lt;/strong&gt;, including &lt;strong&gt;sort order&lt;/strong&gt; and &lt;strong&gt;Z-ordering&lt;/strong&gt;, and how these techniques improve query performance by reducing the amount of data that needs to be read.&lt;/p&gt;
&lt;h2&gt;Why Clustering Matters&lt;/h2&gt;
&lt;p&gt;Imagine a query that filters on a &lt;code&gt;customer_id&lt;/code&gt;. If your data is randomly distributed, every file needs to be scanned. But if the data is sorted or clustered, the engine can skip over entire files or row groups — reducing I/O and speeding up execution.&lt;/p&gt;
&lt;p&gt;Clustering benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fewer files and rows scanned&lt;/li&gt;
&lt;li&gt;Better compression ratios&lt;/li&gt;
&lt;li&gt;Faster joins and aggregations&lt;/li&gt;
&lt;li&gt;More efficient pruning of partitions and row groups&lt;/li&gt;
&lt;/ul&gt;
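&lt;p&gt;The skipping effect described above can be illustrated with a toy min/max filter. Iceberg tracks per-file column bounds in its manifests; the dicts below are illustrative stand-ins for those stats:&lt;/p&gt;

```python
def prune_files(files, value):
    """Keep only files whose [min, max] bounds could contain `value`."""
    return [f for f in files if f["min"] <= value <= f["max"]]

# Sorted layout: each file covers a narrow, disjoint customer_id range
sorted_layout = [{"min": 0, "max": 99}, {"min": 100, "max": 199}, {"min": 200, "max": 299}]
# Random layout: every file spans nearly the whole range, so nothing is skipped
random_layout = [{"min": 0, "max": 290}, {"min": 5, "max": 295}, {"min": 2, "max": 299}]
```

With a filter on &lt;code&gt;customer_id = 150&lt;/code&gt;, the sorted layout scans one file while the random layout scans all three.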
&lt;h2&gt;Sorting in Iceberg&lt;/h2&gt;
&lt;p&gt;Iceberg supports &lt;strong&gt;sort order evolution&lt;/strong&gt;, which lets you define how data should be physically sorted as it&apos;s written or rewritten.&lt;/p&gt;
&lt;p&gt;You can define sort orders during write or compaction:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;// Replace the table&apos;s sort order; engines apply it on later writes and rewrites
table.replaceSortOrder()
  .asc(&amp;quot;customer_id&amp;quot;)
  .desc(&amp;quot;order_date&amp;quot;)
  .commit()
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Use Cases for Sorting&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Time-series data:&lt;/strong&gt; sort by event_time to improve range queries&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dimension filters:&lt;/strong&gt; sort by commonly filtered columns like region, user_id&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Joins:&lt;/strong&gt; sort by join keys to speed up hash joins and reduce shuffling&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Z-order Clustering&lt;/h2&gt;
&lt;p&gt;Z-ordering is a multi-dimensional clustering technique that co-locates related values across multiple columns. It&apos;s ideal for exploratory queries that filter on different combinations of columns.&lt;/p&gt;
&lt;p&gt;Example, using Spark&apos;s &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure with a z-order strategy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL catalog.system.rewrite_data_files(
  table =&amp;gt; &apos;db.my_table&apos;,
  strategy =&amp;gt; &apos;sort&apos;,
  sort_order =&amp;gt; &apos;zorder(customer_id, product_id, region)&apos;
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Z-ordering works by interleaving bits from multiple columns to keep related rows close together. This increases the chance that queries filtering on any subset of these columns can benefit from data skipping.&lt;/p&gt;
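&lt;p&gt;A toy sketch of bit interleaving, to build intuition (this is not Iceberg&apos;s internal implementation):&lt;/p&gt;

```python
def interleave_bits(x: int, y: int, width: int = 8) -> int:
    """Build a z-value by alternating the bits of x and y.

    Rows whose (x, y) pairs are close end up with close z-values,
    so sorting by z-value co-locates them in the same files.
    """
    z = 0
    for i in range(width):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y at odd positions
    return z
```

Sorting rows by this combined value gives each file a tight min/max range on both columns at once, which is what enables skipping on either filter column.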
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Z-ordering is supported by Iceberg through integrations like Dremio&apos;s Iceberg Auto-Clustering and Spark jobs using RewriteDataFiles.&lt;/p&gt;
&lt;h2&gt;Choosing Between Sort and Z-order&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Best Technique&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Filtering on one key column&lt;/td&gt;
&lt;td&gt;Simple Sort&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Range queries on timestamps&lt;/td&gt;
&lt;td&gt;Sort on time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-column filtering&lt;/td&gt;
&lt;td&gt;Z-order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joins on a key column&lt;/td&gt;
&lt;td&gt;Sort on join key&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex OLAP-style filters&lt;/td&gt;
&lt;td&gt;Z-order&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;When to Apply Clustering&lt;/h2&gt;
&lt;p&gt;Clustering is typically applied:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;During initial writes, if the engine supports it&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;As part of compaction jobs, using RewriteDataFiles with sort order&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In Spark, you can specify sort order in rewrite actions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;import org.apache.iceberg.SortOrder
import org.apache.iceberg.spark.actions.SparkActions

SparkActions.get()
  .rewriteDataFiles(table)
  .sort(SortOrder.builderFor(table.schema())
    .asc(&amp;quot;region&amp;quot;)
    .asc(&amp;quot;event_time&amp;quot;)
    .build())
  .execute()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Make sure the sort order aligns with your most frequent query patterns.&lt;/p&gt;
&lt;h2&gt;Tradeoffs&lt;/h2&gt;
&lt;p&gt;While clustering helps query performance, it comes with tradeoffs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Sorting increases job duration: Sorting is more expensive than just rewriting files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clustering can become outdated: Evolving data patterns may require adjusting sort orders&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Not all engines respect sort order: Make sure your query engine leverages the layout&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Smart data layout is essential for fast queries in Apache Iceberg. By leveraging sorting and Z-order clustering:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;You reduce the volume of data scanned&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Improve filter selectivity&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Optimize performance for a wide variety of workloads&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next post, we’ll look at another silent performance killer: metadata bloat, and how to clean it up using snapshot expiration and manifest rewriting.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Optimizing Compaction for Streaming Workloads in Apache Iceberg</title><link>https://iceberglakehouse.com/posts/iceberg-streaming-compaction/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-streaming-compaction/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 29 Jul 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Optimizing Compaction for Streaming Workloads in Apache Iceberg&lt;/h1&gt;
&lt;p&gt;In traditional batch pipelines, compaction jobs can run in large windows during idle periods. But in streaming workloads, data is written continuously—often in small increments—leading to rapid small file accumulation and tight freshness requirements.&lt;/p&gt;
&lt;p&gt;So how do we compact Iceberg tables without interfering with ingestion and latency-sensitive reads? This post explores how to &lt;strong&gt;design efficient, incremental compaction jobs&lt;/strong&gt; that preserve performance without disrupting your streaming pipelines.&lt;/p&gt;
&lt;h2&gt;The Challenge with Streaming + Compaction&lt;/h2&gt;
&lt;p&gt;Streaming ingestion into Apache Iceberg often uses micro-batches or event-driven triggers that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate many small files per partition&lt;/li&gt;
&lt;li&gt;Write new snapshots frequently&lt;/li&gt;
&lt;li&gt;Introduce high metadata churn&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A naive compaction job that rewrites entire partitions or the whole table risks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Commit contention&lt;/strong&gt; with streaming jobs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stale data&lt;/strong&gt; in read replicas or downstream queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency spikes&lt;/strong&gt; if compaction blocks snapshot availability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key is to &lt;strong&gt;optimize incrementally and intelligently.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Techniques for Streaming-Safe Compaction&lt;/h2&gt;
&lt;h3&gt;1. &lt;strong&gt;Compact Only Cold Partitions&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Don’t rewrite partitions actively being written to. Instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Identify &amp;quot;cold&amp;quot; partitions (e.g., older than 1 hour if partitioned by hour)&lt;/li&gt;
&lt;li&gt;Compact only those to avoid conflicts with streaming writes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example query using the &lt;code&gt;partitions&lt;/code&gt; metadata table (&lt;code&gt;last_updated_at&lt;/code&gt; is available in recent Iceberg releases):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT partition, file_count
FROM my_table.partitions
WHERE last_updated_at &amp;lt; current_timestamp() - INTERVAL &apos;1 hour&apos;
  AND file_count &amp;gt; 10;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This can drive dynamic, safe compaction logic in orchestration tools.&lt;/p&gt;
&lt;h3&gt;2. Use Incremental Compaction Windows&lt;/h3&gt;
&lt;p&gt;Instead of full rewrites:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Compact only a subset of files at a time (e.g., oldest or smallest)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Avoid reprocessing already optimized files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reduce job run time to minutes instead of hours&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Spark&apos;s RewriteDataFiles and Dremio&apos;s &lt;code&gt;OPTIMIZE&lt;/code&gt; features both support targeted rewrites.&lt;/p&gt;
&lt;h3&gt;3. Trigger Based on Metadata Metrics&lt;/h3&gt;
&lt;p&gt;Rather than scheduling compaction at fixed intervals, use metadata-driven triggers like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Number of files per partition &amp;gt; threshold&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Average file size &amp;lt; target&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;File age &amp;gt; threshold&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can track these via the &lt;code&gt;files&lt;/code&gt; and &lt;code&gt;manifests&lt;/code&gt; metadata tables and use orchestration tools (e.g., Airflow, Dagster, dbt Cloud) to trigger compaction.&lt;/p&gt;
&lt;p&gt;Example: Time-Based Compaction Script (Pseudo-code)&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# For each partition older than 1 hour with many small files
for partition in get_partitions_older_than(hours=1):
    if count_small_files(partition) &amp;gt; threshold:
        run_compaction(partition)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This pattern allows incremental, scoped jobs that don’t touch fresh data.&lt;/p&gt;
&lt;h2&gt;Tuning for Performance&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Parallelism:&lt;/strong&gt; Use high parallelism for wide tables to speed up job runtime&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Target file size:&lt;/strong&gt; Stick to the 128MB–256MB range unless your queries benefit from larger files&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retries and checkpointing:&lt;/strong&gt; Make sure jobs are fault-tolerant in production&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;To maintain performance in streaming Iceberg pipelines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Compact frequently, but narrowly&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use metadata to guide scope&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Avoid active partitions and large rewrites&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leverage orchestration and branching when available&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the right setup, you can keep query performance and data freshness high—without sacrificing one for the other.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Basics of Compaction — Bin Packing Your Data for Efficiency</title><link>https://iceberglakehouse.com/posts/iceberg-optimization-compaction-basics/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-optimization-compaction-basics/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 22 Jul 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;The Basics of Compaction — Bin Packing Your Data for Efficiency&lt;/h1&gt;
&lt;p&gt;In the first post of this series, we explored how Apache Iceberg tables degrade when left unoptimized. Now it&apos;s time to look at the most foundational optimization technique: &lt;strong&gt;compaction&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Compaction is the process of merging small files into larger ones to reduce file system overhead and improve query performance. In Iceberg, this usually takes the form of &lt;strong&gt;bin packing&lt;/strong&gt; — grouping smaller files together so they align with an optimal size target.&lt;/p&gt;
&lt;h2&gt;Why Bin Packing Matters&lt;/h2&gt;
&lt;p&gt;Query engines like Dremio, Trino, and Spark operate more efficiently when reading a smaller number of larger files instead of a large number of tiny files. Every file adds cost:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It triggers an I/O request&lt;/li&gt;
&lt;li&gt;It needs to be tracked in metadata&lt;/li&gt;
&lt;li&gt;It increases planning and scheduling complexity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By merging many small files into fewer large files, compaction directly addresses:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Small file problem&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata bloat in manifests&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inefficient scan patterns&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;How Standard Compaction Works&lt;/h2&gt;
&lt;p&gt;A typical Iceberg compaction job involves:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Scanning the table&lt;/strong&gt; to identify small files below a certain threshold.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reading and coalescing records&lt;/strong&gt; from multiple small files within a partition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing out new files&lt;/strong&gt; targeting an optimal size (commonly 128MB–512MB per file).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Creating a new snapshot&lt;/strong&gt; that references the new files and drops the older ones.&lt;/li&gt;
&lt;/ol&gt;
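&lt;p&gt;The grouping in step 2 is essentially a bin-packing problem. As a minimal illustration (not Iceberg&apos;s actual planner, which also considers partitioning and delete files), a greedy first-fit pass over the small files might look like this:&lt;/p&gt;

```python
# Illustrative sketch of bin packing: group small files so each rewritten
# output file approaches the target size. Not Iceberg's real planner.
TARGET_BYTES = 128 * 1024 * 1024  # 128 MB target file size

def bin_pack(file_sizes, target=TARGET_BYTES):
    """Greedy first-fit over descending sizes; returns groups of files to rewrite together."""
    groups = []
    for size in sorted(file_sizes, reverse=True):
        for group in groups:
            if target >= sum(group) + size:  # file still fits in this output group
                group.append(size)
                break
        else:  # no existing group has room: start a new output file
            groups.append([size])
    return groups

mb = 1024 * 1024
# Ten 20 MB files collapse into two output files instead of ten tiny reads.
print(len(bin_pack([20 * mb] * 10)))  # prints 2
```

&lt;p&gt;Real compaction planners also respect partition boundaries; files from different partitions are never packed into the same output file.&lt;/p&gt;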
&lt;p&gt;This process can be orchestrated using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt; with Iceberg’s &lt;code&gt;RewriteDataFiles&lt;/code&gt; action&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; with its &lt;code&gt;OPTIMIZE&lt;/code&gt; command&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example: Spark Action&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;import org.apache.iceberg.spark.actions.SparkActions

SparkActions
  .get()
  .rewriteDataFiles(table)
  .option(&amp;quot;target-file-size-bytes&amp;quot;, (128L * 1024 * 1024).toString) // 128 MB; option values are Strings
  .execute()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will identify and bin-pack small files across partitions, replacing them with larger files.&lt;/p&gt;
&lt;h2&gt;Tips for Running Compaction&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Target file size:&lt;/strong&gt; Match your engine’s ideal scan size. 128MB or 256MB often work well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition scope:&lt;/strong&gt; You can compact per partition to avoid touching the entire table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Job parallelism:&lt;/strong&gt; Tune parallelism to handle large volumes efficiently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Avoid overlap:&lt;/strong&gt; If streaming ingestion is running, compaction jobs should avoid writing to the same partitions concurrently (we’ll cover this in Part 3).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;When Should You Run It?&lt;/h2&gt;
&lt;p&gt;That depends on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ingestion frequency:&lt;/strong&gt; Frequent writes = more small files = more frequent compaction&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query behavior:&lt;/strong&gt; If queries touch recently ingested data, compact often&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table size and storage costs:&lt;/strong&gt; The larger the table, the more benefit from compaction&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In many cases, a daily or hourly schedule works well. Some platforms support event-driven compaction based on file count or size thresholds.&lt;/p&gt;
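&lt;p&gt;Such an event-driven trigger can be as simple as counting small files per partition after each commit. The thresholds below are illustrative assumptions, not Iceberg defaults:&lt;/p&gt;

```python
# Hypothetical event-driven compaction trigger: fire when a partition
# accumulates too many files below a "small file" threshold.
SMALL_FILE_BYTES = 32 * 1024 * 1024  # treat files under 32 MB as small (assumption)
MAX_SMALL_FILES = 20                 # compact once a partition exceeds this count

def should_compact(partition_file_sizes):
    """Return True if this partition has accumulated enough small files to justify a compaction run."""
    small_count = sum(1 for s in partition_file_sizes if SMALL_FILE_BYTES > s)
    return small_count > MAX_SMALL_FILES

mb = 1024 * 1024
print(should_compact([2 * mb] * 50))   # 50 tiny files in one partition: True
print(should_compact([256 * mb] * 5))  # a few well-sized files: False
```

&lt;p&gt;A scheduler or stream processor can evaluate a check like this after each commit and enqueue a compaction job only for the partitions that need it.&lt;/p&gt;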
&lt;h2&gt;Tradeoffs&lt;/h2&gt;
&lt;p&gt;While compaction boosts performance, it also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Consumes compute and I/O resources&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Temporarily increases storage (until old files are expired)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Can interfere with concurrent writes if not carefully scheduled&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s why timing and scope matter—a theme we’ll return to later in this series.&lt;/p&gt;
&lt;h2&gt;Up Next&lt;/h2&gt;
&lt;p&gt;Now that you understand standard compaction, the next challenge is applying it without interrupting streaming workloads. In Part 3, we’ll explore techniques to make compaction faster, safer, and more incremental for real-time pipelines.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Cost of Neglect — How Apache Iceberg Tables Degrade Without Optimization</title><link>https://iceberglakehouse.com/posts/iceberg-optimization-degradation/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/iceberg-optimization-degradation/</guid><description>
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;utm_m...</description><pubDate>Tue, 15 Jul 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=optimization_blogs&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;The Cost of Neglect — How Apache Iceberg Tables Degrade Without Optimization&lt;/h1&gt;
&lt;p&gt;Apache Iceberg offers powerful features for managing large-scale datasets with reliability, versioning, and schema evolution. But like any robust system, Iceberg tables require care and maintenance. Without ongoing optimization, even the most well-designed Iceberg table can degrade—causing query slowdowns, ballooning metadata, and rising infrastructure costs.&lt;/p&gt;
&lt;p&gt;This post kicks off a 10-part series on &lt;strong&gt;Apache Iceberg Table Optimization&lt;/strong&gt;, beginning with a look at &lt;em&gt;what happens when you don’t optimize&lt;/em&gt; and why it matters.&lt;/p&gt;
&lt;h2&gt;Why Do Iceberg Tables Degrade?&lt;/h2&gt;
&lt;p&gt;At its core, Iceberg uses a &lt;strong&gt;table metadata layer&lt;/strong&gt; to track the location and structure of physical files (data files, manifests, and manifest lists). Over time, various ingestion patterns—batch loads, streaming micro-batches, late-arriving records—can lead to an accumulation of inefficiencies:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Small Files Problem&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Each write operation typically creates a new data file. In streaming or frequent ingestion pipelines, this can lead to thousands of tiny files that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Increase the number of file system operations during scans&lt;/li&gt;
&lt;li&gt;Reduce the effectiveness of predicate pushdown and pruning&lt;/li&gt;
&lt;li&gt;Add overhead to table metadata (larger manifest files)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Fragmented Manifests&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Each new snapshot creates new manifest files. If the same files appear in many manifests or are not compacted, snapshot metadata becomes expensive to read and maintain.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Bloated Snapshots&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Iceberg maintains a full history of table snapshots unless explicitly expired. Over time, this bloats the metadata layer with obsolete entries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Slows down time travel and rollback operations&lt;/li&gt;
&lt;li&gt;Inflates table size even if the data volume is static&lt;/li&gt;
&lt;li&gt;Consumes storage and memory unnecessarily&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Unclustered or Unsorted Data&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Without explicit clustering or sort order, files may be written in a way that scatters relevant records across multiple files. This leads to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Increased scan ranges and data reads during filtering&lt;/li&gt;
&lt;li&gt;Poor locality for analytical queries&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;Partition Imbalance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;When partitions grow at uneven rates, you may end up with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Some partitions containing massive files&lt;/li&gt;
&lt;li&gt;Others being overloaded with small files&lt;/li&gt;
&lt;li&gt;Query planning bottlenecks on overgrown partitions&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Are the Consequences?&lt;/h2&gt;
&lt;p&gt;These degradations manifest as tangible issues across your data platform:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Performance Hits:&lt;/strong&gt; Query scans take longer and use more compute resources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Higher Costs:&lt;/strong&gt; More files and metadata inflate cloud storage bills and increase query processing cost in engines like Dremio, Trino, or Spark.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Longer Maintenance Windows:&lt;/strong&gt; Snapshot expiration, schema evolution, and compaction become more expensive over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced Freshness and Responsiveness:&lt;/strong&gt; Particularly in streaming use cases, lag builds up if optimizations are not happening incrementally.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Causes This Degradation?&lt;/h2&gt;
&lt;p&gt;Most of these issues stem from a lack of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regular &lt;strong&gt;compaction&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Snapshot and metadata &lt;strong&gt;cleanup&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Monitoring table &lt;strong&gt;health metrics&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clustering and layout optimization&lt;/strong&gt; during writes&lt;/li&gt;
&lt;/ul&gt;
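&lt;p&gt;Monitoring table health metrics can start small. The sketch below (hypothetical thresholds and names, for illustration only) summarizes per-partition file statistics so small-file buildup and partition imbalance become visible before queries slow down:&lt;/p&gt;

```python
# Hypothetical health report: per-partition file count, average file size,
# and the ratio of "small" files (the 32 MB threshold is an assumption).
SMALL_FILE_BYTES = 32 * 1024 * 1024

def partition_health(files_by_partition):
    """files_by_partition: dict mapping a partition value to a list of file sizes in bytes."""
    report = {}
    for partition, sizes in files_by_partition.items():
        small = sum(1 for s in sizes if SMALL_FILE_BYTES > s)
        report[partition] = {
            "file_count": len(sizes),
            "avg_size_mb": round(sum(sizes) / len(sizes) / (1024 * 1024), 1),
            "small_file_ratio": round(small / len(sizes), 2),
        }
    return report

mb = 1024 * 1024
stats = partition_health({"2025-07-01": [8 * mb] * 40, "2025-06-30": [256 * mb] * 2})
print(stats["2025-07-01"]["small_file_ratio"])  # prints 1.0: this partition needs compaction
```

&lt;p&gt;In practice you would pull these numbers from Iceberg&apos;s metadata tables (for example, the &lt;code&gt;files&lt;/code&gt; metadata table) rather than hand-built dictionaries.&lt;/p&gt;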
&lt;h2&gt;Looking Ahead&lt;/h2&gt;
&lt;p&gt;The good news is that Apache Iceberg provides powerful tools to fix these issues—with the right strategy. In the next posts, we’ll break down each optimization method, starting with standard compaction and how to implement it effectively.&lt;/p&gt;
&lt;p&gt;Stay tuned for Part 2: &lt;strong&gt;The Basics of Compaction — Bin Packing Your Data for Efficiency&lt;/strong&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>How to Discover or Organize Lakehouse &amp; Apache Iceberg Meetups</title><link>https://iceberglakehouse.com/posts/2025-07-discovering-or-organizing-lakehouse-iceberg-meetups/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-07-discovering-or-organizing-lakehouse-iceberg-meetups/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Thu, 03 Jul 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-meetups&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Planning a meetup around Apache Iceberg or modern data lakehouse architectures? Whether you&apos;re looking to host your first community event or expand your existing network, discovering and organizing meetups can be both rewarding and impactful. These gatherings offer an opportunity to connect with other data professionals, share best practices, and explore cutting-edge tools and architectures. In this blog, we&apos;ll explore how to find and collaborate with existing data communities, discover upcoming Iceberg and lakehouse-related events, and provide tips on organizing your own meetup. We&apos;ll also share links to online communities, tools, and platforms to help you build momentum around your event and grow your local or virtual data community.&lt;/p&gt;
&lt;h1&gt;Step 1: Join the Related Communities&lt;/h1&gt;
&lt;p&gt;Slack communities for the different lakehouse projects are among the best places to find people to collaborate with. Some communities have dedicated meetup channels that make it easier to discover people looking to collaborate in your area.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://join.slack.com/t/dataeventssla-pnp1776/shared_invite/zt-38vgrooy9-U9ral_gr3NAz_Siih1QwmQ&quot;&gt;Data Events Slack Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://join.slack.com/t/thedatalakehousehub/shared_invite/zt-274yc8sza-mI2zhCW8LGkOh1uxuf8T5Q&quot;&gt;The Data Lakehouse Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://iceberg.apache.org/community/&quot;&gt;Apache Iceberg Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://polaris.apache.org/community/&quot;&gt;The Apache Polaris Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hudi.apache.org/community/get-involved&quot;&gt;The Apache Hudi Slack&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://delta.io/community/&quot;&gt;Delta Lake Slack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Step 2: Where to Collaborate&lt;/h1&gt;
&lt;p&gt;A good pattern to use is to create a meetup channel if it doesn&apos;t already exist for your area like &lt;code&gt;#meetup-atlanta&lt;/code&gt; and then invite people to join the channel to collaborate on local meetups.&lt;/p&gt;
&lt;h3&gt;Data Events Slack Community&lt;/h3&gt;
&lt;p&gt;The Data Events Slack Community is a great place to find people to collaborate with. Here are the existing meetup channels in the Data Events Slack Community:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;meetup-argentina&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-australia&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-brazil&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-california&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-canada&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-chile&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-china&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-colombia&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-colorado&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-egypt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-florida&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-france&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-georgia&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-germany&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-illinois&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-india&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-ireland&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-israel&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-japan&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-massachusetts&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-mexico&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-netherlands&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-newyork&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-northcarolina&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-singapore&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-southafrica&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-southkorea&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-sweden&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-texas&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-uk&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-utah&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-washington&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Iceberg Slack&lt;/h3&gt;
&lt;p&gt;Currently, the following meetup channels exist in the Apache Iceberg Slack workspace:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;meetup-atlanta&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-austin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-bayarea&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-boston&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-chicago&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-denver&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-nola&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-orlando&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-seattle&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; There are no channels for other cities because the ability to create channels has been turned off in the Iceberg Slack; my suggestion is to create the channel in the Data Lakehouse Hub Slack instead.&lt;/p&gt;
&lt;h3&gt;Data Lakehouse Hub Slack&lt;/h3&gt;
&lt;p&gt;Here are the existing meetup channels in the Data Lakehouse Hub Slack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;meetup-atlanta&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-austin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-barcelona&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-boston&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-chicago&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-denver&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-london&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-miami&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-munich&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-nyc&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-nola&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-orlando&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-san-francisco&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-santa-clara&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;meetup-seattle&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Polaris Slack&lt;/h3&gt;
&lt;p&gt;There is a &lt;code&gt;#meetup-attendee&lt;/code&gt; and &lt;code&gt;#meetup-organizer&lt;/code&gt; channel in the Apache Polaris Slack along with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;#meetup-nyc-austin-boston-atlanta&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;#meetup-sanfran-seattle-denver-chicago&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Step 3: Find or Propose Events&lt;/h1&gt;
&lt;p&gt;By reading these channels you should be able to discover upcoming Iceberg and lakehouse-related events in your area. If you want to organize an event you can propose an event and see who would want to collaborate in organizing the event.&lt;/p&gt;
&lt;h1&gt;Step 4: Organize the Event&lt;/h1&gt;
&lt;h3&gt;Naming Your Event&lt;/h3&gt;
&lt;p&gt;The simplest way to organize your event is under a name like &lt;code&gt;X Lakehouse Meetup&lt;/code&gt;, where &lt;code&gt;X&lt;/code&gt; is the city or region (for example, &lt;code&gt;Atlanta Lakehouse Meetup&lt;/code&gt;), and you can run the meetup any way you like. If you want to use a name like &lt;code&gt;Atlanta Apache Iceberg Meetup&lt;/code&gt;, you can, but you need to follow the &lt;a href=&quot;https://lists.apache.org/thread/ls2rg4xcwk9hnhtotor5f9xsrbdknw1s&quot;&gt;recently approved guidelines&lt;/a&gt; for doing so to avoid trademark issues with the Apache Software Foundation.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg should be championed&lt;/strong&gt; in every meetup &lt;em&gt;and&lt;/em&gt; technical
session (after all, we&apos;re here to support this technology and our community)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All talks should be vendor-neutral&lt;/strong&gt; and not sales pitches (of course
vendors can be mentioned, but that should never be the point of the talk)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Each meetup should have &lt;em&gt;at least&lt;/em&gt; two talks&lt;/strong&gt; with speakers
representing different companies/organizations (we need to champion
diversity of thought)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Planned meetups ought to be brought to the attention of the dev list&lt;/strong&gt;
(this is to promote transparency and raise awareness)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These rules also include holding an open call for speakers prior to the event, deciding on the speakers together with all event sponsors, and allowing others to sponsor the event if they want to.&lt;/p&gt;
&lt;h3&gt;Organizing the Event&lt;/h3&gt;
&lt;p&gt;Essentially you have three main costs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Venue&lt;/li&gt;
&lt;li&gt;Drinks&lt;/li&gt;
&lt;li&gt;Food&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So soliciting people to co-sponsor the event, either by splitting these costs or by having different sponsors each cover one of them, is a good way to organize the event.&lt;/p&gt;
&lt;p&gt;All contributing sponsors should have their logos on the event promotion. You&apos;ll want all these details squared away to allow at least two weeks of promotion before the event, if not more.&lt;/p&gt;
&lt;h3&gt;Promoting the Event&lt;/h3&gt;
&lt;p&gt;You should first create either a Meetup or Luma (lu.ma) listing for the event. For Apache Iceberg meetups, there are community-run outlets where you can post your event.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://lu.ma/apache-iceberg?k=c&quot;&gt;Apache Iceberg Meetups Luma Calendar&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.meetup.com/na-apache-iceberg-meetups/?eventOrigin=home_groups_you_organize&quot;&gt;North America Community Run Apache Iceberg Meetups&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here are some other Luma Calendars and Meetup Groups you may want to follow for Lakehouse Events:&lt;/p&gt;
&lt;h5&gt;Meetup Groups&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.meetup.com/north-american-open-data-lakehouse-linkups/?eventOrigin=home_groups_you_organize&quot;&gt;North American Open Lakehouse Linkups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.meetup.com/iceberg-data-lakehouse-meetups/?eventOrigin=home_groups_you_organize&quot;&gt;Open Lakehouse Meetups&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Luma Calendars&lt;/h5&gt;
&lt;p&gt;Message calendars@datalakehousehub.com to get your event added to these calendars; include a link to your Luma or Meetup event listing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/datalakehousemeetupsinternational?k=c&amp;amp;period=past&quot;&gt;Data Lakehouse Meetups International&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/eastcoastuslakehousemeetups?k=c&amp;amp;period=past&quot;&gt;East Coast US Open Lakehouse Events&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/westcoastlakehouse?k=c&amp;amp;period=past&quot;&gt;West Coast US Open Lakehouse Events&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/Lakehouselinkups?k=c&quot;&gt;Lakehouse Linkups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/NYCDataLakehouse?k=c&quot;&gt;NYC Data Lakehouse Events&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/Orlandodata?k=c&amp;amp;period=past&quot;&gt;Orlando Data Lakehouse Events&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Social Media&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Make sure everyone involved is posting about the event on LinkedIn, Twitter/X, and Bluesky.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Emails&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Sponsors should send emails about the event to their lists if they can. Use Luma to email attendees 7 days, 24 hours, and 2 hours before the event with reminders and any logistics details they should know. Offering each sponsor a link in these emails to a related blog or asset is a good idea.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Bringing together the Lakehouse and Apache Iceberg community through meetups is one of the most effective ways to foster collaboration, share knowledge, and build meaningful relationships across organizations and regions. Whether you&apos;re organizing your first meetup or joining an existing one, the open and welcoming nature of these communities makes it easy to get involved. By leveraging platforms like Slack, Luma, and Meetup, and by following best practices for organizing inclusive and impactful events, you can help grow the ecosystem and play a key role in advancing open data architectures. So jump into a meetup channel, connect with others, and start planning — your community is waiting.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What is an API? And Why Data Architecture Depends on Them</title><link>https://iceberglakehouse.com/posts/2025-06-what-is-an-api/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-06-what-is-an-api/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Mon, 23 Jun 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=what-is-an-api&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=what-is-an-api&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=what-is-an-api&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Imagine walking into a restaurant in a foreign country where you don’t speak the language. You point at things, gesture wildly, maybe even draw pictures — anything to communicate what you want. But if you and the server spoke a common language like English or Spanish, things would go a lot smoother.&lt;/p&gt;
&lt;p&gt;That’s exactly what APIs do for software systems. They are shared languages that define how software components talk to each other. Without a shared API, systems can&apos;t collaborate easily, leading to miscommunication, friction, or total breakdown.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll unpack what APIs are and why they’re critical in data architecture. We&apos;ll explore the different types of APIs, how they&apos;ve shaped modern data workflows, and the standards that have emerged in key areas like storage, data transport, and cataloging. Whether you&apos;re a developer building integrations or a data architect planning your stack, understanding these APIs is essential for navigating today&apos;s complex data ecosystem.&lt;/p&gt;
&lt;h2&gt;What is an API?&lt;/h2&gt;
&lt;p&gt;An API, or Application Programming Interface, is like a contract that defines how different software components can interact. Think of it as a language specification — if two programs speak the same API, they can communicate effectively, even if they&apos;re written in different languages or run on different platforms.&lt;/p&gt;
&lt;p&gt;Just like a language has rules for grammar and vocabulary, an API defines the rules for how requests are made, what data is expected, and how responses are structured. When software follows these rules, integration becomes smooth and predictable.&lt;/p&gt;
&lt;p&gt;It&apos;s important to recognize that the term &amp;quot;API&amp;quot; can mean different things depending on context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In software development, an API can refer to the functions and methods exposed by a library or class. If one class implements the same method signatures as another, it can serve as a drop-in replacement.&lt;/li&gt;
&lt;li&gt;In system integration, APIs more commonly refer to how different applications or services communicate over a network, especially using HTTP. This includes how data is sent, what endpoints exist, and how authentication is handled.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence, APIs enable modularity and collaboration in software. They allow teams to build components independently, knowing they can connect through a well-defined interface.&lt;/p&gt;
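&lt;p&gt;The drop-in replacement idea above can be sketched in a few lines of Python. The class and method names here are purely illustrative; the point is that two components exposing the same method signatures are interchangeable to their callers:&lt;/p&gt;

```python
# Two storage backends that implement the same informal API:
# save(key, value) and load(key). Because the method signatures
# match, either class can be swapped in without changing callers.
# All names here are illustrative, not from any real library.

class InMemoryStore:
    def __init__(self):
        self._data = {}

    def save(self, key, value):
        self._data[key] = value

    def load(self, key):
        return self._data.get(key)


class LoggingStore:
    """Same interface, different behavior: records every save."""
    def __init__(self):
        self._data = {}
        self.log = []

    def save(self, key, value):
        self.log.append(key)
        self._data[key] = value

    def load(self, key):
        return self._data.get(key)


def cache_result(store, key, value):
    # Works with any object that speaks the same API.
    store.save(key, value)
    return store.load(key)
```

&lt;p&gt;Either class satisfies the contract, so &lt;code&gt;cache_result&lt;/code&gt; never needs to know which one it received.&lt;/p&gt;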
&lt;h2&gt;The Four Horsemen of HTTP APIs&lt;/h2&gt;
&lt;p&gt;When most people talk about APIs in modern software systems, they’re usually referring to HTTP-based APIs — interfaces that allow software to communicate over the web or internal networks. Over time, four main styles of HTTP APIs have emerged, each with its own strengths and trade-offs.&lt;/p&gt;
&lt;h3&gt;1. SOAP (Simple Object Access Protocol)&lt;/h3&gt;
&lt;p&gt;SOAP is a protocol-based API style that uses XML to encode messages and enforces strict standards for how messages are structured. It includes built-in specifications for things like security and error handling. While powerful, SOAP is often seen as heavyweight and complex, which has led to a decline in its use for most new applications.&lt;/p&gt;
&lt;h3&gt;2. REST (Representational State Transfer)&lt;/h3&gt;
&lt;p&gt;REST is more lightweight and flexible. It uses standard HTTP methods like GET, POST, PUT, and DELETE to perform operations on resources, which are identified via URLs. REST APIs are stateless, meaning each request contains all the information needed to process it. REST&apos;s simplicity and widespread adoption have made it the go-to style for many web services.&lt;/p&gt;
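&lt;p&gt;Here is a minimal, framework-free sketch of the REST pattern: standard HTTP verbs acting on resources identified by URL paths. The &lt;code&gt;users&lt;/code&gt; resource, records, and status-code choices are all illustrative:&lt;/p&gt;

```python
# A tiny in-memory dispatcher illustrating REST's core idea:
# the verb says what to do, the path says which resource.

users = {"1": {"name": "Ada"}}

def handle(method, path):
    parts = path.strip("/").split("/")   # e.g. "/users/1"
    if parts[0] != "users":
        return 404, None
    if method == "GET" and len(parts) == 2:
        user = users.get(parts[1])
        return (200, user) if user else (404, None)
    if method == "POST" and len(parts) == 1:
        new_id = str(len(users) + 1)
        users[new_id] = {"name": "new user"}
        return 201, {"id": new_id}
    if method == "DELETE" and len(parts) == 2:
        users.pop(parts[1], None)
        return 204, None
    return 405, None
```

&lt;p&gt;A real service would sit behind an HTTP server, but the mapping from verb plus path to action is the essence of the style.&lt;/p&gt;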
&lt;h3&gt;3. RPC (Remote Procedure Call)&lt;/h3&gt;
&lt;p&gt;RPC is all about invoking functions remotely. Instead of thinking in terms of resources, you think in terms of actions — like calling a method named &lt;code&gt;getUserDetails&lt;/code&gt;. RPC can use different serialization formats (like JSON-RPC or gRPC) and tends to be more efficient for certain tasks, especially internal service communication.&lt;/p&gt;
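&lt;p&gt;To make the &lt;code&gt;getUserDetails&lt;/code&gt; example concrete, here is the envelope a JSON-RPC 2.0 call uses. The method name and parameters are illustrative; the &lt;code&gt;jsonrpc&lt;/code&gt;, &lt;code&gt;method&lt;/code&gt;, &lt;code&gt;params&lt;/code&gt;, and &lt;code&gt;id&lt;/code&gt; fields come from the JSON-RPC 2.0 specification:&lt;/p&gt;

```python
import json

# Build a JSON-RPC 2.0 request body. Note the action-oriented
# framing: you name a procedure to call, not a resource to fetch.

def make_rpc_request(method, params, request_id):
    return json.dumps({
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    })

request_body = make_rpc_request("getUserDetails", {"userId": 42}, 1)
```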
&lt;h3&gt;4. GraphQL&lt;/h3&gt;
&lt;p&gt;GraphQL allows clients to request exactly the data they need and nothing more. Instead of multiple endpoints, there’s typically a single endpoint that interprets a query language. This can reduce over-fetching and under-fetching of data and provides a more dynamic interface, especially useful for frontend applications.&lt;/p&gt;
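&lt;p&gt;A GraphQL client typically POSTs a JSON body containing the query document (and optional variables) to that single endpoint. A sketch, with illustrative field names:&lt;/p&gt;

```python
import json

# The client asks for exactly the fields it needs (name and email)
# and nothing more. The schema and fields here are illustrative.

query = """
query GetUser($id: ID!) {
  user(id: $id) {
    name
    email
  }
}
"""

body = json.dumps({"query": query, "variables": {"id": "42"}})
```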
&lt;p&gt;Each of these API types has its place in the ecosystem. Understanding their differences helps you pick the right tool for the job depending on complexity, flexibility, and performance needs.&lt;/p&gt;
&lt;h2&gt;Why APIs Matter in Modern Data Architecture&lt;/h2&gt;
&lt;p&gt;The modern data stack is a vibrant and diverse ecosystem. From ingestion tools and storage layers to transformation engines and visualization platforms, each component often comes from a different vendor or open-source project. The glue that holds this ecosystem together is the API.&lt;/p&gt;
&lt;p&gt;With so many tools available, the ability to integrate them seamlessly becomes a competitive advantage. Instead of reinventing the wheel, software platforms that adopt well-known APIs can plug into existing workflows and leverage established tooling. This interoperability allows teams to mix and match components without being locked into a single vendor or technology stack.&lt;/p&gt;
&lt;p&gt;For example, if two different tools both understand the same API for reading from a data catalog or writing to object storage, they can work together out of the box. This eliminates the need for custom connectors or fragile workarounds.&lt;/p&gt;
&lt;p&gt;APIs also encourage specialization. A tool can focus on doing one thing well — like cataloging metadata or transporting data — and expose an API that others can build upon. This modularity is what makes today&apos;s data architectures more flexible and scalable than ever before.&lt;/p&gt;
&lt;p&gt;In short, APIs are the foundation of composability in data systems. They allow different parts of the stack to evolve independently while still working together in harmony.&lt;/p&gt;
&lt;h2&gt;Case Study – The Ubiquity of the S3 API&lt;/h2&gt;
&lt;p&gt;Amazon S3 wasn&apos;t just a game changer because it offered scalable cloud storage. It also introduced a clean, consistent API that made storing and retrieving objects over the web straightforward. This API became so widely adopted that it evolved into a de facto standard for cloud object storage.&lt;/p&gt;
&lt;p&gt;As other cloud providers and storage platforms emerged, they faced a choice: create their own APIs or adopt the S3 API. Many chose the latter. Why? Because the S3 API already had a massive ecosystem of integrations. Backup tools, data lakes, ETL pipelines, and analytics platforms already knew how to talk to S3. By supporting the S3 API, new storage services could plug into these tools without requiring any custom development.&lt;/p&gt;
&lt;p&gt;This is a powerful example of how API adoption fuels interoperability. Instead of forcing users to learn a new interface or rebuild their workflows, S3-compatible services ride the wave of existing infrastructure. As a result, users get flexibility and choice without sacrificing compatibility.&lt;/p&gt;
&lt;p&gt;The takeaway: when an API reaches critical mass, it becomes more than a technical interface — it becomes an ecosystem enabler.&lt;/p&gt;
&lt;h2&gt;Data Transport APIs – From JDBC/ODBC to ADBC&lt;/h2&gt;
&lt;p&gt;Moving data between systems has always been a core challenge in data architecture. For decades, the standard approach involved using JDBC (Java Database Connectivity) and ODBC (Open Database Connectivity). These APIs allowed applications to connect to relational databases in a consistent way, abstracting the underlying database-specific protocols.&lt;/p&gt;
&lt;p&gt;While JDBC and ODBC have served well, they come with limitations. These APIs were designed for transactional systems and row-based data access. As analytics workloads became more complex and data volumes grew, these traditional interfaces began to show performance bottlenecks.&lt;/p&gt;
&lt;p&gt;It’s also important to note that JDBC and ODBC are not HTTP-based APIs. They operate over lower-level network protocols tailored to database drivers and client libraries. This can make them harder to integrate in cloud-native or language-agnostic environments.&lt;/p&gt;
&lt;p&gt;Enter ADBC (Arrow Database Connectivity), a modern alternative designed for analytical use cases. ADBC is an API standard built around the Apache Arrow columnar format, and its drivers can be backed by protocols such as Arrow Flight SQL, a gRPC-based protocol optimized for high-throughput data transport. Instead of transferring rows one by one, these columnar protocols send Arrow record batches over a persistent connection, dramatically improving efficiency for analytical queries.&lt;/p&gt;
&lt;p&gt;With ADBC, the API is designed for today’s needs: fast, language-agnostic, and cloud-friendly. It embraces open standards like Apache Arrow and gRPC to deliver performance without sacrificing interoperability.&lt;/p&gt;
&lt;p&gt;As analytics platforms grow more distributed and data-hungry, APIs like ADBC represent a forward-looking approach to data transport — one that matches the scale and speed of modern data systems.&lt;/p&gt;
&lt;h2&gt;Data Catalog APIs – Hive, Glue, and Iceberg REST&lt;/h2&gt;
&lt;p&gt;In a lakehouse, data catalogs store metadata about datasets — such as schema, location, and partitioning — so that tools can discover and manage data assets consistently. But for this ecosystem to function, catalogs need APIs that other tools can understand.&lt;/p&gt;
&lt;p&gt;Three primary catalog APIs have emerged in the lakehouse and analytics space:&lt;/p&gt;
&lt;h3&gt;1. Hive Metastore API&lt;/h3&gt;
&lt;p&gt;The Hive API was one of the earliest standards for metadata management in Hadoop-based systems. Because Apache Hive gained significant adoption early on, its metastore API became widely supported. Even tools that don’t use Hive for querying often support its API for interoperability.&lt;/p&gt;
&lt;h3&gt;2. AWS Glue Catalog API&lt;/h3&gt;
&lt;p&gt;As AWS became a dominant platform for cloud-native analytics, its Glue Catalog gained traction. Glue offered a managed alternative to Hive with cloud-native scalability and tight integration with AWS services. Many tools added support for Glue to integrate seamlessly within AWS ecosystems.&lt;/p&gt;
&lt;h3&gt;3. Apache Iceberg REST Catalog API&lt;/h3&gt;
&lt;p&gt;The Iceberg project initially struggled with catalog integration due to varying implementations. To solve this, the community introduced a REST-based catalog API that standardizes how tools interact with Iceberg catalogs regardless of the underlying backend. This REST interface provides a clear contract and enables broader compatibility. Catalogs that support the Iceberg REST Catalog (IRC) API include Apache Polaris (incubating), Apache Gravitino, Dremio Catalog, Open Catalog, AWS Glue Catalog, Lakekeeper, Nessie, Unity Catalog, and many more. Most specialized Iceberg tooling uses this as the main catalog API for discovering your Apache Iceberg datasets, while catalogs like Polaris, Gravitino, and Unity Catalog also adopt other APIs to make additional datasets discoverable.&lt;/p&gt;
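&lt;p&gt;As a rough sketch of what talking to an Iceberg REST catalog looks like, the helpers below compose its namespace- and table-listing URLs. The base URL is hypothetical and no request is actually sent; consult the Iceberg REST catalog OpenAPI specification for the authoritative paths and payloads:&lt;/p&gt;

```python
# Compose Iceberg REST catalog URLs. The base URL is hypothetical;
# the /v1/namespaces and /v1/namespaces/{ns}/tables paths follow
# the Iceberg REST catalog OpenAPI spec (some deployments also
# insert a {prefix} segment after /v1).

BASE = "https://catalog.example.com/api/catalog"   # hypothetical endpoint

def list_namespaces_url(base):
    # GET here returns the namespaces the catalog exposes.
    return base + "/v1/namespaces"

def list_tables_url(base, namespace):
    # GET here returns the Iceberg tables within one namespace.
    return base + "/v1/namespaces/" + namespace + "/tables"
```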
&lt;p&gt;Today, most lakehouse tools support one or more of these APIs to ensure compatibility across different environments. Whether you&apos;re working with on-prem systems using Hive, cloud-native stacks using Glue, or modern lakehouse engines built around Iceberg, API adoption remains the key to ecosystem integration.&lt;/p&gt;
&lt;p&gt;Choosing catalog tools that support these APIs ensures you&apos;re building on a foundation that promotes interoperability, flexibility, and future-proofing.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;APIs are more than just technical interfaces — they are the connective tissue of modern software. In data architecture, where tools span a wide range of functions and vendors, APIs enable these components to work together smoothly.&lt;/p&gt;
&lt;p&gt;We’ve seen how APIs act like shared languages, allowing software to communicate efficiently. From foundational HTTP-based APIs like REST and GraphQL, to specialized data interfaces like the S3 API, JDBC, ADBC, and various catalog APIs, each plays a role in shaping the data landscape.&lt;/p&gt;
&lt;p&gt;By adopting established APIs, tools become more compatible, easier to integrate, and more valuable within the broader ecosystem. And for data teams, aligning on common APIs means less time wrestling with custom connectors and more time delivering insights.&lt;/p&gt;
&lt;p&gt;As the data world continues to evolve, understanding and leveraging key APIs is essential. They’re not just part of the plumbing — they’re a strategic asset for building robust, scalable, and flexible data systems.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Decoding AWS EC2 Instance Type Names</title><link>https://iceberglakehouse.com/posts/2025-06-AWS-Instance-Types/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-06-AWS-Instance-Types/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Wed, 18 Jun 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;If you&apos;ve ever browsed AWS EC2 instance types and found yourself staring blankly at names like &lt;code&gt;m5.large&lt;/code&gt;, &lt;code&gt;c6g.xlarge&lt;/code&gt;, or &lt;code&gt;r7a.2xlarge&lt;/code&gt;, you&apos;re not alone. At first glance, these names can feel cryptic—like trying to decode a secret code.&lt;/p&gt;
&lt;p&gt;But here&apos;s the good news: there&apos;s a method to the madness. Each part of an instance type name tells you something important about the underlying hardware, performance characteristics, and intended use case.&lt;/p&gt;
&lt;p&gt;In this blog post, we&apos;ll break down the structure of AWS instance type names and show you how to read them like a pro. Once you understand how to interpret each component, you&apos;ll be able to confidently choose the right instance for your workload—and maybe even impress your colleagues with your cloud fluency.&lt;/p&gt;
&lt;h2&gt;The Anatomy of an Instance Type&lt;/h2&gt;
&lt;p&gt;Every AWS EC2 instance type name is composed of distinct parts that reveal critical details about the instance&apos;s capabilities. The general structure looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[family][generation][optional suffix].[size]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s take the instance type &lt;code&gt;c6g.large&lt;/code&gt; as an example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;c&lt;/code&gt; → &lt;strong&gt;Compute optimized&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;6&lt;/code&gt; → &lt;strong&gt;6th generation hardware&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;g&lt;/code&gt; → &lt;strong&gt;Powered by AWS Graviton (ARM-based processor)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;large&lt;/code&gt; → &lt;strong&gt;Medium-sized instance (typically 2 vCPUs and 4 GB RAM)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By understanding what each segment means, you can quickly assess whether an instance is optimized for compute, memory, storage, or GPU, and how big or powerful it is.&lt;/p&gt;
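&lt;p&gt;The naming pattern above is regular enough to parse mechanically. Here is a small, illustrative Python sketch that splits a name into its four parts (a few exotic names, such as bare-metal variants with dashes like &lt;code&gt;u-6tb1.metal&lt;/code&gt;, will not match):&lt;/p&gt;

```python
import re

# Parse an EC2 instance type name into the four parts described
# above: [family][generation][optional suffix].[size].

PATTERN = re.compile(r"^([a-z]+?)(\d+)([a-z]*)\.([0-9a-z]+)$")

def parse_instance_type(name):
    match = PATTERN.match(name)
    if match is None:
        return None
    family, generation, suffix, size = match.groups()
    return {
        "family": family,
        "generation": int(generation),
        "suffix": suffix,
        "size": size,
    }
```

&lt;p&gt;For example, &lt;code&gt;parse_instance_type(&amp;quot;c6g.large&amp;quot;)&lt;/code&gt; yields family &lt;code&gt;c&lt;/code&gt;, generation 6, suffix &lt;code&gt;g&lt;/code&gt;, and size &lt;code&gt;large&lt;/code&gt;.&lt;/p&gt;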
&lt;p&gt;In the sections below, we’ll walk through each part of the name in more detail.&lt;/p&gt;
&lt;h2&gt;Family – What Is the Instance Optimized For?&lt;/h2&gt;
&lt;p&gt;The first letter (or set of letters) in an instance type indicates the &lt;strong&gt;instance family&lt;/strong&gt;, which tells you what the instance is optimized for. This helps guide your choice based on the nature of your workload—whether you need general-purpose performance, high CPU, large memory, or GPU acceleration.&lt;/p&gt;
&lt;p&gt;Here’s a quick overview of the most common instance families:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Common Use Cases&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Burstable general purpose&lt;/td&gt;
&lt;td&gt;Development, low-traffic websites&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;m&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;General purpose&lt;/td&gt;
&lt;td&gt;Balanced CPU and memory workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;c&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Compute optimized&lt;/td&gt;
&lt;td&gt;High-performance computing, batch processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;r&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Memory optimized&lt;/td&gt;
&lt;td&gt;In-memory databases, real-time analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extra memory optimized&lt;/td&gt;
&lt;td&gt;SAP HANA, memory-intensive enterprise apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;i&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Storage optimized (high IOPS)&lt;/td&gt;
&lt;td&gt;NoSQL databases, large transactional systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;g&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GPU instances&lt;/td&gt;
&lt;td&gt;Machine learning, video rendering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;High-performance GPU&lt;/td&gt;
&lt;td&gt;Deep learning training, scientific modeling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;h&lt;/code&gt;, &lt;code&gt;d&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Specialized families&lt;/td&gt;
&lt;td&gt;Varies (HPC, local storage, high-frequency)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Understanding the family is the first step in selecting the right instance. For example, if your application is CPU-bound, a &lt;code&gt;c&lt;/code&gt; family instance will typically deliver better performance per dollar than an &lt;code&gt;m&lt;/code&gt; or &lt;code&gt;t&lt;/code&gt; instance.&lt;/p&gt;
&lt;h2&gt;Generation – How New Is the Hardware?&lt;/h2&gt;
&lt;p&gt;The number immediately following the family letter represents the &lt;strong&gt;generation&lt;/strong&gt; of the instance. AWS continuously improves its infrastructure, and newer generations typically offer better performance, energy efficiency, and cost-effectiveness compared to older ones.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;m4&lt;/code&gt; → 4th generation general-purpose instance&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m5&lt;/code&gt; → Newer 5th generation version&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m6g&lt;/code&gt; → 6th generation with Graviton (ARM-based processor)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why It Matters:&lt;/h3&gt;
&lt;p&gt;Choosing a newer generation instance usually means access to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Improved CPUs (e.g., Intel Ice Lake, AMD EPYC, or AWS Graviton)&lt;/li&gt;
&lt;li&gt;Better network and storage throughput&lt;/li&gt;
&lt;li&gt;Lower cost for similar or better performance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That said, not all regions have the latest generation available. Always check your region’s instance offerings and benchmark critical workloads if performance is a top priority.&lt;/p&gt;
&lt;h2&gt;Suffix – Special Chips or Capabilities&lt;/h2&gt;
&lt;p&gt;Some instance types include an optional &lt;strong&gt;suffix&lt;/strong&gt;—a letter (or combination of letters) that provides additional detail about the instance’s hardware or features. These suffixes appear immediately after the generation number and can help you identify special variants optimized for particular use cases.&lt;/p&gt;
&lt;h3&gt;Common Suffixes and What They Mean:&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Suffix&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AMD EPYC processor&lt;/td&gt;
&lt;td&gt;Cost-effective alternative to Intel-based instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;g&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AWS Graviton processor (ARM-based)&lt;/td&gt;
&lt;td&gt;Energy-efficient, high performance, lower cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;n&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Network-optimized&lt;/td&gt;
&lt;td&gt;Enhanced network bandwidth and performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;d&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Includes local NVMe storage&lt;/td&gt;
&lt;td&gt;Fast local instance storage for low-latency workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;e&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extended memory or enhanced features&lt;/td&gt;
&lt;td&gt;More memory or improved capabilities per vCPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;z&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;High-frequency Intel CPUs&lt;/td&gt;
&lt;td&gt;For workloads that need very high clock speed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Example:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r6a&lt;/code&gt; → Memory optimized (r), 6th generation, AMD processor (a)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;m6g&lt;/code&gt; → General purpose (m), 6th generation, Graviton processor (g)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;i3d&lt;/code&gt; → Storage optimized (i), 3rd generation, with NVMe instance store (d)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These suffixes allow you to fine-tune your instance selection based on price, performance, or architecture preferences—especially important if your software is architecture-sensitive (e.g., x86 vs ARM).&lt;/p&gt;
&lt;h2&gt;Size – How Big Is the Instance?&lt;/h2&gt;
&lt;p&gt;The part of the instance type that comes &lt;strong&gt;after the period (&lt;code&gt;.&lt;/code&gt;)&lt;/strong&gt; defines the &lt;strong&gt;size&lt;/strong&gt; of the instance. This determines how many vCPUs, how much memory, and sometimes how much networking or storage bandwidth is allocated.&lt;/p&gt;
&lt;p&gt;AWS uses consistent naming for sizes across instance families:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Typical vCPUs&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.nano&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Very small&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;For ultra-light workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.micro&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Small&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Entry-level, burstable performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.small&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Modest&lt;/td&gt;
&lt;td&gt;1–2&lt;/td&gt;
&lt;td&gt;Slightly more consistent CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.medium&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;1–2&lt;/td&gt;
&lt;td&gt;Balanced for small apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.large&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2x baseline&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Common for dev/test workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4x baseline&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Heavier compute or memory needs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.2xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8x baseline&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Medium to large production loads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.4xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16x baseline&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;High-capacity apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.8xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;32x baseline&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;Data processing, analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.12xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;48x baseline&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;High-scale enterprise workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.24xlarge&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;96x baseline&lt;/td&gt;
&lt;td&gt;96&lt;/td&gt;
&lt;td&gt;Very high-performance computing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.metal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bare metal (no hypervisor)&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Full access to physical server&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Example:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;m5.large&lt;/code&gt; = General-purpose instance, 5th generation, with 2 vCPUs and 8 GB memory.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;c6g.4xlarge&lt;/code&gt; = Compute optimized, 6th gen, Graviton processor, with 16 vCPUs and 32 GB memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Choosing the right size allows you to scale &lt;strong&gt;vertically&lt;/strong&gt; by increasing resources within a single instance, or &lt;strong&gt;horizontally&lt;/strong&gt; by adding more instances of a smaller size depending on your architecture and cost goals.&lt;/p&gt;
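&lt;p&gt;The size-to-vCPU pattern in the table can be captured in a small helper. Treat the numbers as typical values only: exact vCPU and memory figures vary by family and generation, so always confirm against the AWS documentation:&lt;/p&gt;

```python
# Rough vCPU counts implied by the size names in the table above.
# These are typical values, not guarantees: actual figures vary
# by instance family and generation.

BASE_SIZES = {"nano": 1, "micro": 1, "small": 1, "medium": 1, "large": 2, "xlarge": 4}

def typical_vcpus(size):
    if size in BASE_SIZES:
        return BASE_SIZES[size]
    if size.endswith("xlarge"):
        # e.g. "4xlarge" is 4 times an xlarge (4 vCPUs each).
        multiplier = int(size[: -len("xlarge")])
        return multiplier * 4
    return None   # e.g. "metal" varies by the underlying server
```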
&lt;h2&gt;Pulling It All Together&lt;/h2&gt;
&lt;p&gt;Now that you understand each component—&lt;strong&gt;family&lt;/strong&gt;, &lt;strong&gt;generation&lt;/strong&gt;, &lt;strong&gt;suffix&lt;/strong&gt;, and &lt;strong&gt;size&lt;/strong&gt;—you can decode any EC2 instance type and understand exactly what it offers.&lt;/p&gt;
&lt;p&gt;Let’s break down a few examples to reinforce what you’ve learned:&lt;/p&gt;
&lt;h3&gt;🔹 Example 1: &lt;code&gt;c6g.large&lt;/code&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;c&lt;/code&gt; → Compute optimized&lt;/li&gt;
&lt;li&gt;&lt;code&gt;6&lt;/code&gt; → 6th generation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;g&lt;/code&gt; → AWS Graviton (ARM-based processor)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;large&lt;/code&gt; → Medium-sized (2 vCPUs, ~4 GB RAM)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Great for compute-heavy applications running on ARM, like containerized services or microservices at scale.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;🔹 Example 2: &lt;code&gt;r5d.4xlarge&lt;/code&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;r&lt;/code&gt; → Memory optimized&lt;/li&gt;
&lt;li&gt;&lt;code&gt;5&lt;/code&gt; → 5th generation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;d&lt;/code&gt; → Includes local NVMe SSD instance store&lt;/li&gt;
&lt;li&gt;&lt;code&gt;4xlarge&lt;/code&gt; → 16 vCPUs and 128 GB RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Ideal for high-throughput, in-memory databases or data processing that benefits from fast local storage.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;🔹 Example 3: &lt;code&gt;m7a.xlarge&lt;/code&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;m&lt;/code&gt; → General purpose&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7&lt;/code&gt; → 7th generation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;a&lt;/code&gt; → AMD EPYC processor&lt;/li&gt;
&lt;li&gt;&lt;code&gt;xlarge&lt;/code&gt; → 4 vCPUs, 16 GB RAM&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt; Balanced workloads where cost-effectiveness is important, such as web applications or business logic layers.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Understanding how to read these names makes it easier to compare instance types, choose the best fit for your application, and avoid over-provisioning. You’ll save money, optimize performance, and build with more confidence on AWS.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | What is Data Engineering?</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-01/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-01/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Data engineering sits at the heart of modern data-driven organizations. While data science often grabs headlines with predictive models and AI, it&apos;s the data engineer who builds and maintains the infrastructure that makes all of that possible. In this first post of our series, we’ll explore what data engineering is, why it matters, and how it fits into the broader data ecosystem.&lt;/p&gt;
&lt;h2&gt;The Role of the Data Engineer&lt;/h2&gt;
&lt;p&gt;Think of a data engineer as the architect and builder of the data highways. These professionals design, construct, and maintain systems that move, transform, and store data efficiently. Their job is to ensure that data flows from various sources into data warehouses or lakes where it can be used reliably for analysis, reporting, and machine learning.&lt;/p&gt;
&lt;p&gt;In a practical sense, this means working with pipelines that connect everything from transactional databases and API feeds to large-scale storage systems. Data engineers work closely with data analysts, scientists, and platform teams to ensure the data is clean, consistent, and available when needed.&lt;/p&gt;
&lt;h2&gt;From Raw to Refined: The Journey of Data&lt;/h2&gt;
&lt;p&gt;Raw data is rarely useful as-is. It often arrives incomplete, messy, or inconsistently formatted. Data engineers are responsible for shepherding this raw material through a series of processing stages to prepare it for consumption.&lt;/p&gt;
&lt;p&gt;This involves tasks like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data ingestion (bringing data in from various sources)&lt;/li&gt;
&lt;li&gt;Data transformation (cleaning, enriching, and reshaping the data)&lt;/li&gt;
&lt;li&gt;Data storage (choosing optimal formats and storage solutions)&lt;/li&gt;
&lt;li&gt;Data delivery (ensuring end users can access data quickly and easily)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At each stage, considerations around scalability, performance, security, and governance come into play.&lt;/p&gt;
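&lt;p&gt;The four stages above can be sketched as a minimal pipeline. This is an illustrative sketch only: the records, field names, and the dict standing in for a warehouse are all invented for the example.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A minimal sketch of the four pipeline stages: ingest, transform,
# store, deliver. Records and field names are invented for illustration.

def ingest():
    # Bring data in from a source (hard-coded sample records here).
    return [
        {"id": 1, "amount": "19.99", "region": "east"},
        {"id": 2, "amount": "5.00", "region": "west"},
    ]

def transform(records):
    # Clean and reshape: cast the amount field from string to float.
    return [dict(r, amount=float(r["amount"])) for r in records]

def store(records, warehouse):
    # Persist into a storage layer (a dict standing in for a table).
    warehouse["sales"] = records

def deliver(warehouse):
    # Expose a simple aggregate for end users.
    return round(sum(r["amount"] for r in warehouse["sales"]), 2)

warehouse = {}
store(transform(ingest()), warehouse)
print(deliver(warehouse))  # 24.99
&lt;/code&gt;&lt;/pre&gt;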
&lt;h2&gt;Data Engineering vs Data Science&lt;/h2&gt;
&lt;p&gt;The roles of data engineers and data scientists are often confused. While their work is often complementary, their responsibilities are distinct.&lt;/p&gt;
&lt;p&gt;A data scientist focuses on analyzing data and building predictive models. Their tools often include Python, R, and statistical frameworks. On the other hand, data engineers build the systems that make the data usable in the first place. They are often more focused on infrastructure, system design, and optimization.&lt;/p&gt;
&lt;p&gt;In short: the data scientist asks questions; the data engineer ensures the data is ready to answer them.&lt;/p&gt;
&lt;h2&gt;A Brief History of the Data Stack&lt;/h2&gt;
&lt;p&gt;The evolution of data engineering can be seen in how the data stack has changed over time.&lt;/p&gt;
&lt;p&gt;In traditional environments, organizations relied heavily on ETL tools to move data from relational databases into on-premise warehouses. These systems were tightly controlled but not particularly flexible or scalable.&lt;/p&gt;
&lt;p&gt;With the rise of big data, open-source tools like Hadoop and Spark introduced new ways to process data at scale. More recently, cloud-native services and modern orchestration frameworks have enabled even more agility and scalability in data workflows.&lt;/p&gt;
&lt;p&gt;This evolution has led to concepts like the &lt;strong&gt;modern data stack&lt;/strong&gt; and &lt;strong&gt;data lakehouse&lt;/strong&gt;—topics we’ll cover later in this series.&lt;/p&gt;
&lt;h2&gt;Why It Matters&lt;/h2&gt;
&lt;p&gt;Every modern organization depends on data. But without a solid foundation, data becomes a liability rather than an asset. Poorly managed data can lead to flawed insights, compliance issues, and lost opportunities.&lt;/p&gt;
&lt;p&gt;Good data engineering practices ensure that data is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Accurate and timely&lt;/li&gt;
&lt;li&gt;Secure and compliant&lt;/li&gt;
&lt;li&gt;Scalable and performant&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a world where data volumes and velocity are only increasing, the importance of data engineering will only continue to grow.&lt;/p&gt;
&lt;h2&gt;What’s Next&lt;/h2&gt;
&lt;p&gt;Now that we’ve outlined the role and importance of data engineering, the next step is to explore how data gets into a system in the first place. In the next post, we’ll dig into data sources and the ingestion process—how data flows from the outside world into your ecosystem.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Understanding Data Sources and Ingestion</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-02/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-02/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Before we can analyze, model, or visualize data, we first need to get it into our systems. This step—often taken for granted—is known as data ingestion. It’s the bridge between the outside world and the internal data infrastructure, and it plays a critical role in how data is shaped from day one.&lt;/p&gt;
&lt;p&gt;In this post, we’ll break down the types of data sources you’ll encounter, the ingestion strategies available, and what trade-offs to consider when designing ingestion workflows.&lt;/p&gt;
&lt;h2&gt;What Are Data Sources?&lt;/h2&gt;
&lt;p&gt;At its core, a data source is any origin point from which data can be extracted. These sources vary widely in structure, velocity, and complexity.&lt;/p&gt;
&lt;p&gt;Relational databases like MySQL or PostgreSQL are common sources in transactional systems. They tend to produce highly structured, row-based data and are often central to business operations such as order processing or customer management.&lt;/p&gt;
&lt;p&gt;APIs are another rich source of data, especially in modern SaaS environments. From financial data to social media feeds, APIs expose endpoints where structured (often JSON-formatted) data can be requested in real-time or on a schedule.&lt;/p&gt;
&lt;p&gt;Then there are flat files—CSV, JSON, XML—often used in data exports, logs, and external data sharing. While simple, they can carry critical context or fill gaps that structured sources miss.&lt;/p&gt;
&lt;p&gt;Sensor data, clickstreams, mobile apps, third-party tools, and message queues all add to the landscape, each bringing its own cadence and complexity.&lt;/p&gt;
&lt;h2&gt;Ingestion Strategies: Batch vs Streaming&lt;/h2&gt;
&lt;p&gt;Once you identify your sources, the next question becomes: &lt;strong&gt;how&lt;/strong&gt; will you ingest the data?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Batch ingestion&lt;/strong&gt; involves collecting data at intervals and processing it in chunks. This could be once a day, every hour, or even every minute. It&apos;s suitable for systems that don&apos;t require real-time updates and where data can afford to be a little stale, such as nightly financial reports or end-of-day sales data.&lt;/p&gt;
&lt;p&gt;Batch processes tend to be simpler and easier to maintain. They can rely on traditional extract-transform-load (ETL) workflows and are often orchestrated using tools like Apache Airflow or simple cron jobs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Streaming ingestion&lt;/strong&gt;, on the other hand, handles data in motion. As new records are created—say, a customer clicks a link or a sensor detects a temperature change—they’re ingested immediately. This method is crucial for use cases that require low-latency or real-time processing, such as fraud detection or live recommendation engines.&lt;/p&gt;
&lt;p&gt;Apache Kafka is a popular tool for enabling streaming pipelines. It allows systems to publish and subscribe to streams of records, ensuring data flows continuously with minimal delay.&lt;/p&gt;
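&lt;p&gt;The publish/subscribe model described above can be illustrated with a small in-memory stand-in. This sketch uses plain Python structures in place of a real broker; it shows the shape of the interaction, not Kafka&apos;s actual client API.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# An in-memory stand-in for a Kafka-style broker: producers publish
# records to named topics, and each consumer reads independently by
# tracking its own offset. Real Kafka adds durability, partitioning,
# and delivery guarantees that this sketch omits.
from collections import defaultdict

class MiniBroker:
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, record):
        self.topics[topic].append(record)

    def poll(self, topic, offset):
        # Return every record at or past the consumer's offset.
        return self.topics[topic][offset:]

broker = MiniBroker()
broker.publish("clicks", {"user": "a", "page": "/home"})
broker.publish("clicks", {"user": "b", "page": "/pricing"})

# A consumer reads the topic and advances its own offset.
offset_1 = 0
batch_1 = broker.poll("clicks", offset_1)
offset_1 = offset_1 + len(batch_1)

print(len(batch_1), offset_1)  # 2 2
&lt;/code&gt;&lt;/pre&gt;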
&lt;h2&gt;Structured, Semi-Structured, and Unstructured Data&lt;/h2&gt;
&lt;p&gt;Understanding the shape of your data also influences how you ingest it.&lt;/p&gt;
&lt;p&gt;Structured data is highly organized and fits neatly into tables. Think SQL databases or CSV files. Ingestion here often involves direct connections via JDBC drivers, SQL queries, or file uploads.&lt;/p&gt;
&lt;p&gt;Semi-structured data, like JSON or XML, has an internal structure but doesn’t conform strictly to relational models. Ingesting this data may require parsing logic and schema inference before it&apos;s usable downstream.&lt;/p&gt;
&lt;p&gt;Unstructured data includes images, videos, PDFs, and raw text. These formats typically require specialized tools and more complex handling, often involving metadata extraction or integration with machine learning models for classification or tagging.&lt;/p&gt;
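&lt;p&gt;Schema inference for semi-structured data can be sketched with the standard library alone. The JSON payload below is invented for illustration; real inference tools reconcile conflicting types and nested structures far more carefully.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json

# Infer a flat schema (field name to Python type name) from JSON
# events. This sketch simply keeps the last type seen per field.
raw = '[{"id": 1, "tags": ["new"]}, {"id": 2, "price": 9.5}]'

schema = {}
for record in json.loads(raw):
    for field, value in record.items():
        schema[field] = type(value).__name__

print(schema)  # {'id': 'int', 'tags': 'list', 'price': 'float'}
&lt;/code&gt;&lt;/pre&gt;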
&lt;h2&gt;Considerations in Designing Ingestion Pipelines&lt;/h2&gt;
&lt;p&gt;Data ingestion isn’t just about moving bytes—it’s about doing so reliably, efficiently, and with the future in mind.&lt;/p&gt;
&lt;p&gt;Latency requirements play a major role. Does the business need data as it happens, or is yesterday’s data good enough? That determines your choice between batch and streaming.&lt;/p&gt;
&lt;p&gt;Scalability is another concern. What works for 10,000 records a day might break under 10 million. Tools like Kafka and cloud-native services such as AWS Kinesis or Google Pub/Sub help handle high throughput without compromising performance.&lt;/p&gt;
&lt;p&gt;Error handling is essential. What happens if a source API goes down? What if a file arrives with missing fields? Designing retry logic, alerts, and fallback mechanisms helps ensure ingestion pipelines are robust.&lt;/p&gt;
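&lt;p&gt;The retry logic mentioned above is often implemented with exponential backoff. In this sketch the fetch function is a hypothetical stand-in that fails twice before succeeding, so the retry path is actually exercised.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import time

# Retry a flaky ingestion call with exponential backoff. fetch is a
# stand-in for a real source call; it fails twice, then succeeds.
attempts = {"n": 0}

def fetch():
    attempts["n"] = attempts["n"] + 1
    if attempts["n"] == 3:
        return {"rows": 100}
    raise ConnectionError("source unavailable")

def with_retries(fn, max_tries=5, base_delay=0.01):
    for attempt in range(max_tries):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_tries - 1:
                raise
            # Wait base_delay doubled on each failed attempt.
            time.sleep(base_delay * 2 ** attempt)

result = with_retries(fetch)
print(result, attempts["n"])  # {'rows': 100} 3
&lt;/code&gt;&lt;/pre&gt;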
&lt;p&gt;Finally, schema evolution can’t be overlooked. Data changes over time—columns get added, data types shift. Your ingestion pipeline must be flexible enough to adapt without breaking downstream systems.&lt;/p&gt;
&lt;h2&gt;Looking Ahead&lt;/h2&gt;
&lt;p&gt;Getting data into the system is just the beginning. Once it’s ingested, it often needs to be transformed to fit the analytical or business context.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore the concepts of ETL and ELT—two core paradigms for moving and transforming data—and look at how they differ in practice and purpose.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | ETL vs ELT – Understanding Data Pipelines</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-03/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-03/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once data has been ingested into your system, the next step is to prepare it for actual use. This typically involves cleaning, transforming, and storing the data in a way that supports analysis, reporting, or further processing. This is where data pipelines come in, and at the center of pipeline design are two common strategies: ETL and ELT.&lt;/p&gt;
&lt;p&gt;Although they may look similar at first glance, ETL and ELT represent fundamentally different approaches to handling data transformations, and each has its strengths and trade-offs depending on the context in which it’s used.&lt;/p&gt;
&lt;h2&gt;What is ETL?&lt;/h2&gt;
&lt;p&gt;ETL stands for Extract, Transform, Load. It’s the traditional method used in many enterprise environments for years. The process starts by &lt;strong&gt;extracting&lt;/strong&gt; data from source systems such as databases, APIs, or flat files. This raw data is then &lt;strong&gt;transformed&lt;/strong&gt;—typically on a separate processing server or ETL engine—before it is finally &lt;strong&gt;loaded&lt;/strong&gt; into a data warehouse or other destination system.&lt;/p&gt;
&lt;p&gt;For example, imagine a retail company collecting daily sales data from multiple stores. In an ETL workflow, the system might extract those records at the end of the day, standardize formats, filter out corrupted rows, aggregate sales by region, and then load the clean, transformed dataset into a reporting warehouse like Snowflake or Redshift.&lt;/p&gt;
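&lt;p&gt;That retail workflow can be sketched end to end. The rows, store names, and regions are invented for the example, and the load step writes into a dict standing in for a warehouse table.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch of the ETL flow described above: extract raw sales rows,
# transform (drop corrupted rows, aggregate by region), then load.
raw_rows = [
    {"store": "s1", "region": "east", "amount": "120.50"},
    {"store": "s2", "region": "east", "amount": "80.00"},
    {"store": "s3", "region": "west", "amount": None},   # corrupted row
    {"store": "s4", "region": "west", "amount": "45.25"},
]

def transform(rows):
    clean = [r for r in rows if r["amount"] is not None]
    totals = {}
    for r in clean:
        region = r["region"]
        totals[region] = totals.get(region, 0.0) + float(r["amount"])
    return totals

def load(totals, warehouse):
    warehouse["sales_by_region"] = totals

warehouse = {}
load(transform(raw_rows), warehouse)
print(warehouse["sales_by_region"])  # {'east': 200.5, 'west': 45.25}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the corrupted row is filtered out &lt;em&gt;before&lt;/em&gt; loading, which is the defining trait of ETL.&lt;/p&gt;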
&lt;p&gt;One of the key advantages of ETL is that it allows you to load only clean, verified data into your warehouse. That often means smaller storage footprints and potentially better performance on downstream queries.&lt;/p&gt;
&lt;p&gt;However, this approach also has limitations. Because the transformation happens before loading, you must decide upfront how the data should be shaped. If business rules change or additional use cases emerge, you may need to go back and reprocess the data.&lt;/p&gt;
&lt;h2&gt;What is ELT?&lt;/h2&gt;
&lt;p&gt;ELT reverses the order of the last two steps: Extract, Load, Transform. In this model, raw data is extracted from the source and immediately &lt;strong&gt;loaded&lt;/strong&gt; into the target system—usually a cloud data warehouse that can scale horizontally. Once the data is in place, transformations are performed &lt;strong&gt;within&lt;/strong&gt; the warehouse using SQL or warehouse-native tools.&lt;/p&gt;
&lt;p&gt;This approach takes advantage of the high compute power and scalability of modern cloud platforms. Instead of bottlenecking on a dedicated ETL server, the warehouse can handle complex joins, aggregations, and transformations at scale.&lt;/p&gt;
&lt;p&gt;Let’s go back to the retail example. With ELT, all sales data is loaded as-is into the warehouse. Analysts or data engineers can then write transformation scripts to reshape the data for various use cases—trend analysis, regional comparisons, or fraud detection—all without having to re-ingest or reload the source data.&lt;/p&gt;
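&lt;p&gt;The same retail example can be sketched in ELT style, using sqlite3 from the Python standard library as a stand-in for a cloud warehouse: raw rows are loaded first, and the transformation happens afterward in SQL.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlite3

# ELT sketch: load raw rows as-is, then transform inside the
# "warehouse" with SQL. sqlite3 stands in for a cloud warehouse here.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_sales (store TEXT, region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("s1", "east", 120.5), ("s2", "east", 80.0), ("s4", "west", 45.25)],
)

# The transform step runs inside the warehouse, after loading.
rows = con.execute(
    "SELECT region, SUM(amount) FROM raw_sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 200.5), ('west', 45.25)]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because the raw table remains in place, a new use case just means writing another SQL query, not re-ingesting the source data.&lt;/p&gt;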
&lt;p&gt;ELT offers more flexibility for evolving requirements, supports broader self-service analytics, and enables faster time-to-insight. The trade-off is that it requires strong governance and monitoring. Because raw data is stored in the warehouse, the risk of exposing inconsistent or unclean data is higher if transformation logic isn’t managed carefully.&lt;/p&gt;
&lt;h2&gt;Choosing Between ETL and ELT&lt;/h2&gt;
&lt;p&gt;The decision to use ETL or ELT often depends on your stack, performance needs, and organizational practices.&lt;/p&gt;
&lt;p&gt;ETL still makes sense in environments with strict data governance, limited warehouse compute resources, or scenarios where only clean data should be retained. It’s also common in legacy systems and on-premise architectures.&lt;/p&gt;
&lt;p&gt;ELT shines in modern cloud-native environments where scalability and agility are top priorities. It’s often used with platforms like Snowflake, BigQuery, or Redshift, which are built to handle large volumes of raw data and complex SQL-based transformations efficiently.&lt;/p&gt;
&lt;p&gt;In practice, many organizations use a hybrid approach. Critical data may go through an ETL flow, while experimental or rapidly evolving datasets follow an ELT pattern.&lt;/p&gt;
&lt;h2&gt;The Bigger Picture&lt;/h2&gt;
&lt;p&gt;ETL and ELT are just different roads to the same destination: getting data ready for use. As the modern data stack evolves, so do the tools and best practices for managing these flows. Whether you choose one approach or blend both, what matters most is building pipelines that are reliable, maintainable, and aligned with your organization’s goals.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll focus on batch processing—the traditional foundation of many ETL workflows—and discuss how data engineers design, schedule, and optimize these processes for scale.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Batch Processing Fundamentals</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-04/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-04/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For many data engineering tasks, real-time insights aren’t necessary. In fact, a large portion of the data processed across organizations happens in scheduled intervals—daily sales reports, weekly data refreshes, monthly billing cycles. This is where batch processing comes in, and despite the growing popularity of streaming, batch remains the backbone of many data-driven workflows.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what batch processing is, how it works under the hood, and why it’s still a critical technique in the data engineer’s toolbox.&lt;/p&gt;
&lt;h2&gt;What is Batch Processing?&lt;/h2&gt;
&lt;p&gt;Batch processing is the execution of data workflows on a predefined schedule or in response to specific triggers. Instead of processing data as it arrives, the system collects a set of data over a period of time, then processes that set as a single unit.&lt;/p&gt;
&lt;p&gt;This approach is particularly useful when data arrives in large quantities but doesn&apos;t need to be acted on immediately, such as processing daily transactions from a point-of-sale system or generating overnight reports for executive dashboards.&lt;/p&gt;
&lt;p&gt;Batch jobs are often triggered at set times—say, every night at 2 a.m.—and are designed to run until completion, often without user interaction. They can run for seconds, minutes, or even hours depending on the volume of data and complexity of the transformations.&lt;/p&gt;
&lt;h2&gt;Under the Hood: How Batch Jobs Work&lt;/h2&gt;
&lt;p&gt;The anatomy of a batch job usually includes several stages. First, the job identifies the data it needs to process. This might involve querying a database for all records created in the last 24 hours or scanning a specific folder in object storage for new files.&lt;/p&gt;
&lt;p&gt;Next comes the transformation phase. This is where data is cleaned, filtered, joined with other datasets, and reshaped to fit its target structure. This phase can include tasks like date formatting, currency conversion, null value imputation, or the calculation of derived fields.&lt;/p&gt;
&lt;p&gt;Finally, the job writes the transformed data to its destination—often a data warehouse, data lake, or downstream reporting system.&lt;/p&gt;
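&lt;p&gt;These three stages can be sketched in a few lines. The timestamps and records below are invented for illustration; the point is the shape of the job, not the specifics.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datetime import datetime, timedelta

# Sketch of a nightly batch job: pick up yesterday's records,
# transform them, and write the result downstream.
now = datetime(2025, 5, 2, 2, 0)             # job fires at 2 a.m.
target_day = (now - timedelta(days=1)).date()

records = [
    {"ts": datetime(2025, 5, 1, 9, 30), "value": 10},
    {"ts": datetime(2025, 5, 1, 17, 5), "value": 5},
    {"ts": datetime(2025, 4, 29, 9, 0), "value": 99},  # outside the window
]

# 1. Identify the slice of data this run is responsible for.
batch = [r for r in records if r["ts"].date() == target_day]

# 2. Transform: derive a daily total.
daily_total = sum(r["value"] for r in batch)

# 3. Write to the destination (a dict standing in for a warehouse).
warehouse = {str(target_day): daily_total}
print(warehouse)  # {'2025-05-01': 15}
&lt;/code&gt;&lt;/pre&gt;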
&lt;p&gt;To manage all of this, engineers rely on workflow orchestration tools. These tools provide scheduling, error handling, and logging capabilities to ensure that jobs run in the right order and can recover gracefully from failure.&lt;/p&gt;
&lt;h2&gt;Tools and Technologies&lt;/h2&gt;
&lt;p&gt;Several tools have become staples in batch-oriented workflows. Apache Airflow is one of the most widely used. It allows engineers to define complex workflows as Directed Acyclic Graphs (DAGs), where each node represents a task and dependencies are explicitly declared.&lt;/p&gt;
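&lt;p&gt;The DAG idea can be illustrated without Airflow itself: each task declares its dependencies, and the runner executes a task only once everything it depends on has finished. This toy runner is not Airflow&apos;s API, just the underlying concept.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# A toy DAG runner: each task names the tasks it depends on, and a
# task runs only after all of its dependencies have completed.
dag = {
    "extract": [],
    "transform": ["extract"],
    "validate": ["extract"],
    "load": ["transform", "validate"],
}

def run(dag):
    done, order = set(), []
    while len(done) != len(dag):
        for task, deps in dag.items():
            if task not in done and all(d in done for d in deps):
                order.append(task)   # a real runner would execute here
                done.add(task)
    return order

print(run(dag))  # ['extract', 'transform', 'validate', 'load']
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice that &lt;code&gt;load&lt;/code&gt; cannot run until both &lt;code&gt;transform&lt;/code&gt; and &lt;code&gt;validate&lt;/code&gt; finish, which is exactly the kind of dependency an orchestrator enforces.&lt;/p&gt;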
&lt;p&gt;Other tools like Luigi and Oozie offer similar functionality, though they are less commonly used in newer stacks. Cloud-native platforms such as AWS Glue and Google Cloud Composer provide managed orchestration services that integrate tightly with the respective cloud ecosystems.&lt;/p&gt;
&lt;p&gt;In addition to orchestration, batch jobs often depend on distributed processing engines like Apache Spark. Spark allows massive datasets to be processed in parallel across a cluster of machines, reducing processing times dramatically compared to traditional single-node tools.&lt;/p&gt;
&lt;h2&gt;Strengths and Limitations&lt;/h2&gt;
&lt;p&gt;One of the biggest advantages of batch processing is its simplicity. Since data is processed in chunks, you can apply robust validation and error-handling routines before moving data downstream. It&apos;s also easier to track and audit, which is especially important for regulated industries.&lt;/p&gt;
&lt;p&gt;Batch jobs are also cost-efficient when working with large volumes of data that don’t require immediate availability. Processing once per day means you can spin up compute resources only when needed, rather than keeping systems running continuously.&lt;/p&gt;
&lt;p&gt;However, the main limitation is latency. If something happens in your business—say, a spike in fraudulent transactions—you won’t know about it until after the next batch job runs. For use cases that require faster insights or real-time responsiveness, batch processing isn’t sufficient.&lt;/p&gt;
&lt;p&gt;There’s also the issue of windowing and completeness. Since batch jobs process data in slices, late-arriving records can fall outside the intended window unless carefully managed. This adds complexity to pipeline design and requires thoughtful handling of time-based logic.&lt;/p&gt;
&lt;h2&gt;Where Batch Still Shines&lt;/h2&gt;
&lt;p&gt;Despite its limitations, batch processing remains ideal for a wide range of use cases. Financial reconciliations, data archival, slow-changing dimensional data updates, and long-running analytics workloads are just a few examples where batch continues to dominate.&lt;/p&gt;
&lt;p&gt;As a data engineer, understanding how to design efficient and reliable batch workflows is an essential skill, especially in environments where consistency and auditability are critical.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore the counterpart to batch: streaming data processing. We’ll look at what it means to process data in real time, how it differs from batch, and what patterns and tools make it work.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Streaming Data Fundamentals</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-05/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-05/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In contrast to batch processing, where data is collected and processed in chunks, streaming data processing deals with data in motion. Instead of waiting for data to accumulate before running transformations, streaming pipelines ingest and process each piece of data as it arrives. This model enables organizations to respond to events in real time, a capability that’s becoming increasingly essential in domains like finance, security, and customer experience.&lt;/p&gt;
&lt;p&gt;In this post, we’ll unpack the core ideas behind streaming, how it works in practice, and the challenges it presents compared to traditional batch systems.&lt;/p&gt;
&lt;h2&gt;What is Streaming Data?&lt;/h2&gt;
&lt;p&gt;Streaming data refers to data that is continuously generated by various sources—website clicks, IoT sensors, user interactions, system logs—and transmitted in real time or near-real time. This data typically arrives in small payloads, often as individual events, and needs to be processed with minimal delay.&lt;/p&gt;
&lt;p&gt;The goal of a streaming pipeline is to capture this data as it’s generated, perform necessary transformations, and deliver it to its destination with as little latency as possible.&lt;/p&gt;
&lt;p&gt;A simple example would be a ride-sharing app that tracks vehicle locations in real time. As each car moves, GPS data is streamed to a backend system that updates the user interface and helps dispatch rides based on current conditions.&lt;/p&gt;
&lt;h2&gt;How Streaming Systems Work&lt;/h2&gt;
&lt;p&gt;Unlike batch jobs that execute on a schedule, streaming systems run continuously. They consume data from a source, process it incrementally, and push it to a sink—all without waiting for a dataset to be complete.&lt;/p&gt;
&lt;p&gt;At the heart of a streaming system is a message broker or event queue, which acts as a buffer between data producers and consumers. Apache Kafka is a popular choice here. It allows producers to publish events to topics, and consumers to read from those topics independently, often with strong guarantees around ordering and durability.&lt;/p&gt;
&lt;p&gt;Once events are ingested, a processing engine takes over. Tools like Apache Flink, Spark Structured Streaming, and Apache Beam allow developers to apply transformations on a per-record basis or over time-based windows. This is where operations like filtering, aggregating, joining, and enriching occur.&lt;/p&gt;
&lt;p&gt;These transformations must be designed to handle data that may arrive late, out of order, or in bursts. As such, streaming systems often implement complex logic to manage time—distinguishing between event time (when the event occurred) and processing time (when it was received)—to ensure results are accurate.&lt;/p&gt;
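&lt;p&gt;Here’s a simplified pure-Python illustration of event time versus processing time. Real engines like Flink implement this with watermarks and much more nuance; the window size and lateness values below are just example parameters.&lt;/p&gt;

```python
from collections import defaultdict

def tumbling_counts(events, window_s=60, allowed_lateness_s=120):
    """Count events per event-time window, dropping anything that arrives
    after the watermark has passed its window.
    Each event is (event_time, processing_time) in epoch seconds."""
    counts = defaultdict(int)
    watermark = 0   # highest event time seen, minus the allowed lateness
    late = []
    for event_time, _processing_time in events:
        watermark = max(watermark, event_time - allowed_lateness_s)
        if watermark > event_time:
            late.append(event_time)   # too late: its window already closed
            continue
        window_start = event_time - event_time % window_s
        counts[window_start] += 1
    return dict(counts), late

# The last event *occurred* at t=40 but *arrived* long after t=500 was seen,
# so the watermark has moved past its window and it is treated as late.
events = [(0, 1), (30, 31), (70, 72), (500, 501), (40, 700)]
counts, late = tumbling_counts(events)
print(counts)  # {0: 2, 60: 1, 480: 1}
print(late)    # [40]
```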
&lt;h2&gt;Use Cases and Business Impact&lt;/h2&gt;
&lt;p&gt;The appeal of streaming pipelines lies in their ability to power real-time applications. Fraud detection systems can flag suspicious transactions as they happen. E-commerce platforms can recommend products based on live browsing behavior. Logistics companies can monitor fleet activity and adjust routes on the fly.&lt;/p&gt;
&lt;p&gt;In operational analytics, dashboards fed by streaming data provide up-to-the-minute visibility, allowing teams to make informed decisions in response to changing conditions.&lt;/p&gt;
&lt;p&gt;Streaming is also a foundational component of event-driven architectures. When services communicate via events, streaming systems act as the glue that ties the application together, enabling asynchronous, decoupled interactions.&lt;/p&gt;
&lt;h2&gt;Challenges in Streaming Systems&lt;/h2&gt;
&lt;p&gt;Despite its power, streaming introduces complexity that shouldn’t be underestimated. Handling late or out-of-order data is a major concern. If an event shows up ten minutes after it was supposed to be processed, the system must be smart enough to either incorporate it correctly or account for the gap.&lt;/p&gt;
&lt;p&gt;State management is another critical factor. When a pipeline needs to remember information across multiple events—like keeping a running total or maintaining a session—it must manage that state reliably, often across distributed systems.&lt;/p&gt;
&lt;p&gt;There’s also the issue of fault tolerance. Streaming systems must be able to recover from crashes or network issues without duplicating results or losing data. This requires sophisticated checkpointing, replay, and exactly-once processing semantics, which tools like Flink and Beam are designed to provide.&lt;/p&gt;
&lt;p&gt;Finally, testing and debugging streaming pipelines can be more difficult than batch jobs. Because they run continuously and deal with time-sensitive data, reproducing issues often requires specialized tooling or replay mechanisms.&lt;/p&gt;
&lt;h2&gt;When to Choose Streaming&lt;/h2&gt;
&lt;p&gt;Streaming makes sense when low-latency data processing is essential to the business. This could mean operational decision-making, customer experience personalization, or complex event processing in a microservices architecture.&lt;/p&gt;
&lt;p&gt;It’s not always the right tool for the job, though. For workloads that don’t require immediate insights—or where simplicity and reliability matter more—batch processing remains the better choice.&lt;/p&gt;
&lt;p&gt;As data engineers, the key is to understand the trade-offs and choose the right pattern for each use case.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll shift gears and look at how data is modeled for analytics. Understanding the differences between OLTP and OLAP systems, as well as the pros and cons of different schema designs, is critical to building pipelines that serve real business needs.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Modeling Basics</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-06/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-06/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Behind every useful dashboard or analytics report lies a well-structured data model. Data modeling is the practice of shaping data into organized structures that are easy to query, analyze, and maintain. While it may sound abstract, modeling directly impacts how quickly and accurately data consumers can extract value from the information stored in your systems.&lt;/p&gt;
&lt;p&gt;In this post, we’ll look at the foundations of data modeling, the difference between OLTP and OLAP systems, and common schema designs that data engineers use to build efficient and scalable data platforms.&lt;/p&gt;
&lt;h2&gt;Why Data Modeling Matters&lt;/h2&gt;
&lt;p&gt;When data arrives from source systems, it’s often raw and optimized for transactions, not analysis. A transactional database might record every sale or click in granular detail, but that structure doesn’t translate well into aggregations like “monthly revenue by product category.”&lt;/p&gt;
&lt;p&gt;A data model reshapes that data to make it usable. Good models reduce complexity, improve performance, and minimize errors. Poor models, on the other hand, lead to slow queries, redundant data, and confusion about what numbers really mean.&lt;/p&gt;
&lt;p&gt;Modeling is both a technical and a collaborative process. It requires not just understanding how data is structured, but also how the business thinks about that data—what questions need answering, how metrics are defined, and what trade-offs are acceptable.&lt;/p&gt;
&lt;h2&gt;OLTP vs OLAP: Two Worlds, Two Purposes&lt;/h2&gt;
&lt;p&gt;Before diving into specific modeling techniques, it’s important to distinguish between the two main types of data systems: OLTP and OLAP.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OLTP (Online Transaction Processing)&lt;/strong&gt; systems are built for real-time operations. Think of point-of-sale systems, user authentication services, or banking apps. These systems are optimized for high-throughput reads and writes, handling thousands of small transactions per second. Their schemas are typically highly normalized to avoid data duplication and to keep updates fast and consistent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt; systems, on the other hand, are designed for analysis. These platforms support complex queries over large volumes of historical data. Performance here is about aggregating, filtering, and summarizing—not handling rapid transactions. Because of this, OLAP models often trade strict normalization for faster access to pre-joined or denormalized data.&lt;/p&gt;
&lt;p&gt;Understanding whether your system is OLTP or OLAP helps determine how you model your data. The techniques and trade-offs are different depending on the system’s purpose.&lt;/p&gt;
&lt;h2&gt;Normalization and Denormalization&lt;/h2&gt;
&lt;p&gt;In OLTP systems, normalization is the standard. This means structuring data so that each fact is stored in exactly one place. For example, instead of storing a customer’s name with every order record, you keep customer details in a separate table and reference them via a key.&lt;/p&gt;
&lt;p&gt;This approach minimizes redundancy, reduces storage, and simplifies updates. Change the customer’s name in one place, and every order reflects that change immediately.&lt;/p&gt;
&lt;p&gt;In analytical systems, this level of indirection becomes a performance bottleneck. Complex queries must join many tables together, which can slow things down significantly.&lt;/p&gt;
&lt;p&gt;That’s where &lt;strong&gt;denormalization&lt;/strong&gt; comes in. In OLAP models, it’s common to store data in a flattened format, with descriptive attributes repeated across rows. While this increases storage requirements, it significantly speeds up query performance and simplifies logic for analysts and BI tools.&lt;/p&gt;
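&lt;p&gt;A small SQLite sketch makes the trade-off tangible. The table and column names are invented for illustration: the normalized design updates in one place but joins on every read, while the denormalized design reads flat at the cost of repeated values.&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Normalized (OLTP-style): each fact lives in exactly one place.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
cur.execute("INSERT INTO customers VALUES (1, 'Ada')")
cur.executemany("INSERT INTO orders VALUES (?, 1, ?)", [(1, 9.5), (2, 20.0)])

# One UPDATE fixes the name everywhere it is referenced...
cur.execute("UPDATE customers SET name = 'Ada L.' WHERE id = 1")

# ...but analytical queries pay for a join on every read.
joined = cur.execute(
    "SELECT c.name, SUM(o.total) FROM orders o "
    "JOIN customers c ON c.id = o.customer_id GROUP BY c.name"
).fetchall()
print(joined)  # [('Ada L.', 29.5)]

# Denormalized (OLAP-style): repeat the attribute on every row, no join needed.
cur.execute("CREATE TABLE orders_flat (id INTEGER, customer_name TEXT, total REAL)")
cur.executemany("INSERT INTO orders_flat VALUES (?, 'Ada L.', ?)", [(1, 9.5), (2, 20.0)])
flat = cur.execute(
    "SELECT customer_name, SUM(total) FROM orders_flat GROUP BY customer_name"
).fetchall()
print(flat)  # [('Ada L.', 29.5)]
```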
&lt;h2&gt;Star and Snowflake Schemas&lt;/h2&gt;
&lt;p&gt;Two common modeling patterns in OLAP systems are the &lt;strong&gt;star schema&lt;/strong&gt; and the &lt;strong&gt;snowflake schema&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;star schema&lt;/strong&gt; organizes data around a central fact table. This table holds measurable events—like sales transactions—with keys that reference surrounding dimension tables, which contain descriptive attributes such as product names, customer demographics, or store locations.&lt;/p&gt;
&lt;p&gt;In a star schema, the dimension tables are typically denormalized. This makes queries straightforward and fast: one central join connects the fact table to all the attributes needed for analysis.&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;snowflake schema&lt;/strong&gt; takes this idea further by normalizing the dimension tables. Instead of a single product dimension table, for example, you might have separate tables for product, category, and supplier. This saves space and can improve maintainability, but at the cost of more complex joins.&lt;/p&gt;
&lt;p&gt;The choice between star and snowflake schemas depends on your performance needs, data volume, and how often attributes change.&lt;/p&gt;
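&lt;p&gt;The star pattern is easiest to see in a runnable sketch. The schema below is a made-up example, but it has the essential shape: a fact table of measurable events surrounded by dimension tables, each one join away.&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables: descriptive attributes, denormalized.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, region TEXT);
    -- Fact table: one row per measurable event, foreign keys plus measures.
    CREATE TABLE fact_sales  (product_id INTEGER, store_id INTEGER,
                              quantity INTEGER, revenue REAL);

    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'),
                                   (2, 'Gadget', 'Hardware');
    INSERT INTO dim_store   VALUES (10, 'East'), (20, 'West');
    INSERT INTO fact_sales  VALUES (1, 10, 3, 30.0), (2, 10, 1, 25.0),
                                   (1, 20, 2, 20.0);
""")

# Every dimension is a single hop from the fact table.
rows = con.execute("""
    SELECT p.category, s.region, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
    GROUP BY p.category, s.region
    ORDER BY s.region
""").fetchall()
print(rows)  # [('Hardware', 'East', 55.0), ('Hardware', 'West', 20.0)]
```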
&lt;h2&gt;Modeling for Flexibility and Growth&lt;/h2&gt;
&lt;p&gt;Good data models are designed with change in mind. New columns will be added, relationships will evolve, and new metrics will be needed. A rigid model can become a bottleneck, while a flexible one supports ongoing development.&lt;/p&gt;
&lt;p&gt;One best practice is to favor additive metrics when possible. These are measures you can safely sum across time or groups—like revenue or quantity sold. Additive metrics work better with aggregations and are easier to model consistently.&lt;/p&gt;
&lt;p&gt;It’s also important to consider slowly changing dimensions. For example, if a customer’s email address or a product’s price changes, do you want to reflect the latest value, or keep historical versions? Modeling for this kind of change requires thought about versioning and historical accuracy.&lt;/p&gt;
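&lt;p&gt;One common way to keep historical versions is the “Type 2” slowly changing dimension: rather than overwriting an attribute, you close the current row and append a new version. Here’s a minimal sketch of that logic (the field names are illustrative, and real implementations also carry surrogate keys and current-row flags):&lt;/p&gt;

```python
from datetime import date

def scd2_upsert(dim_rows, key, new_attrs, today):
    """Type 2 slowly changing dimension: close the current row and
    append a new version instead of overwriting history."""
    for row in dim_rows:
        if row["key"] == key and row["end"] is None:
            if all(row[k] == v for k, v in new_attrs.items()):
                return dim_rows          # nothing changed, keep current row
            row["end"] = today           # close the old version
    dim_rows.append({"key": key, **new_attrs, "start": today, "end": None})
    return dim_rows

dim = [{"key": "C1", "email": "a@x.com", "start": date(2024, 1, 1), "end": None}]
scd2_upsert(dim, "C1", {"email": "a@y.com"}, date(2025, 5, 2))
# dim now holds two versions: the old one closed, the new one current
print(dim)
```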
&lt;h2&gt;The Road Ahead&lt;/h2&gt;
&lt;p&gt;Data modeling sits at the intersection of technical design and business logic. It’s not just about tables and keys—it’s about making data intuitive and useful for the people who depend on it.&lt;/p&gt;
&lt;p&gt;As data engineers, our role is to create models that strike a balance between performance, maintainability, and expressiveness. Doing this well requires not just technical skill, but ongoing communication with analysts, stakeholders, and subject matter experts.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll take a closer look at data warehousing—how these models are stored, queried, and optimized in systems built for analytics at scale.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Warehousing Fundamentals</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-07/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-07/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Data warehouses serve as the analytical backbone for many organizations. They are purpose-built systems that store structured data optimized for fast querying and aggregation. While data lakes handle raw, unstructured data at scale, data warehouses focus on delivering clean, organized datasets to analysts, BI tools, and decision-makers.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll break down what makes a data warehouse different from other storage systems, how it&apos;s architected, and what practices ensure it performs efficiently as your data and business grow.&lt;/p&gt;
&lt;h2&gt;The Role of a Data Warehouse&lt;/h2&gt;
&lt;p&gt;At a high level, a data warehouse collects data from multiple operational systems and stores it in a way that makes analysis easy and consistent. Instead of digging through individual source systems—like sales platforms, CRM tools, or web analytics—users can query a centralized warehouse that’s been curated and modeled for insight.&lt;/p&gt;
&lt;p&gt;This consolidation allows organizations to apply consistent definitions for metrics, reduce the risk of conflicting data interpretations, and dramatically improve performance for analytical workloads.&lt;/p&gt;
&lt;p&gt;Where a transactional database is designed to handle lots of small, rapid reads and writes, a data warehouse is designed to scan large volumes of data efficiently. These systems optimize for queries like “What were our top five products last quarter?” or “How did regional sales trend year-over-year?”&lt;/p&gt;
&lt;h2&gt;Architecture and Components&lt;/h2&gt;
&lt;p&gt;Data warehouses differ in how tightly they couple compute and storage. In legacy on-premise systems like Teradata or Oracle, the two were tightly coupled and had to be scaled together. In modern cloud-native systems like Snowflake or BigQuery, storage and compute are decoupled, which allows each to scale independently and makes scaling far more flexible.&lt;/p&gt;
&lt;p&gt;The core of a warehouse is the schema—the logical structure defining how data is organized into tables, relationships, and hierarchies. As discussed in the previous post, these tables often follow star or snowflake patterns, with fact tables surrounded by dimension tables that provide context.&lt;/p&gt;
&lt;p&gt;One of the key components of a warehouse is its query engine. This engine is built to efficiently execute SQL queries, taking advantage of indexing, partitioning, and columnar storage formats to return results quickly even when scanning billions of rows.&lt;/p&gt;
&lt;p&gt;Data warehouses also maintain metadata—information about data types, table relationships, and data lineage—that helps users navigate and trust the system. Many modern platforms also offer built-in tools for access control, versioning, and data classification to support governance.&lt;/p&gt;
&lt;h2&gt;Performance Optimization: Partitioning and Clustering&lt;/h2&gt;
&lt;p&gt;As warehouses scale, query performance becomes a key concern. It’s not enough to simply store the data—you also need to retrieve it quickly and cost-effectively.&lt;/p&gt;
&lt;p&gt;One common optimization is &lt;strong&gt;partitioning&lt;/strong&gt;, which breaks up large tables into smaller, manageable chunks based on a field like date, region, or product category. When a query specifies a filter on that field, the engine can skip over partitions that aren’t relevant, significantly reducing scan times.&lt;/p&gt;
&lt;p&gt;Another technique is &lt;strong&gt;clustering&lt;/strong&gt;, which organizes the physical layout of data based on a set of fields that are commonly filtered or joined on. For example, clustering sales records by customer ID can improve performance for queries that retrieve purchase history.&lt;/p&gt;
&lt;p&gt;Columnar storage is also key to performance. Unlike row-based storage, which keeps all fields of a record together, columnar formats like those used in BigQuery or Redshift store each column separately. This allows the engine to scan only the columns needed for a query, reducing I/O and speeding up execution.&lt;/p&gt;
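&lt;p&gt;Partition pruning is easy to simulate. In this toy sketch (the month-keyed layout is illustrative, not any vendor’s format), a date-range query only “opens” the partitions whose month can overlap the filter and skips the rest entirely:&lt;/p&gt;

```python
# Toy table partitioned by month: each key is one partition's worth of rows.
partitions = {
    "2025-01": [{"day": "2025-01-03", "amount": 10.0}],
    "2025-02": [{"day": "2025-02-11", "amount": 40.0}],
    "2025-03": [{"day": "2025-03-25", "amount": 5.0}],
}

def total_between(start, end):
    """Prune first: a partition is scanned only if its month overlaps the range."""
    relevant = [m for m in partitions if m >= start[:7] and end[:7] >= m]
    rows = [r for m in relevant for r in partitions[m]
            if r["day"] >= start and end >= r["day"]]
    return sum(r["amount"] for r in rows), relevant

total, scanned = total_between("2025-02-01", "2025-03-31")
print(total, scanned)  # 45.0 ['2025-02', '2025-03'] -- January was never touched
```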
&lt;h2&gt;Data Loading and Refresh Patterns&lt;/h2&gt;
&lt;p&gt;Getting data into the warehouse is typically done through ETL or ELT processes. These pipelines extract data from source systems, apply transformations, and load the result into warehouse tables.&lt;/p&gt;
&lt;p&gt;Loading can happen in batches—say, every hour or once a day—or in micro-batches that simulate near-real-time ingestion. The right frequency depends on your business needs and the capabilities of your orchestration tools.&lt;/p&gt;
&lt;p&gt;Incremental loading is often preferred over full reloads. By only processing new or changed records, pipelines reduce load times and warehouse compute costs. This usually requires tracking change data through mechanisms like timestamps or change data capture (CDC).&lt;/p&gt;
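&lt;p&gt;The timestamp-watermark variant of incremental loading can be sketched in a few lines. This is a simplified illustration (real pipelines persist the watermark durably and handle deletes and updates), with invented field names:&lt;/p&gt;

```python
def incremental_load(source_rows, target, state):
    """Load only rows whose updated_at is newer than the last watermark,
    then advance the watermark for the next run."""
    wm = state.get("watermark", "")
    new = [r for r in source_rows if r["updated_at"] > wm]
    target.extend(new)
    if new:
        state["watermark"] = max(r["updated_at"] for r in new)
    return len(new)

source = [
    {"id": 1, "updated_at": "2025-05-01T10:00"},
    {"id": 2, "updated_at": "2025-05-01T11:00"},
]
target, state = [], {}
print(incremental_load(source, target, state))   # 2 -- first run is a full load
source.append({"id": 3, "updated_at": "2025-05-02T09:00"})
print(incremental_load(source, target, state))   # 1 -- only the new row moves
```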
&lt;h2&gt;Warehouse Technologies&lt;/h2&gt;
&lt;p&gt;Several platforms dominate the modern data warehousing space, each with its strengths.&lt;/p&gt;
&lt;p&gt;Snowflake offers a fully managed, multi-cluster architecture with automatic scaling and support for semi-structured data. It separates compute from storage and supports concurrent workloads with minimal tuning.&lt;/p&gt;
&lt;p&gt;Google BigQuery is a serverless, query-on-demand platform that excels at ad hoc analytics and scales seamlessly with user demand. It’s ideal for teams that want fast performance without managing infrastructure.&lt;/p&gt;
&lt;p&gt;Amazon Redshift provides deep integration with the AWS ecosystem and allows more control over configuration, which can be valuable for teams with specific performance tuning needs.&lt;/p&gt;
&lt;p&gt;Each of these platforms supports ANSI SQL, integrates with major BI tools, and offers features for security, monitoring, and data governance.&lt;/p&gt;
&lt;h2&gt;Wrapping Up&lt;/h2&gt;
&lt;p&gt;A data warehouse isn’t just a place to store data—it’s the system of record for analytics. Its structure, performance, and accessibility determine how quickly stakeholders can make informed decisions.&lt;/p&gt;
&lt;p&gt;Designing and maintaining an effective warehouse requires a thoughtful approach to modeling, data loading, and performance tuning. As your organization grows, so do the expectations placed on your warehouse to handle increasing complexity, scale, and demand for real-time insight.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore how data lakes differ from warehouses, and how they offer a flexible, scalable foundation for managing large volumes of diverse data types.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Lakes Explained</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-08/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-08/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data volumes grow and the types of data organizations work with become more varied, traditional data warehouses start to show their limits. Structured data fits neatly into tables, but what about videos, logs, images, or JSON documents with unpredictable formats? This is where the concept of a data lake comes into play.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what a data lake is, how it compares to a data warehouse, and why it’s become a cornerstone of modern data architecture.&lt;/p&gt;
&lt;h2&gt;What is a Data Lake?&lt;/h2&gt;
&lt;p&gt;A data lake is a centralized repository designed to store data in its raw form. Whether the data is structured like CSV files, semi-structured like JSON, or unstructured like text or images, the lake accepts it all. It acts as a catch-all layer for every piece of data an organization might want to use for analysis, training models, or historical archiving.&lt;/p&gt;
&lt;p&gt;Unlike a data warehouse, which expects a predefined schema and consistent structure, a data lake embraces flexibility. The idea is to collect the data first and figure out how to use it later—a principle often referred to as schema-on-read.&lt;/p&gt;
&lt;p&gt;This approach enables data engineers and scientists to access and experiment with data that hasn’t yet been modeled or cleaned. It fosters innovation by removing upfront constraints about how data should look.&lt;/p&gt;
&lt;h2&gt;Key Characteristics&lt;/h2&gt;
&lt;p&gt;At its core, a data lake is built on inexpensive, scalable storage—typically object storage like Amazon S3, Azure Data Lake Storage, or Google Cloud Storage. These systems offer the capacity to store petabytes of data without the overhead of traditional database systems.&lt;/p&gt;
&lt;p&gt;Because lakes deal with raw data, they don’t enforce strict schemas when data is written. Instead, structure is applied at query time. This allows different teams to interpret the same data in different ways, depending on the analysis they want to perform.&lt;/p&gt;
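&lt;p&gt;A small sketch shows what schema-on-read looks like in practice. The raw events below are invented, and in a real lake they would be files in object storage rather than strings, but the principle is the same: nothing is validated on write, and each reader imposes the shape it cares about at query time.&lt;/p&gt;

```python
import json

# Raw events land in the lake exactly as produced; no schema is enforced on write.
raw_lines = [
    '{"user": "u1", "action": "click", "ts": 1714550000}',
    '{"user": "u2", "action": "view"}',
    '{"device": "sensor-9", "temp_c": 21.4}',
]

def clickstream_view(lines):
    """One team's schema-on-read: keep only user actions, defaulting missing fields."""
    for line in lines:
        rec = json.loads(line)
        if "user" in rec:
            yield {"user": rec["user"],
                   "action": rec["action"],
                   "ts": rec.get("ts", 0)}   # tolerate fields the producer omitted

views = list(clickstream_view(raw_lines))
print(views)  # the sensor record is simply ignored by this particular reader
```

&lt;p&gt;Another team could read the very same lines and extract only the sensor records, with no coordination required between the two interpretations.&lt;/p&gt;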
&lt;p&gt;This flexibility is powerful, but it comes with a cost: governance becomes more challenging. Without strong metadata management and data cataloging, lakes can quickly turn into what’s often called a “data swamp”—a cluttered repository that’s hard to navigate or trust.&lt;/p&gt;
&lt;h2&gt;Data Lakes vs Data Warehouses&lt;/h2&gt;
&lt;p&gt;The primary difference between data lakes and data warehouses lies in structure and purpose.&lt;/p&gt;
&lt;p&gt;Data warehouses are optimized for structured data, curated models, and consistent performance. They serve business users who need reliable access to cleaned, aggregated data for dashboards and reports.&lt;/p&gt;
&lt;p&gt;Data lakes are optimized for scale and flexibility. They support raw data, including logs, sensor output, and third-party feeds, making them ideal for machine learning and advanced analytics. While a warehouse is all about predefined questions and structured answers, a lake is about exploration and experimentation.&lt;/p&gt;
&lt;p&gt;In practice, many organizations use both. The lake acts as the foundation, storing everything, while the warehouse sits on top as a refined layer for operational analytics. This layered architecture sets the stage for more advanced approaches, such as the data lakehouse, which we&apos;ll explore later in this series.&lt;/p&gt;
&lt;h2&gt;Building and Managing a Data Lake&lt;/h2&gt;
&lt;p&gt;Creating a data lake involves more than dumping files into storage. A well-functioning lake includes clear organization, access controls, and metadata layers that describe what each dataset is, where it came from, and how it’s used.&lt;/p&gt;
&lt;p&gt;Data is often organized into zones. A raw zone stores unprocessed source data. A staging or clean zone contains transformed and validated datasets. A curated zone includes data that’s ready for consumption by analysts or applications.&lt;/p&gt;
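&lt;p&gt;The promotion from one zone to the next is usually gated by validation. Here’s a minimal sketch of that gate, with hypothetical field names; records that fail the checks are quarantined rather than allowed into the clean zone:&lt;/p&gt;

```python
# Conventional zone layout, sketched as path prefixes (names are illustrative):
#   raw/      source data exactly as received
#   clean/    validated, typed, deduplicated
#   curated/  modeled and ready for consumption
def promote_to_clean(record):
    """Validate a raw record before it may enter the clean zone."""
    required = ("order_id", "amount")
    if not all(k in record for k in required):
        return None                          # quarantine instead of loading
    record["amount"] = float(record["amount"])   # enforce types on the way in
    return record

print(promote_to_clean({"order_id": 7, "amount": "19.99"}))  # typed and accepted
print(promote_to_clean({"amount": "3.50"}))                  # None -- rejected
```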
&lt;p&gt;Maintaining this structure helps manage lifecycle policies, access permissions, and lineage. Cataloging tools like AWS Glue, Apache Hive Metastore, or more modern solutions like Amundsen or DataHub help track what’s in the lake and make it discoverable.&lt;/p&gt;
&lt;p&gt;Processing engines like Apache Spark, Presto, or Dremio allow users to query data directly in the lake, using SQL or custom logic. These tools interpret files stored in formats like Parquet, ORC, or Avro, applying structure dynamically based on metadata or inferred schema.&lt;/p&gt;
&lt;h2&gt;When to Use a Data Lake&lt;/h2&gt;
&lt;p&gt;A data lake makes the most sense when you’re dealing with large volumes of diverse data types or when you&apos;re unsure how the data will be used. It’s particularly valuable in environments focused on research, machine learning, or combining traditional business data with less conventional sources like social media or IoT signals.&lt;/p&gt;
&lt;p&gt;However, if you need consistent, curated data for business reporting, a warehouse may be the better choice. Data lakes and warehouses serve different needs, and understanding how they complement each other is key to building a balanced architecture.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll look at storage formats and compression—essential building blocks for making data lakes and warehouses efficient, scalable, and cost-effective.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Storage Formats and Compression</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-09/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-09/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When working with large-scale data systems, it&apos;s not just what data you store that matters—it&apos;s how you store it. The choice of storage format and compression strategy can make a significant difference in performance, cost, and usability. These decisions affect how quickly you can query data, how much storage space you need, and even how compatible your data is with various processing tools.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore the most common data storage formats, the role of compression, and how these choices impact modern data engineering workflows.&lt;/p&gt;
&lt;h2&gt;Why Storage Format Matters&lt;/h2&gt;
&lt;p&gt;Raw data often arrives in simple formats like CSV or JSON, and for small volumes, these formats work just fine. But as data grows into gigabytes or terabytes, inefficiencies start to show.&lt;/p&gt;
&lt;p&gt;Text-based formats like CSV are easy to read and parse, but they lack schema enforcement, are verbose, and are slow to process in distributed systems. JSON adds some flexibility by allowing nested structures, but it can still be quite large and inefficient when stored at scale.&lt;/p&gt;
&lt;p&gt;Columnar formats, by contrast, are designed for analytics. Instead of storing data row by row, they store values column by column. This layout enables faster queries and better compression—especially for workloads that scan only a few columns at a time.&lt;/p&gt;
&lt;p&gt;Imagine a table with hundreds of columns, but your query only needs five. With a row-based format, the system must read everything. With a columnar format, it reads just what’s needed. This is a game-changer for performance and cost in systems like data lakes and warehouses.&lt;/p&gt;
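&lt;p&gt;A toy sketch of that row-versus-column trade-off (this is an illustration in plain Python, not a real file format; the field names are made up):&lt;/p&gt;

```python
# The same table stored row-wise vs column-wise, and what a query
# touching only two fields must actually scan in each layout.
rows = [
    {"id": 1, "region": "east", "amount": 120.0, "notes": "x" * 50},
    {"id": 2, "region": "west", "amount": 80.0,  "notes": "y" * 50},
]

# Row layout: every query scans whole records, unused fields included.
row_scan = sum(len(str(r)) for r in rows)

# Column layout: each column is stored separately, so the query reads
# only the "region" and "amount" columns.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_scan = len(str(columns["region"])) + len(str(columns["amount"]))

print(row_scan > col_scan)  # the columnar scan touches far fewer bytes
```

&lt;p&gt;With hundreds of columns instead of four, the gap widens dramatically, which is exactly why analytic engines prefer columnar formats.&lt;/p&gt;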
&lt;h2&gt;Common Formats in Practice&lt;/h2&gt;
&lt;p&gt;Several formats are widely used in data engineering, each with trade-offs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CSV&lt;/strong&gt; remains popular due to its simplicity and universal support. But it lacks strong typing and is prone to edge-case issues, such as inconsistent delimiters or quoting problems. It&apos;s best used for small datasets or temporary interoperability.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JSON&lt;/strong&gt; and &lt;strong&gt;XML&lt;/strong&gt; are useful for semi-structured data. JSON, in particular, is common in APIs and logs. However, it’s not space-efficient and can be slow to parse at scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parquet&lt;/strong&gt; is an open source columnar format governed by the Apache Software Foundation. It&apos;s optimized for big data workloads and supports advanced features like nested schemas and predicate pushdown. Parquet is well-supported across tools like Spark, Hive, Dremio, and data warehouses like BigQuery and Snowflake.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Avro&lt;/strong&gt; is a row-based format with support for schema evolution. It’s often used in streaming applications and data serialization. While it’s not as query-efficient as Parquet, it excels in write-heavy and messaging scenarios.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ORC&lt;/strong&gt; (Optimized Row Columnar) is similar to Parquet but was originally developed for Hive in the Hadoop ecosystem. It offers strong compression and performance benefits for read-heavy workloads.&lt;/p&gt;
&lt;p&gt;Choosing between these often comes down to the nature of the workload. If you&apos;re doing analytics over large datasets, columnar formats like Parquet or ORC are usually the right call. If you&apos;re capturing events or streaming messages, Avro might be a better fit.&lt;/p&gt;
&lt;h2&gt;The Role of Compression&lt;/h2&gt;
&lt;p&gt;Compression reduces file sizes by encoding repeated or predictable patterns more efficiently. In distributed systems, this saves both storage space and network bandwidth, speeding up data movement and reducing cost.&lt;/p&gt;
&lt;p&gt;Compression can be applied at the file level or at the column level (in columnar formats). Modern formats like Parquet support multiple compression codecs, including Snappy, Gzip, Brotli, and Zstd.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snappy&lt;/strong&gt; offers fast compression and decompression, making it a good default choice when speed matters more than maximum size reduction. &lt;strong&gt;Gzip&lt;/strong&gt; provides better compression ratios but is slower. &lt;strong&gt;Zstd&lt;/strong&gt; and &lt;strong&gt;Brotli&lt;/strong&gt; strike a balance, with Zstd in particular approaching Gzip-level ratios at considerably higher speeds.&lt;/p&gt;
&lt;p&gt;When choosing a compression strategy, consider the use case. For interactive querying, speed matters, so faster codecs like Snappy are preferred. For archival data or large transfers, stronger compression may save more money in the long run.&lt;/p&gt;
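&lt;p&gt;You can see the speed-versus-size trade-off with Python&apos;s standard-library &lt;code&gt;zlib&lt;/code&gt; module, which implements DEFLATE, the algorithm behind Gzip. Higher levels shrink the output more but take longer, mirroring the Snappy-versus-Gzip choice described above (the sample data here is invented):&lt;/p&gt;

```python
import time
import zlib

# Repetitive, log-like sample data compresses well at any level.
data = b"2025-05-02,store_17,SKU-0042,unit_price=9.99\n" * 20_000

for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1e3
    print(f"level={level} ratio={len(data) / len(out):.1f}x time={elapsed_ms:.1f}ms")
```

&lt;p&gt;Running a quick benchmark like this on a sample of your own data is often the fastest way to pick a codec for a given workload.&lt;/p&gt;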
&lt;h2&gt;Compatibility and Ecosystem Support&lt;/h2&gt;
&lt;p&gt;Storage format decisions also impact which tools you can use. Most modern data tools support Parquet and Avro natively, but compatibility can vary depending on the processing engine.&lt;/p&gt;
&lt;p&gt;For example, if you&apos;re building a data lake on S3 and using Apache Spark for processing, Parquet is almost always a safe choice. It integrates well with tools like Hive Metastore, Presto, Trino, and Dremio.&lt;/p&gt;
&lt;p&gt;If you’re using Kafka or other message queues, Avro is a common format due to its compactness and schema registry support.&lt;/p&gt;
&lt;p&gt;It’s also worth considering schema evolution—how well a format handles changes in the data structure over time. Avro and Parquet both support schema evolution, which allows you to add or remove fields without breaking downstream systems. This is crucial in agile environments where data changes frequently.&lt;/p&gt;
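&lt;p&gt;A minimal sketch of what reader-side schema evolution looks like: records written under an old schema are resolved against the current one, with new fields filled in from defaults. This is how Avro reconciles writer and reader schemas conceptually; the field names and defaults below are invented for illustration:&lt;/p&gt;

```python
# Current (reader) schema: field name mapped to its default value.
# "loyalty_tier" was added after the old records were written.
new_schema_defaults = {"id": None, "email": None, "loyalty_tier": "none"}

def read_with_schema(record: dict) -> dict:
    # Missing new fields get defaults; fields dropped from the schema
    # are simply ignored, so old and new records coexist safely.
    return {name: record.get(name, default)
            for name, default in new_schema_defaults.items()}

old_record = {"id": 7, "email": "a@example.com"}  # predates loyalty_tier
print(read_with_schema(old_record))
# {'id': 7, 'email': 'a@example.com', 'loyalty_tier': 'none'}
```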
&lt;h2&gt;Putting It All Together&lt;/h2&gt;
&lt;p&gt;The best storage strategy balances performance, flexibility, and compatibility. There’s no one-size-fits-all answer, but understanding the characteristics of each format—and how compression affects storage and query speed—allows you to make informed choices.&lt;/p&gt;
&lt;p&gt;As data engineers, our job is to pick the right tools for the job, not just default to what’s familiar. Thoughtful decisions at the storage layer can ripple across the entire data stack, affecting cost, speed, and scalability.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll turn our attention to data quality and validation—because no matter how well your data is stored, it’s only as good as it is accurate, complete, and trustworthy.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Quality and Validation</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-10/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-10/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In any data system, quality is not optional—it’s foundational. No matter how scalable your architecture is, or how fast your queries run, if the underlying data is inaccurate, incomplete, or inconsistent, the results will be misleading. And bad data leads to bad decisions.&lt;/p&gt;
&lt;p&gt;This post focuses on data quality and validation. We&apos;ll look at what makes data &amp;quot;good,&amp;quot; why quality issues emerge, and how engineers can build checks and balances into pipelines to ensure the reliability of their datasets.&lt;/p&gt;
&lt;h2&gt;Defining Data Quality&lt;/h2&gt;
&lt;p&gt;At its core, data quality is about trust. Can the data be used confidently for reporting, analytics, or decision-making? While quality is a broad concept, it typically includes several dimensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;: Does the data reflect reality? For example, does a customer record show the correct name and email?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Completeness&lt;/strong&gt;: Are all required fields populated? Missing data can render entire records useless.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Is the data uniform across systems? If two systems say different things about the same event, which one is right?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timeliness&lt;/strong&gt;: Is the data fresh enough for its intended purpose? A report showing yesterday’s numbers might be fine—or it might be too late.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uniqueness&lt;/strong&gt;: Are there duplicate records that shouldn’t exist?&lt;/li&gt;
&lt;/ul&gt;
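&lt;p&gt;Several of these dimensions are straightforward to measure in code. A minimal sketch, computing completeness and uniqueness for a batch of customer records (field names are illustrative):&lt;/p&gt;

```python
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},             # incomplete record
    {"id": 2, "email": "b@example.com"},  # duplicate id
]

def quality_report(rows, required=("id", "email"), key="id"):
    total = len(rows)
    # Completeness: share of rows with every required field populated.
    complete = sum(all(r.get(f) is not None for f in required) for r in rows)
    # Uniqueness: 1.0 means no duplicate keys.
    unique_keys = len({r[key] for r in rows})
    return {"completeness": complete / total, "uniqueness": unique_keys / total}

print(quality_report(records))
```

&lt;p&gt;Metrics like these become useful once they&apos;re tracked over time and wired to alerts, which is exactly what the tools discussed below automate.&lt;/p&gt;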
&lt;p&gt;These attributes form the foundation of what we think of as “high-quality” data. But quality isn&apos;t static—it needs to be monitored continuously.&lt;/p&gt;
&lt;h2&gt;Where Data Quality Breaks Down&lt;/h2&gt;
&lt;p&gt;Quality issues usually arise at system boundaries. When data moves from one source to another—say, from a transactional database to a warehouse or from an API to a data lake—transformations, encoding issues, and format mismatches can cause subtle errors.&lt;/p&gt;
&lt;p&gt;Sometimes data is flawed at the source. A user enters a malformed email address, or a sensor transmits faulty readings due to hardware glitches. Other times, issues emerge downstream, such as when a pipeline fails silently or when schema changes aren’t communicated across teams.&lt;/p&gt;
&lt;p&gt;Even well-designed systems can encounter quality problems if the underlying business logic evolves. For example, a rule that defines how revenue is calculated may change, invalidating previous calculations if pipelines aren’t updated accordingly.&lt;/p&gt;
&lt;h2&gt;The Role of Validation&lt;/h2&gt;
&lt;p&gt;To combat these issues, validation is key. Validation is the act of checking data against expected rules and assumptions—often before it gets loaded into a final destination.&lt;/p&gt;
&lt;p&gt;This can happen at multiple stages of a pipeline. During ingestion, validation might confirm that all required fields are present and formatted correctly. During transformation, it might enforce business rules, such as ensuring that order totals are positive or that timestamps are within reasonable ranges.&lt;/p&gt;
&lt;p&gt;Validation can be passive—logging anomalies for review—or active, stopping a pipeline if thresholds are exceeded. Both approaches have their place. In some cases, it&apos;s better to allow partial data to flow through and alert the team. In others, it’s critical to block the update to prevent contamination of production datasets.&lt;/p&gt;
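&lt;p&gt;The passive/active distinction can be as simple as a flag on the same rule: log the anomaly and continue, or raise and halt the pipeline. A hedged sketch (the order data and rule are invented):&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.WARNING)

def validate_order_totals(orders, strict=False):
    # Business rule from the text: order totals must be positive.
    bad = [o for o in orders if not o["total"] > 0]
    if bad and strict:
        # Active validation: block the load entirely.
        raise ValueError(f"{len(bad)} orders with non-positive totals")
    for order in bad:
        # Passive validation: surface the anomaly for review.
        logging.warning("anomalous order: %r", order)
    return len(bad)

orders = [{"id": 1, "total": 42.5}, {"id": 2, "total": -3.0}]
print(validate_order_totals(orders))  # passive mode logs and continues
```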
&lt;h2&gt;Tools for Data Quality&lt;/h2&gt;
&lt;p&gt;Several tools and frameworks have emerged to help engineers define, monitor, and enforce data quality checks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Great Expectations&lt;/strong&gt; is one of the most well-known. It allows you to define “expectations” about your data—essentially, assertions about what should be true. These expectations can be validated at runtime, and the results can be logged, visualized, or used to trigger alerts.&lt;/p&gt;
&lt;p&gt;Another option is &lt;strong&gt;Amazon Deequ&lt;/strong&gt;, a library built on top of Apache Spark that performs similar validations at scale. It’s particularly useful in large distributed environments where running manual checks would be too costly.&lt;/p&gt;
&lt;p&gt;Some orchestration platforms, like Airflow and Dagster, support custom sensors or hooks that let you embed validation logic directly into the DAG. This tight integration makes it easier to halt jobs or notify teams when something goes wrong.&lt;/p&gt;
&lt;p&gt;Beyond tools, quality also depends on process. Data contracts, code reviews, and automated testing all contribute to building a culture where quality is prioritized from the start, not added as an afterthought.&lt;/p&gt;
&lt;h2&gt;Designing for Trust&lt;/h2&gt;
&lt;p&gt;A key principle in data engineering is that quality doesn&apos;t just happen—it must be designed. That means proactively defining what “correct” looks like, instrumenting checks, and making sure failures are surfaced early.&lt;/p&gt;
&lt;p&gt;Dashboards and data catalogs can help surface issues. But even more important is visibility: stakeholders need to know when data is delayed, incomplete, or incorrect. Setting up alerts based on data quality metrics helps teams respond quickly before problems reach downstream consumers.&lt;/p&gt;
&lt;p&gt;The cost of low-quality data isn&apos;t just technical—it&apos;s strategic. If users lose faith in the data, they stop relying on it. And once trust is gone, it’s incredibly hard to rebuild.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll examine how metadata, lineage, and governance play a role in maintaining data integrity across complex systems. Knowing where your data came from and how it was transformed is just as important as validating its contents.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Metadata, Lineage, and Governance</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-11/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-11/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data systems grow more complex, understanding where your data came from, how it has changed, and who is responsible for it becomes just as critical as the data itself. It’s not enough to know that a dataset exists—you need to know how it was created, whether it’s trustworthy, and how it fits into the broader system.&lt;/p&gt;
&lt;p&gt;In this post, we’ll break down three interconnected concepts—metadata, data lineage, and governance—and explore why they’re essential to building transparent, scalable, and compliant data infrastructure.&lt;/p&gt;
&lt;h2&gt;What Is Metadata?&lt;/h2&gt;
&lt;p&gt;Metadata is data about data. It describes the contents, structure, and context of a dataset, giving you the information needed to understand how to work with it.&lt;/p&gt;
&lt;p&gt;At the most basic level, metadata includes things like column names, data types, and row counts. But it can go much deeper. Metadata can describe data freshness (when it was last updated), sensitivity (whether it contains personally identifiable information), and ownership (who created or maintains the dataset).&lt;/p&gt;
&lt;p&gt;Well-managed metadata serves as a map to your data ecosystem. It helps engineers understand dependencies, enables analysts to find the right datasets, and assists compliance teams in locating sensitive information.&lt;/p&gt;
&lt;p&gt;Without metadata, even high-quality data becomes hard to use. Teams end up duplicating effort, making incorrect assumptions, or spending more time asking questions than building insights.&lt;/p&gt;
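&lt;p&gt;To make the idea concrete, here is a minimal sketch of a metadata record capturing the attributes discussed above: structure, freshness, sensitivity, and ownership. The shape and field names are illustrative, not any particular catalog&apos;s schema:&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetMetadata:
    name: str
    columns: dict        # column name mapped to data type
    row_count: int
    last_updated: date   # freshness
    contains_pii: bool   # sensitivity
    owner: str           # who maintains the dataset

meta = DatasetMetadata(
    name="customers",
    columns={"id": "bigint", "email": "string"},
    row_count=1_204_332,
    last_updated=date(2025, 5, 1),
    contains_pii=True,
    owner="data-platform@example.com",
)
print(meta.name, meta.contains_pii)
```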
&lt;h2&gt;Understanding Data Lineage&lt;/h2&gt;
&lt;p&gt;Data lineage is the history of how data moves and changes through your systems. It traces the path from the original source—say, a transactional database or API—all the way to its final destination in a dashboard, report, or machine learning model.&lt;/p&gt;
&lt;p&gt;Lineage tells you not just where the data is now, but how it got there. Which tables did it pass through? What transformations were applied? Was any filtering, aggregation, or enrichment performed?&lt;/p&gt;
&lt;p&gt;This visibility is crucial for several reasons. First, it helps with debugging. When a report shows an unexpected number, lineage lets you trace the logic backwards to find the source of the issue. Second, it supports impact analysis. If a schema changes in a source table, you can immediately see which downstream systems are affected.&lt;/p&gt;
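&lt;p&gt;Impact analysis is just a downstream traversal of the lineage graph: starting from the changed dataset, follow every &amp;quot;feeds into&amp;quot; edge. A minimal sketch (the dataset names are invented):&lt;/p&gt;

```python
from collections import deque

# Each dataset maps to the datasets it feeds into.
lineage = {
    "orders_db": ["staging.orders"],
    "staging.orders": ["mart.daily_revenue", "ml.churn_features"],
    "mart.daily_revenue": ["dashboard.exec_kpis"],
}

def downstream_of(node):
    # Breadth-first traversal collecting everything affected by a change.
    seen, queue = set(), deque([node])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(downstream_of("staging.orders"))
```

&lt;p&gt;Real lineage tools maintain this graph automatically from pipeline metadata, but the underlying question they answer is the same traversal.&lt;/p&gt;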
&lt;p&gt;In regulated industries, lineage is also a compliance requirement. Auditors often want to see a clear trail from raw data to final output to ensure accuracy, transparency, and accountability.&lt;/p&gt;
&lt;h2&gt;The Role of Data Governance&lt;/h2&gt;
&lt;p&gt;Data governance is the set of policies, processes, and roles that ensure data is managed responsibly across an organization. It covers who has access to what data, how it should be handled, and how changes are documented and approved.&lt;/p&gt;
&lt;p&gt;Governance is often misunderstood as being purely about control, but it’s really about enabling trust at scale. In small teams, people can rely on informal communication to manage data. In large organizations, clear governance is the only way to prevent chaos.&lt;/p&gt;
&lt;p&gt;Good governance defines roles and responsibilities. Who is the data owner? Who approves changes? Who can grant access? It also sets standards for naming, documentation, and data classification so that teams can work together without constant re-alignment.&lt;/p&gt;
&lt;p&gt;This becomes even more important in environments with sensitive data. Personally identifiable information (PII), financial records, and health data all come with legal and ethical obligations. Governance ensures these datasets are properly secured, audited, and retained only as long as necessary.&lt;/p&gt;
&lt;h2&gt;Tools and Practices&lt;/h2&gt;
&lt;p&gt;To manage metadata, lineage, and governance effectively, many organizations turn to dedicated platforms. Tools like Amundsen, DataHub, and Apache Atlas offer data cataloging and discovery features that make metadata more accessible and actionable.&lt;/p&gt;
&lt;p&gt;These platforms often integrate with processing engines and orchestration tools to automatically collect lineage. For example, if a pipeline built in Airflow or dbt modifies a dataset, the lineage graph is updated to reflect that change.&lt;/p&gt;
&lt;p&gt;But tools alone aren’t enough. Teams need practices that reinforce good habits—such as documenting changes, defining clear data ownership, and reviewing access permissions regularly.&lt;/p&gt;
&lt;p&gt;Automation can help, especially in dynamic environments where datasets are frequently added or updated. But governance must also be embedded into the culture. Engineers, analysts, and stakeholders all play a part in maintaining data integrity and clarity.&lt;/p&gt;
&lt;h2&gt;Bringing It All Together&lt;/h2&gt;
&lt;p&gt;Metadata, lineage, and governance are not isolated concerns. Together, they create a foundation for transparency and trust. They help organizations understand what data they have, how it’s being used, and whether it can be relied upon.&lt;/p&gt;
&lt;p&gt;Without this foundation, even the best-engineered pipelines can become liabilities. But with it, data becomes a strategic asset—one that teams can build on confidently, securely, and efficiently.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore how workflow orchestration ties these pieces together, enabling you to manage complex data pipelines reliably across diverse tools and systems.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Scheduling and Workflow Orchestration</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-12/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-12/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data pipelines grow in complexity, managing them manually becomes unsustainable. Whether you&apos;re running daily ETL jobs, refreshing dashboards, or processing streaming data in micro-batches, you need a way to coordinate and monitor these tasks reliably. That’s where workflow orchestration comes in.&lt;/p&gt;
&lt;p&gt;In this post, we&apos;ll explore what orchestration means in the context of data engineering, how it differs from simple job scheduling, and what tools and design patterns help keep data workflows organized, observable, and resilient.&lt;/p&gt;
&lt;h2&gt;From Scheduling to Orchestration&lt;/h2&gt;
&lt;p&gt;At the simplest level, scheduling is about running tasks at a certain time. A cron job that triggers a Python script every morning is a form of scheduling. For small pipelines with few dependencies, this can be enough.&lt;/p&gt;
&lt;p&gt;But modern data systems rarely involve just one job. Instead, they include chains of tasks—data extractions, file transformations, validation checks, and loads into various targets. These tasks have dependencies, need error handling, and often require conditional logic. This is where orchestration becomes essential.&lt;/p&gt;
&lt;p&gt;Workflow orchestration is the discipline of managing task execution across a defined sequence, ensuring that tasks run in the correct order, on time, and with awareness of success or failure. It&apos;s not just about launching scripts—it&apos;s about understanding how those scripts relate to one another, how they behave under different conditions, and how to recover when something goes wrong.&lt;/p&gt;
&lt;h2&gt;Directed Acyclic Graphs (DAGs)&lt;/h2&gt;
&lt;p&gt;Most orchestration systems use the concept of a Directed Acyclic Graph (DAG) to represent workflows. In a DAG, each node represents a task, and edges represent dependencies. The &amp;quot;acyclic&amp;quot; part means there are no loops—each task runs once, and the flow moves in one direction.&lt;/p&gt;
&lt;p&gt;This structure allows you to define complex workflows declaratively. For example, you might define a pipeline where data is first extracted from an API, then validated, transformed, and finally loaded into a data warehouse. If any step fails, the system can stop the pipeline, alert the team, or retry the task based on configuration.&lt;/p&gt;
&lt;p&gt;DAGs also make it easier to track the status of each component. You can visualize which tasks succeeded, which are still running, and where failures occurred. This visibility is crucial for maintaining trust in your data pipelines.&lt;/p&gt;
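&lt;p&gt;The core mechanic is small enough to show with Python&apos;s standard-library &lt;code&gt;graphlib&lt;/code&gt; module (Python 3.9+): declare each task&apos;s dependencies, and a topological sort yields a valid run order. This is a sketch of the concept, not any particular orchestrator&apos;s API:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"validate"},
    "load": {"transform"},
}

# static_order() yields tasks so every dependency runs first.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'validate', 'transform', 'load']
```

&lt;p&gt;Orchestrators layer scheduling, retries, and monitoring on top, but this dependency-respecting ordering is the foundation they all share.&lt;/p&gt;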
&lt;h2&gt;Common Orchestration Tools&lt;/h2&gt;
&lt;p&gt;Several orchestration frameworks have become standard in the data engineering ecosystem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt; is one of the most widely adopted tools. It allows users to define DAGs using Python code, which makes it highly flexible and programmable. Airflow includes scheduling, retries, logging, and a web UI for monitoring workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prefect&lt;/strong&gt; takes a modern approach by separating the orchestration layer from execution, which makes it more cloud-native and resilient to task failures. Prefect’s focus on observability and developer experience has made it popular for teams managing dynamic workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dagster&lt;/strong&gt; emphasizes data assets and type safety. It treats data pipelines as modular, testable units and integrates tightly with modern tooling, including dbt and cloud environments.&lt;/p&gt;
&lt;p&gt;Each of these tools supports task dependencies, conditional logic, parallelism, and failure recovery. Choosing the right one often comes down to team preference, operational needs, and ecosystem compatibility.&lt;/p&gt;
&lt;h2&gt;Best Practices in Workflow Design&lt;/h2&gt;
&lt;p&gt;Designing orchestration workflows requires more than chaining tasks together. Robust pipelines include thoughtful handling of edge cases and clear observability. That means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using retries and timeouts to deal with flaky services or transient failures.&lt;/li&gt;
&lt;li&gt;Logging meaningful output so that issues can be diagnosed quickly.&lt;/li&gt;
&lt;li&gt;Isolating tasks so that a failure in one part doesn’t compromise unrelated workflows.&lt;/li&gt;
&lt;li&gt;Tagging or labeling jobs by function or owner to improve maintainability.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It also means thinking about idempotency. Tasks should be safe to rerun if needed. For example, a data load job that inserts duplicate rows each time it runs will cause problems if retried. Designing tasks to either overwrite cleanly or check for prior completion helps prevent these issues.&lt;/p&gt;
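&lt;p&gt;One common way to get idempotency is the &amp;quot;check for prior completion&amp;quot; pattern: record which run identifiers have already loaded, so a retry becomes a no-op. A minimal sketch, where an in-memory set stands in for a real state store:&lt;/p&gt;

```python
completed_runs = set()  # in production this lives in durable storage
warehouse = []

def load_partition(run_id, rows):
    if run_id in completed_runs:
        # Prior completion check: retrying the same run changes nothing.
        return 0
    warehouse.extend(rows)
    completed_runs.add(run_id)
    return len(rows)

load_partition("2025-05-02", [{"order": 1}, {"order": 2}])
load_partition("2025-05-02", [{"order": 1}, {"order": 2}])  # safe retry
print(len(warehouse))  # 2, not 4
```

&lt;p&gt;Overwrite-style loads (replacing a whole partition each run) achieve the same property without the bookkeeping, at the cost of rewriting data.&lt;/p&gt;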
&lt;p&gt;Another key practice is modularity. Instead of building large monolithic DAGs, break workflows into reusable components. This makes it easier to test, maintain, and scale your pipelines as your data ecosystem evolves.&lt;/p&gt;
&lt;h2&gt;Observability and Alerting&lt;/h2&gt;
&lt;p&gt;A well-orchestrated pipeline doesn’t just run—it tells you how it’s running. Observability is about surfacing the right information at the right time so that engineers can respond to issues quickly.&lt;/p&gt;
&lt;p&gt;Good orchestration tools provide dashboards, logs, and metrics. But equally important are alerts that notify the right people when something goes wrong. Alerts should be actionable and avoid noise. A system that sends alerts on every minor warning will eventually be ignored.&lt;/p&gt;
&lt;p&gt;Integrating with monitoring platforms like Prometheus, Grafana, or external alerting tools like PagerDuty or Slack helps ensure that teams can respond to problems before they affect end users.&lt;/p&gt;
&lt;h2&gt;Orchestration as the Backbone&lt;/h2&gt;
&lt;p&gt;Workflow orchestration isn’t just a technical layer—it’s the backbone of operational data systems. It connects ingestion, transformation, validation, and delivery in a reliable and auditable way. When done well, it turns complex processes into predictable, repeatable workflows that teams can build on confidently.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore how to build scalable pipelines, including how to think about performance, parallelism, and distribution when dealing with large or fast-growing datasets.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Building Scalable Pipelines</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-13/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-13/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data volumes increase and workflows grow more interconnected, the ability to build scalable data pipelines becomes essential. It&apos;s not enough for a pipeline to work—it needs to keep working as data grows from gigabytes to terabytes, as new sources are added, and as more users rely on the output for decision-making.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what makes a pipeline scalable, the principles behind designing for growth, and the tools and patterns that data engineers use to manage complexity at scale.&lt;/p&gt;
&lt;h2&gt;What Do We Mean by Scalability?&lt;/h2&gt;
&lt;p&gt;Scalability is about more than just performance. It&apos;s the ability of a system to maintain its functionality and responsiveness as load increases. In the context of data pipelines, this means handling larger datasets, higher data velocity, and more frequent processing without constant reengineering.&lt;/p&gt;
&lt;p&gt;A scalable pipeline gracefully adapts to changes in data size, structure, and frequency. It’s designed in a modular way, so that bottlenecks can be addressed without rewriting the entire system. And it’s observable and maintainable, so issues can be diagnosed before they affect users.&lt;/p&gt;
&lt;p&gt;Scalability also involves cost efficiency. Throwing more resources at a slow pipeline might fix the symptoms, but a well-designed system scales intelligently, minimizing unnecessary computation and data movement.&lt;/p&gt;
&lt;h2&gt;Parallelism and Distribution&lt;/h2&gt;
&lt;p&gt;One of the core principles behind scalability is parallelism—the ability to split work into independent chunks that can be processed simultaneously.&lt;/p&gt;
&lt;p&gt;In batch workflows, this might mean partitioning data by date or region and processing each partition in parallel. In streaming systems, it means dividing incoming data into partitions or shards that are consumed by multiple workers.&lt;/p&gt;
&lt;p&gt;Distributed computing frameworks like Apache Spark, Flink, and Dask are designed with this in mind. They break down data into smaller units, distribute them across a cluster of machines, and execute tasks in parallel, tracking dependencies and ensuring consistency across the system.&lt;/p&gt;
&lt;p&gt;But parallelism introduces its own challenges. Data skew—when one partition is significantly larger than others—can lead to uneven workloads and poor performance. Effective partitioning strategies and thoughtful job configuration are key to maintaining balance.&lt;/p&gt;
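&lt;p&gt;These ideas can be sketched in a few lines of Python (the field names and the skew threshold are illustrative): group records by a partition key, process partitions concurrently, and flag skew by comparing partition sizes against the average:&lt;/p&gt;

```python
# Sketch: partition records by key, process partitions in parallel, and
# flag skew when one partition dwarfs the others. Names are illustrative.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition(records, key):
    parts = defaultdict(list)
    for r in records:
        parts[r[key]].append(r)
    return parts

def process(part):  # placeholder for real per-partition work
    return sum(r["value"] for r in part)

records = [{"region": "us", "value": 1}] * 8 + [{"region": "eu", "value": 1}] * 2
parts = partition(records, "region")

# Simple skew check: largest partition vs. 1.5x the average size
sizes = [len(p) for p in parts.values()]
skewed = max(sizes) > 1.5 * (sum(sizes) / len(sizes))

with ThreadPoolExecutor() as pool:
    totals = dict(zip(parts, pool.map(process, parts.values())))
```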
&lt;h2&gt;Minimizing Data Movement&lt;/h2&gt;
&lt;p&gt;Another aspect of scalability is reducing how often and how far data moves. Every transfer across a network or system boundary adds latency, cost, and potential failure points.&lt;/p&gt;
&lt;p&gt;Where possible, pipelines should process data close to where it&apos;s stored. For example, using a query engine like Dremio or Presto to query data directly from object storage avoids the overhead of loading it into a warehouse first.&lt;/p&gt;
&lt;p&gt;Materializing only what’s needed, caching intermediate results, and pushing filters down into source systems are all ways to reduce unnecessary computation and movement.&lt;/p&gt;
&lt;p&gt;Streaming pipelines, in particular, benefit from minimizing state size and using windowed processing, so that each event is handled quickly and discarded once processed.&lt;/p&gt;
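&lt;p&gt;A tumbling-window aggregation illustrates the point: each event is folded into its window's running total and then dropped, so state stays proportional to the number of open windows rather than the number of events. This is a toy sketch, not any particular engine's API:&lt;/p&gt;

```python
# Toy tumbling-window aggregation: fold each event into its window's
# total, then discard the event, keeping state small.
def tumbling_windows(events, width):
    """events: iterable of (timestamp, value) pairs; width: window size."""
    totals = {}
    for ts, value in events:
        window_start = (ts // width) * width
        totals[window_start] = totals.get(window_start, 0) + value
    return totals

stream = [(0, 1), (3, 2), (7, 4), (11, 8)]
assert tumbling_windows(stream, 5) == {0: 3, 5: 4, 10: 8}
```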
&lt;h2&gt;Managing Resources&lt;/h2&gt;
&lt;p&gt;Scalable pipelines require careful resource management. Compute, memory, and I/O all need to be provisioned in a way that meets demand without excessive overhead.&lt;/p&gt;
&lt;p&gt;Autoscaling, used in many cloud-native environments, allows processing clusters to grow and shrink based on workload. This is especially valuable for unpredictable or bursty workloads, where fixed infrastructure would either be overwhelmed or sit idle.&lt;/p&gt;
&lt;p&gt;Monitoring and alerting tools provide visibility into where resources are being used inefficiently. Long-running jobs, slow joins, or excessive data shuffles can all indicate areas where performance tuning is needed.&lt;/p&gt;
&lt;p&gt;Tuning batch sizes, controlling concurrency, and using backpressure mechanisms in streaming systems help maintain throughput without overloading infrastructure.&lt;/p&gt;
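&lt;p&gt;Backpressure is easy to see with a bounded buffer. In this sketch, a small fixed-size queue makes the producer block whenever the consumer falls behind, bounding memory instead of letting the backlog grow without limit:&lt;/p&gt;

```python
# Sketch: a bounded queue provides natural backpressure. When the
# consumer falls behind, put() blocks the producer instead of letting
# the in-flight backlog grow unbounded.
import queue
import threading

buf = queue.Queue(maxsize=4)   # small buffer bounds in-flight work
results = []

def consumer():
    while True:
        item = buf.get()
        if item is None:       # sentinel: shut down
            break
        results.append(item * 2)
        buf.task_done()

t = threading.Thread(target=consumer)
t.start()
for i in range(100):
    buf.put(i)                 # blocks whenever the buffer is full
buf.put(None)
t.join()
assert len(results) == 100
```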
&lt;h2&gt;Designing for Change&lt;/h2&gt;
&lt;p&gt;Scalability isn’t just about today’s workload—it’s about tomorrow’s. Data pipelines should be designed to evolve.&lt;/p&gt;
&lt;p&gt;This means avoiding hard-coded assumptions about schema, partitions, or file sizes. It means using configuration over code where possible, and abstracting logic into reusable modules that can be adapted as requirements shift.&lt;/p&gt;
&lt;p&gt;Schema evolution support, metadata management, and data contracts between producers and consumers help ensure that changes can be made safely, without breaking downstream systems.&lt;/p&gt;
&lt;p&gt;Testing plays a big role here as well. Unit tests for transformations, integration tests for pipeline steps, and data quality checks all contribute to a system that can grow without becoming brittle.&lt;/p&gt;
&lt;h2&gt;Bringing It All Together&lt;/h2&gt;
&lt;p&gt;Scalable pipelines don’t happen by accident. They’re the result of intentional design choices that account for volume, velocity, and variability.&lt;/p&gt;
&lt;p&gt;By embracing parallelism, minimizing data movement, managing resources effectively, and planning for change, data engineers can build pipelines that not only meet today’s demands but are ready for tomorrow’s challenges.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll look at how DevOps principles apply to data engineering—covering CI/CD, infrastructure as code, and the tools that support reliable and automated data deployments.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | DevOps for Data Engineering</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-14/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-14/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>
&lt;p&gt;As data systems grow more complex and interconnected, the principles of DevOps—long applied to software engineering—have become increasingly relevant to data engineering. Continuous integration, infrastructure as code, testing, and automation aren’t just for deploying apps anymore. They’re essential for delivering reliable, maintainable, and scalable data pipelines.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore how DevOps practices translate into the world of data engineering, why they matter, and what tools and techniques help bring them to life in modern data teams.&lt;/p&gt;
&lt;h2&gt;Bridging the Gap Between Code and Data&lt;/h2&gt;
&lt;p&gt;At the heart of DevOps is the idea that development and operations should be integrated. In traditional software development, this means automating the steps from writing code to running it in production. For data engineering, the challenge is similar—but the output isn&apos;t always a user-facing app. Instead, it&apos;s pipelines, transformations, and datasets that power reports, dashboards, and machine learning models.&lt;/p&gt;
&lt;p&gt;The core question becomes: how do we ensure that changes to data workflows are tested, deployed, and monitored with the same rigor as application code?&lt;/p&gt;
&lt;p&gt;The answer lies in adopting DevOps-inspired practices like version control, automated testing, continuous deployment, and infrastructure automation—all tailored to the specifics of data systems.&lt;/p&gt;
&lt;h2&gt;Version Control for Pipelines and Configurations&lt;/h2&gt;
&lt;p&gt;Just like in software engineering, all code that defines your data infrastructure—SQL queries, transformation logic, orchestration DAGs, and even schema definitions—should live in version-controlled repositories.&lt;/p&gt;
&lt;p&gt;This makes it easier to collaborate, review changes, and roll back when something breaks. Tools like Git, combined with platforms like GitHub or GitLab, provide the foundation. Branching strategies and pull requests help teams manage change in a structured, auditable way.&lt;/p&gt;
&lt;p&gt;Even configurations—such as data source definitions or schedule timings—can and should be versioned, ideally alongside the pipeline logic they support.&lt;/p&gt;
&lt;h2&gt;Continuous Integration and Testing&lt;/h2&gt;
&lt;p&gt;Data pipelines are code, and they should be tested like code. This includes unit tests for transformation logic, integration tests for full pipeline runs, and data quality checks that assert assumptions about the shape and content of your data.&lt;/p&gt;
&lt;p&gt;CI pipelines, powered by tools like GitHub Actions, GitLab CI, or Jenkins, can run these tests automatically on each commit or pull request. They ensure that changes don’t break existing functionality or introduce regressions.&lt;/p&gt;
&lt;p&gt;Testing data workflows is more nuanced than testing application logic. It often involves staging environments with synthetic or sample data, mocking external dependencies, and verifying outputs across time windows. But the goal is the same: catch problems early, not after they hit production.&lt;/p&gt;
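&lt;p&gt;As one example of what a CI job might run against sample output, a data quality check can assert assumptions about shape and content. The column names and rules below are illustrative, not tied to any particular framework:&lt;/p&gt;

```python
# Sketch of a data quality check a CI pipeline could run on sample
# output. Column names and rules are illustrative.
def check_quality(rows):
    failures = []
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate ids")
    if any(r["amount"] is None for r in rows):
        failures.append("null amounts")
    if any(r["amount"] is not None and 0 > r["amount"] for r in rows):
        failures.append("negative amounts")
    return failures

good = [{"id": 1, "amount": 5.0}, {"id": 2, "amount": 0.0}]
bad = [{"id": 1, "amount": None}, {"id": 1, "amount": -3.0}]
assert check_quality(good) == []
assert "duplicate ids" in check_quality(bad)
```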
&lt;h2&gt;Infrastructure as Code&lt;/h2&gt;
&lt;p&gt;Managing infrastructure manually—whether it’s a Spark cluster, an Airflow deployment, or a cloud storage bucket—doesn’t scale. Infrastructure as code (IaC) provides a way to define your environment in declarative files that can be versioned, reviewed, and deployed automatically.&lt;/p&gt;
&lt;p&gt;Tools like Terraform, Pulumi, and CloudFormation allow data teams to define compute resources, networking, permissions, and even pipeline configurations as code. Combined with CI/CD, IaC enables repeatable deployments, easier disaster recovery, and consistent environments across dev, staging, and production.&lt;/p&gt;
&lt;p&gt;IaC also helps track infrastructure changes over time. When something breaks, you can look at the exact commit that introduced the change—not just guess what might have gone wrong.&lt;/p&gt;
&lt;h2&gt;Continuous Deployment for Pipelines&lt;/h2&gt;
&lt;p&gt;Once code is tested and approved, it needs to be deployed. Continuous deployment automates this step, pushing new pipeline definitions or transformation logic into production systems with minimal manual intervention.&lt;/p&gt;
&lt;p&gt;In practice, this might mean updating DAGs in Airflow, deploying dbt models, or rolling out new configurations to a Kafka stream processor. The process should include validation steps, such as verifying schema compatibility or testing data output in a sandbox environment before it goes live.&lt;/p&gt;
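&lt;p&gt;A schema compatibility check, for instance, can be as simple as the rule sketched below: the new schema is backward compatible if it preserves every existing column's type and only adds columns. This is one illustrative rule, not a formal compatibility standard:&lt;/p&gt;

```python
# Sketch of a pre-deployment schema compatibility gate. The rule here
# (keep existing column types, only add columns) is illustrative.
def is_backward_compatible(old, new):
    """old/new: dicts mapping column name to type name."""
    return all(new.get(col) == typ for col, typ in old.items())

old = {"id": "int", "amount": "double"}
assert is_backward_compatible(old, {"id": "int", "amount": "double", "note": "string"})
assert not is_backward_compatible(old, {"id": "string", "amount": "double"})
```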
&lt;p&gt;Feature flags and gradual rollouts—techniques borrowed from application development—can also be applied to data. They allow teams to test changes on a subset of data or users before promoting them system-wide.&lt;/p&gt;
&lt;h2&gt;Monitoring and Incident Response&lt;/h2&gt;
&lt;p&gt;Finally, DevOps emphasizes the importance of monitoring and observability. Data pipelines need the same treatment. Logs, metrics, and alerts should provide insight into pipeline health, performance, and failures.&lt;/p&gt;
&lt;p&gt;Tools like Prometheus, Grafana, and cloud-native observability platforms can be integrated with orchestration tools to expose runtime metrics. Custom dashboards can show pipeline durations, success rates, and error counts. Alerts can notify teams when jobs fail or when output data violates expectations.&lt;/p&gt;
&lt;p&gt;Just as importantly, incidents should feed back into improvement. Postmortems, runbooks, and blameless retrospectives help teams learn from failures and evolve their systems.&lt;/p&gt;
&lt;h2&gt;Shifting the Culture&lt;/h2&gt;
&lt;p&gt;Adopting DevOps for data engineering is as much about culture as it is about tools. It means treating data workflows with the same discipline as software systems—building, testing, deploying, and monitoring them in automated, repeatable ways.&lt;/p&gt;
&lt;p&gt;This cultural shift leads to faster iterations, fewer outages, and more confidence in the data products that teams rely on. It also reduces the operational load on engineers, freeing them to focus on value creation instead of firefighting.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll step back and look at the cloud ecosystem that underpins much of this work. Understanding the role of managed services and cloud-native tools is key to building a modern, agile data platform.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Cloud Data Platforms and the Modern Stack</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-15/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-15/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>
&lt;p&gt;The cloud has transformed how organizations approach data engineering. What once required physical servers, manual provisioning, and heavyweight infrastructure can now be spun up in minutes with managed, scalable services. But with this convenience comes complexity—deciding how to compose the right mix of tools and platforms for your data workflows.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what defines the modern data stack, how cloud platforms like AWS, GCP, and Azure fit into the picture, and what principles guide the design of flexible, cloud-native data architectures.&lt;/p&gt;
&lt;h2&gt;Moving Beyond On-Premise&lt;/h2&gt;
&lt;p&gt;In traditional, on-premise data systems, teams had to manage everything themselves—hardware, networking, databases, storage, and backups. Scaling required buying more servers. Upgrades were slow, and experimentation was costly.&lt;/p&gt;
&lt;p&gt;Cloud platforms shifted this model. Infrastructure became elastic. Managed services replaced self-hosted databases and batch processing engines. What used to take weeks could now be done in hours. This shift enabled data engineers to focus more on business logic and less on infrastructure maintenance.&lt;/p&gt;
&lt;p&gt;But while the cloud solved many problems, it also introduced new decisions. With so many tools available, how do you choose the right combination? That’s where the concept of the modern data stack comes in.&lt;/p&gt;
&lt;h2&gt;What Is the Modern Data Stack?&lt;/h2&gt;
&lt;p&gt;The modern data stack refers to a collection of tools—often cloud-native—that work together to support the full data lifecycle: ingestion, transformation, storage, orchestration, and analysis.&lt;/p&gt;
&lt;p&gt;Typically, this stack includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A cloud data warehouse like Snowflake, BigQuery, or Redshift&lt;/li&gt;
&lt;li&gt;An ingestion tool such as Fivetran, Airbyte, or custom streaming connectors&lt;/li&gt;
&lt;li&gt;A transformation framework like dbt&lt;/li&gt;
&lt;li&gt;An orchestration platform like Airflow or Prefect&lt;/li&gt;
&lt;li&gt;BI tools such as Looker, Mode, or Tableau&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tools are designed to be modular and API-driven. You can swap components as your needs evolve, without having to rebuild the entire system. They also tend to embrace SQL, making them accessible to a broader range of users, including analysts and analytics engineers.&lt;/p&gt;
&lt;p&gt;This composability is powerful, but it requires thoughtful integration. Data engineers must understand how data flows across services, how metadata is preserved, and where bottlenecks can emerge.&lt;/p&gt;
&lt;h2&gt;Managed Services in the Cloud&lt;/h2&gt;
&lt;p&gt;Each major cloud provider offers a suite of services tailored to data engineering.&lt;/p&gt;
&lt;p&gt;On &lt;strong&gt;AWS&lt;/strong&gt;, services like S3 (storage), Glue (ETL), Redshift (warehousing), and Kinesis (streaming) form the core building blocks. AWS is known for its breadth and flexibility, making it a strong choice for teams that want control and are comfortable managing complexity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Google Cloud Platform (GCP)&lt;/strong&gt; centers around BigQuery, a serverless, high-performance data warehouse. Paired with Dataflow (streaming and batch processing), Pub/Sub (messaging), and Looker (BI), GCP offers a tight integration between services with a focus on simplicity and scale.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Microsoft Azure&lt;/strong&gt; provides tools like Synapse Analytics, Data Factory, and Event Hubs. It often appeals to enterprise environments already invested in Microsoft’s ecosystem, offering deep integration with Active Directory, Power BI, and other services.&lt;/p&gt;
&lt;p&gt;Each platform brings its own pricing models, performance characteristics, and operational trade-offs. Choosing one often comes down to organizational context—existing infrastructure, skillsets, and vendor relationships.&lt;/p&gt;
&lt;h2&gt;Designing for Agility&lt;/h2&gt;
&lt;p&gt;A key advantage of the cloud is its ability to support experimentation. You can test new tools, build proof-of-concepts, and iterate quickly without long procurement cycles or sunk infrastructure costs.&lt;/p&gt;
&lt;p&gt;This agility enables teams to build for today while planning for tomorrow. For example, a team might start with batch ingestion and transformation using dbt and Airflow. As data needs grow, they can add streaming layers with Kafka and Spark, or move toward a lakehouse architecture using Iceberg and Dremio.&lt;/p&gt;
&lt;p&gt;To design for agility, it’s important to decouple systems where possible. Avoid hard-wiring logic across tools. Use metadata and configuration layers to manage pipeline logic. Embrace standards like Parquet or Arrow to ensure interoperability between tools.&lt;/p&gt;
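&lt;p&gt;The "configuration over code" idea can be sketched as a pipeline whose steps are declared in a config document rather than hard-coded, so behavior changes by editing config instead of logic. The step names and structure below are hypothetical:&lt;/p&gt;

```python
# Sketch: pipeline steps driven by configuration rather than hard-coded
# logic, so behavior changes by editing config. Step names are hypothetical.
import json

CONFIG = json.loads("""
{"steps": [
    {"op": "filter", "column": "status", "equals": "active"},
    {"op": "select", "columns": ["id", "status"]}
]}
""")

def run_pipeline(rows, config):
    for step in config["steps"]:
        if step["op"] == "filter":
            rows = [r for r in rows if r[step["column"]] == step["equals"]]
        elif step["op"] == "select":
            rows = [{c: r[c] for c in step["columns"]} for r in rows]
    return rows

data = [{"id": 1, "status": "active", "x": 9}, {"id": 2, "status": "closed", "x": 5}]
assert run_pipeline(data, CONFIG) == [{"id": 1, "status": "active"}]
```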
&lt;p&gt;Observability and governance also become more important in a distributed cloud environment. Knowing where your data is, how it’s being used, and who has access requires integrated monitoring, logging, and metadata management.&lt;/p&gt;
&lt;h2&gt;The Cloud Is Not Just a Hosting Model&lt;/h2&gt;
&lt;p&gt;Adopting cloud data platforms is not just about moving infrastructure off-premise—it’s about rethinking how teams operate. Cloud-native architectures prioritize scalability, flexibility, and automation.&lt;/p&gt;
&lt;p&gt;They allow you to treat data as a product, with well-defined interfaces, quality guarantees, and ownership. They enable collaboration across roles—engineers, analysts, and scientists—by providing shared platforms and standardized workflows.&lt;/p&gt;
&lt;p&gt;Ultimately, the modern data stack is not a fixed set of tools, but a mindset. It&apos;s about building systems that are composable, observable, and adaptable. It’s about enabling fast iteration without sacrificing reliability.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll shift into the final phase of this series and explore the evolution toward data lakehouse architectures—what they are, why they matter, and how they unify the best of both lakes and warehouses.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Data Lakehouse Architecture Explained</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-16/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-16/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>
&lt;p&gt;Data lakes and data warehouses each brought strengths and limitations to the way organizations manage analytics. Lakes offered flexibility and scale, but lacked consistency and performance. Warehouses delivered speed and structure, but often at the cost of rigidity and duplication. The data lakehouse aims to unify the best of both worlds.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what a data lakehouse is, how it differs from its predecessors, and why it represents a fundamental shift in modern data architecture.&lt;/p&gt;
&lt;h2&gt;The Problem with Separate Systems&lt;/h2&gt;
&lt;p&gt;Historically, data teams maintained two separate systems: a data lake for raw, large-scale data and a warehouse for clean, curated analytics. This split introduced a number of challenges.&lt;/p&gt;
&lt;p&gt;Data had to be copied and transformed between systems. Pipelines became complex and brittle, often requiring multiple processing steps to move data from lake storage into a format usable by the warehouse. Governance and metadata management were fragmented. And teams ended up managing duplicate logic in two places, increasing both cost and risk.&lt;/p&gt;
&lt;p&gt;This led to a common problem: organizations had access to a lot of data, but not in a way that was fully consistent, trustworthy, or timely.&lt;/p&gt;
&lt;h2&gt;What Is a Lakehouse?&lt;/h2&gt;
&lt;p&gt;A lakehouse is a single data architecture that combines the scalability and cost-efficiency of a data lake with the data management features of a warehouse. Instead of maintaining separate systems for raw and curated data, a lakehouse enables you to store all data in one place—typically an object store like S3 or ADLS—while layering in transactional guarantees, schema enforcement, and performance optimizations.&lt;/p&gt;
&lt;p&gt;The core idea is to treat the lake as the foundation, and then build capabilities on top that make it feel like a warehouse: SQL query support, fine-grained access controls, data versioning, and compatibility with BI tools.&lt;/p&gt;
&lt;p&gt;With a lakehouse, you can ingest raw data, apply transformations, and serve both data scientists and business analysts from the same platform—without having to move or duplicate data between systems.&lt;/p&gt;
&lt;h2&gt;Key Capabilities&lt;/h2&gt;
&lt;p&gt;A few innovations make the lakehouse model possible:&lt;/p&gt;
&lt;p&gt;First, &lt;strong&gt;table formats&lt;/strong&gt; like Apache Iceberg and Delta Lake introduce ACID transactions to files stored in data lakes. This means you can safely update, insert, and delete records with consistency, even across distributed systems.&lt;/p&gt;
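The atomic-swap idea behind these table formats can be sketched in a few lines of Python. This is a toy model for intuition only (the class and file layout are invented here, not Iceberg's or Delta Lake's actual implementation): a commit first writes an immutable snapshot file, then atomically repoints a single metadata pointer, so readers never observe a half-written table.

```python
import json, os, tempfile

class ToyTable:
    """Toy illustration of table-format atomicity: readers always see the
    snapshot named by a single metadata pointer, swapped atomically."""
    def __init__(self, root):
        self.root = root
        self.pointer = os.path.join(root, "current-metadata.json")

    def commit(self, rows):
        # Write the new snapshot to its own immutable file first...
        snap_id = len(os.listdir(self.root))
        snap_path = os.path.join(self.root, f"snap-{snap_id}.json")
        with open(snap_path, "w") as f:
            json.dump(rows, f)
        # ...then atomically swap the pointer (os.replace is atomic on POSIX).
        tmp = snap_path + ".ptr"
        with open(tmp, "w") as f:
            f.write(snap_path)
        os.replace(tmp, self.pointer)

    def read(self):
        with open(self.pointer) as f:
            snap_path = f.read()
        with open(snap_path) as f:
            return json.load(f)

root = tempfile.mkdtemp()
t = ToyTable(root)
t.commit([{"id": 1}])
t.commit([{"id": 1}, {"id": 2}])
print(len(t.read()))  # readers only ever see a fully committed snapshot
```

Because old snapshot files are never mutated, this same layout is what makes time travel possible: earlier snapshots remain readable after new commits.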
&lt;p&gt;Second, &lt;strong&gt;query engines&lt;/strong&gt; like Dremio, Trino, and Starburst have matured to the point where they can run fast, complex SQL queries directly against files in the lake—especially when using efficient columnar formats like Parquet.&lt;/p&gt;
&lt;p&gt;Third, metadata and cataloging layers have improved, enabling better schema management, lineage tracking, and discovery across lakehouse tables.&lt;/p&gt;
&lt;p&gt;Together, these advancements bridge the gap between raw storage and structured analytics, making it possible to build a cohesive data platform without compromise.&lt;/p&gt;
&lt;h2&gt;Benefits of the Lakehouse Approach&lt;/h2&gt;
&lt;p&gt;One of the most compelling benefits of a lakehouse is &lt;strong&gt;simplification&lt;/strong&gt;. Instead of building multiple pipelines to synchronize data between systems, teams can work from a single source of truth. This reduces latency, lowers operational complexity, and improves data consistency.&lt;/p&gt;
&lt;p&gt;Lakehouses are also &lt;strong&gt;cost-effective&lt;/strong&gt;. Object storage is cheaper and more scalable than traditional databases. And by avoiding the need to load data into separate warehouses, you eliminate redundant storage and computation.&lt;/p&gt;
&lt;p&gt;From a flexibility standpoint, the lakehouse supports a wide range of use cases—from batch analytics to interactive SQL to machine learning—all from the same underlying data.&lt;/p&gt;
&lt;p&gt;Importantly, the lakehouse model supports &lt;strong&gt;open standards&lt;/strong&gt;. With formats like Iceberg, you’re not locked into a single vendor’s ecosystem. Your data remains portable, and you can build your stack using best-of-breed components.&lt;/p&gt;
&lt;h2&gt;A New Foundation for the Future&lt;/h2&gt;
&lt;p&gt;The data lakehouse is more than a marketing term—it represents a practical response to the needs of modern data teams. As data volumes continue to grow, and as organizations seek faster, more reliable insights, the need for unified, scalable architectures becomes clear.&lt;/p&gt;
&lt;p&gt;By combining the raw power of data lakes with the structure and performance of data warehouses, the lakehouse offers a way to do more with less—less duplication, less movement, and less friction.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll dig deeper into the technologies that make the lakehouse possible, starting with Apache Iceberg, Apache Arrow, and Apache Polaris. These tools form the foundation of many modern analytic platforms and help bring the lakehouse vision to life.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | Apache Iceberg, Arrow, and Polaris</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-17/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-17/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As the data lakehouse ecosystem matures, new technologies are emerging to close the gap between raw, scalable storage and the structured, governed world of traditional analytics. Apache Iceberg, Apache Arrow, and Apache Polaris are three such technologies—each playing a distinct role in enabling high-performance, cloud-native data platforms that prioritize openness, flexibility, and consistency.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore what each of these technologies brings to the table and how they work together to power modern data workflows.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg: The Table Format That Changes Everything&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is more than just a file format—it’s a table format designed to bring SQL-like features to cloud object storage. In traditional data lakes, data is stored in files, but there’s no built-in concept of a table. This makes operations like updates, deletes, or time travel difficult to implement consistently.&lt;/p&gt;
&lt;p&gt;Iceberg solves that by introducing a transactional metadata layer. Tables are made up of snapshots, each pointing to a set of manifest files that describe the underlying data files. Every time data is written or updated, a new snapshot is created, and the metadata is atomically updated.&lt;/p&gt;
&lt;p&gt;This architecture enables reliable schema evolution, partition pruning, and time travel. It also supports concurrent writes across engines, making Iceberg a foundational layer for scalable, multi-engine data platforms.&lt;/p&gt;
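Pruning works because manifests record per-column statistics such as min/max values, letting an engine skip whole data files without opening them. A schematic illustration with invented structures (not Iceberg's real metadata layout):

```python
# Each entry mimics a manifest's per-file column statistics.
manifest = [
    {"file": "a.parquet", "ts_min": 100, "ts_max": 199},
    {"file": "b.parquet", "ts_min": 200, "ts_max": 299},
    {"file": "c.parquet", "ts_min": 300, "ts_max": 399},
]

def prune(manifest, lo, hi):
    """Keep only files whose [ts_min, ts_max] range overlaps [lo, hi]."""
    return [e["file"] for e in manifest
            if e["ts_max"] >= lo and hi >= e["ts_min"]]

print(prune(manifest, 250, 320))  # only b and c can contain matching rows
```

For a query filtering on `ts` between 250 and 320, file `a.parquet` is eliminated from the scan using metadata alone.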
&lt;p&gt;Importantly, Iceberg is engine-agnostic. Spark, Flink, Trino, Snowflake, and Dremio all support reading and writing to Iceberg tables, which allows data teams to avoid vendor lock-in and build modular systems.&lt;/p&gt;
&lt;h2&gt;Apache Arrow: A Universal Memory Format&lt;/h2&gt;
&lt;p&gt;If Iceberg handles data at rest, Apache Arrow handles data in motion. Arrow is a columnar in-memory format optimized for analytical processing. It allows systems to share data across process boundaries without serialization overhead, which dramatically reduces latency in data transfers.&lt;/p&gt;
&lt;p&gt;In practice, Arrow powers faster execution of queries, especially in environments where performance is critical. Engines like Dremio and frameworks like pandas or Apache Flight use Arrow to move data between components efficiently.&lt;/p&gt;
&lt;p&gt;Because Arrow defines a common representation for tabular data in memory, it allows tools built in different languages and frameworks to interoperate seamlessly. That’s a big deal in heterogeneous environments where Python, Java, and C++ may all play a role in the same workflow.&lt;/p&gt;
&lt;p&gt;Together, Iceberg and Arrow represent a powerful separation of concerns: Arrow optimizes processing in RAM, while Iceberg provides the transactional storage layer on disk.&lt;/p&gt;
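The columnar idea can be illustrated without Arrow itself: storing a table as one contiguous array per column means a single-column aggregate touches only that array. A pure-Python schematic (Arrow's real format adds fixed-width buffers, validity bitmaps, and zero-copy sharing across processes):

```python
# Row-oriented: each record is a dict; a column scan visits every object.
rows = [{"user": "a", "amount": 10},
        {"user": "b", "amount": 25},
        {"user": "c", "amount": 5}]

# Column-oriented: one array per column; a scan reads one array sequentially.
columns = {"user": ["a", "b", "c"], "amount": [10, 25, 5]}

row_total = sum(r["amount"] for r in rows)   # touches every row object
col_total = sum(columns["amount"])           # touches one array only
print(row_total, col_total)  # 40 40
```

Both layouts answer the same query, but the columnar one reads a single contiguous array, which is what makes vectorized, cache-friendly execution possible.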
&lt;h2&gt;Apache Polaris: The Missing Catalog Layer&lt;/h2&gt;
&lt;p&gt;As Iceberg adoption grows, managing Iceberg tables across distributed query engines becomes a challenge. That’s where Apache Polaris comes in.&lt;/p&gt;
&lt;p&gt;Polaris is an implementation of the Apache Iceberg REST catalog specification. It provides a centralized service for managing metadata about Iceberg tables and their organizational structure. Instead of having every engine implement its own catalog logic, Polaris provides a shared layer that orchestrates access across tools like Spark, Flink, Trino, and Snowflake.&lt;/p&gt;
&lt;p&gt;At the heart of Polaris is the concept of a &lt;strong&gt;catalog&lt;/strong&gt;—a logical container for Iceberg tables, configured to point to your cloud storage. Polaris supports both internal and external catalogs. Internal catalogs are fully managed within Polaris, while external catalogs sync with systems like Snowflake or Dremio Arctic. This flexibility lets you bring your existing Iceberg assets under centralized governance without locking them in.&lt;/p&gt;
&lt;p&gt;Polaris organizes tables into &lt;strong&gt;namespaces&lt;/strong&gt;, which are essentially folders within a catalog. These namespaces can be nested to reflect organizational or project hierarchies. Within a namespace, you register Iceberg tables, which can then be accessed by multiple engines through a consistent API.&lt;/p&gt;
&lt;p&gt;To connect to Polaris, engines use &lt;strong&gt;service principals&lt;/strong&gt;—authenticated entities with specific privileges. These principals are grouped into &lt;strong&gt;principal roles&lt;/strong&gt;, which receive access rights from &lt;strong&gt;catalog roles&lt;/strong&gt;. This role-based access control (RBAC) system allows for fine-grained security across catalogs, namespaces, and tables.&lt;/p&gt;
&lt;p&gt;What makes Polaris especially powerful is its ability to vend &lt;strong&gt;temporary credentials&lt;/strong&gt; during query execution. When a query runs, Polaris provides secure access to the underlying storage without exposing long-term cloud credentials. This mechanism, known as credential vending, ensures both security and operational flexibility.&lt;/p&gt;
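Credential vending can be sketched as a function that exchanges a principal's catalog-role privileges for a short-lived, narrowly scoped storage token. Everything below (role names, fields, grant table) is a hypothetical illustration of the pattern, not Polaris's actual API:

```python
import time, secrets

# Hypothetical role grants: which storage prefixes a catalog role may read.
CATALOG_ROLE_GRANTS = {
    "analyst_role": ["s3://lake/sales/"],
    "admin_role": ["s3://lake/"],
}

def vend_credential(principal_role, table_location, ttl_seconds=900):
    """Return a temporary, prefix-scoped token instead of long-lived keys."""
    prefixes = CATALOG_ROLE_GRANTS.get(principal_role, [])
    if not any(table_location.startswith(p) for p in prefixes):
        raise PermissionError(f"{principal_role} may not access {table_location}")
    return {
        "token": secrets.token_hex(8),           # stand-in for an STS token
        "scope": table_location,                 # valid only for this prefix
        "expires_at": time.time() + ttl_seconds  # short-lived by design
    }

cred = vend_credential("analyst_role", "s3://lake/sales/orders/")
print(cred["scope"])
```

The engine running the query receives only the scoped, expiring token; the long-lived cloud credentials never leave the catalog service.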
&lt;h2&gt;A Unified Ecosystem&lt;/h2&gt;
&lt;p&gt;Together, Apache Iceberg, Arrow, and Polaris create a cohesive environment where data can be stored, processed, and accessed consistently and securely—regardless of the engine being used.&lt;/p&gt;
&lt;p&gt;Iceberg brings data warehouse-like capabilities to cloud storage. Arrow enables high-performance, memory-efficient processing across languages and systems. Polaris acts as the control plane, coordinating access and governance.&lt;/p&gt;
&lt;p&gt;This architecture aligns with the ideals of the data lakehouse: open standards, decoupled compute and storage, and interoperability across tools. By building on these technologies, organizations can future-proof their data platforms while empowering teams to work with the tools they prefer.&lt;/p&gt;
&lt;p&gt;In the next and final post in this series, we’ll look at Dremio—a platform that ties these components together to deliver interactive, self-service analytics directly on the data lake, without moving data or duplicating logic.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Data Engineering Concepts | The Power of Dremio in the Modern Lakehouse</title><link>https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-18/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-05-intro-to-data-engineering-concepts-18/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 02 May 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-polaris-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_to_de&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Polaris: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As organizations shift toward data lakehouse architectures, the question isn’t just how to store massive volumes of data—it’s how to optimize it for fast, reliable access without adding complexity or operational overhead. Dremio addresses this challenge head-on by combining performance, governance, and openness into a platform built natively on Apache Iceberg, Apache Arrow, and Apache Polaris.&lt;/p&gt;
&lt;p&gt;In this final post of our series, we’ll explore how Dremio ties together the technologies we&apos;ve discussed—like clustering, reflections, and cataloging—into an integrated solution for modern data engineering. We’ll cover what makes Dremio unique, how its latest innovations like Iceberg Clustering and Autonomous Reflections work, and why these capabilities are a breakthrough for data teams aiming to do more with less.&lt;/p&gt;
&lt;h2&gt;Built for the Modern Stack&lt;/h2&gt;
&lt;p&gt;Dremio isn&apos;t just a SQL engine—it’s a full data platform built for the lakehouse era. It operates directly on data stored in open formats like Parquet and Iceberg, using Apache Arrow for in-memory performance and Apache Polaris for metadata management and governance. The result is a platform that offers sub-second queries, native support for open standards, and a unified experience across ingestion, transformation, exploration, and security.&lt;/p&gt;
&lt;p&gt;Instead of requiring teams to move data into a proprietary warehouse, Dremio enables query federation across lakes, catalogs, and traditional databases. Whether your data lives in S3, GCS, Azure, or multiple warehouses, Dremio can connect, query, and govern it—all without duplication or data movement.&lt;/p&gt;
&lt;p&gt;But what truly sets Dremio apart is its focus on intelligent automation and data layout optimization. Let’s break down how these features work.&lt;/p&gt;
&lt;h2&gt;Iceberg Clustering: Smarter Data Organization&lt;/h2&gt;
&lt;p&gt;As datasets grow, traditional partitioning strategies fall short. Over-partitioning leads to a flood of small files. Under-partitioning causes massive scan overhead. Dremio introduces Iceberg Clustering to address this gap.&lt;/p&gt;
&lt;p&gt;Instead of dividing data into rigid partitions, clustering organizes rows based on column value proximity using Z-ordering, a type of space-filling curve. This technique braids together bits from multiple columns to form an index that preserves locality. The closer the index values, the closer the original rows were in value space—making it easier for the engine to skip irrelevant data.&lt;/p&gt;
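The bit-braiding step can be shown directly. This toy function interleaves the bits of two column values into a single Z-order key; real engines handle many columns and normalize values first, so treat this as a sketch of the idea only:

```python
def z_order_key(x, y):
    """Interleave the bits of two non-negative ints into one Z-order index.
    Rows whose (x, y) values are close tend to get close keys."""
    key, place = 0, 1
    while x or y:
        x, x_bit = divmod(x, 2)   # peel off the lowest bit of each value
        y, y_bit = divmod(y, 2)
        key += (x_bit + 2 * y_bit) * place
        place *= 4                # two bits consumed per iteration
    return key

# Sorting rows by their Z-order key keeps value-space neighbors near each
# other in the file, which is what lets the engine skip distant data.
points = [(7, 7), (0, 0), (1, 1), (0, 1), (1, 0)]
print(sorted(points, key=lambda p: z_order_key(p[0], p[1])))
```

Note how `(0, 0)`, `(1, 0)`, `(0, 1)`, and `(1, 1)` all land on consecutive keys 0 through 3, while the distant point `(7, 7)` maps far away at 63.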
&lt;p&gt;By clustering non-partitioned tables, Dremio can dramatically reduce the number of data files and row groups scanned during queries. The result: faster performance without the rigidity or complexity of traditional partitioning.&lt;/p&gt;
&lt;p&gt;This process is incremental and adaptive. Dremio monitors data file overlap (measured via clustering depth) and selectively rewrites files to restore efficient layout. You don’t have to re-cluster everything or worry about perfect partition granularity—Dremio handles it dynamically and intelligently.&lt;/p&gt;
&lt;h2&gt;Autonomous Reflections: AI for Query Optimization&lt;/h2&gt;
&lt;p&gt;Materialized views are great—until you have to decide which ones to create, maintain, and drop. Dremio automates this process with Autonomous Reflections, which monitor your workloads, identify performance bottlenecks, and generate pre-aggregated or pre-filtered views to accelerate queries.&lt;/p&gt;
&lt;p&gt;The system analyzes usage patterns and query plans, scores potential reflections based on estimated time savings, and creates only those that deliver meaningful impact. It even keeps them up to date using live metadata refresh and incremental updates, ensuring performance gains without sacrificing freshness.&lt;/p&gt;
&lt;p&gt;Reflections are created, scored, and dropped automatically based on cost-benefit analysis, with strict guardrails to avoid wasting resources. This isn’t just automation—it’s intelligent, usage-aware optimization.&lt;/p&gt;
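A cost-benefit gate of this kind can be sketched in a few lines. The scoring formula, weights, and fields below are invented for illustration and are not Dremio's actual model:

```python
def reflection_score(query_count, avg_seconds_saved, refresh_cost_seconds):
    """Estimated net benefit of materializing a candidate reflection:
    total time saved across the workload minus the cost of keeping it fresh."""
    return query_count * avg_seconds_saved - refresh_cost_seconds

candidates = [
    {"name": "daily_sales_agg", "score": reflection_score(500, 4.0, 120)},
    {"name": "rare_adhoc_join", "score": reflection_score(2, 10.0, 600)},
]
# Only candidates with a clearly positive score would be materialized.
keep = [c["name"] for c in candidates if c["score"] > 0]
print(keep)  # ['daily_sales_agg']
```

The heavily reused aggregation easily pays for its refresh cost, while the rarely run ad hoc join does not, so it never gets materialized.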
&lt;p&gt;With Dremio’s Autonomous Reflections, query acceleration becomes invisible to the user. Queries run faster, dashboards load quicker, and teams no longer need to guess which workloads justify a materialized view. The platform adapts as your usage changes.&lt;/p&gt;
&lt;h2&gt;Governance and Discoverability with Polaris&lt;/h2&gt;
&lt;p&gt;Managing Iceberg tables at scale requires more than just metadata tracking—it requires unified governance. Dremio’s integration with Apache Polaris gives teams a central catalog that enforces access controls, tracks lineage, and supports multi-engine access through open REST protocols.&lt;/p&gt;
&lt;p&gt;Whether you’re using Spark, Trino, Flink, or Dremio itself, Polaris provides a consistent layer for managing catalogs, namespaces, and Iceberg tables. Service principals and RBAC ensure secure access, while credential vending allows query engines to read data without exposing cloud credentials.&lt;/p&gt;
&lt;p&gt;By offering a unified metastore for all your Iceberg assets, Polaris makes it easier to scale governance and integrate with diverse compute engines, all while maintaining data sovereignty and visibility.&lt;/p&gt;
&lt;h2&gt;AI-Ready Data, Out of the Box&lt;/h2&gt;
&lt;p&gt;As data volumes soar and AI workloads increase, organizations need data platforms that deliver speed and clarity—not maintenance overhead. Dremio’s new features don’t just optimize query performance; they also support AI and analytics with intelligent automation, semantic search, and unified metadata.&lt;/p&gt;
&lt;p&gt;AI-Enabled Semantic Search lets users discover datasets using plain language, not SQL. This reduces time spent hunting for data and accelerates exploration for analysts and data scientists alike. Combined with reflections and clustering, the platform ensures these queries return results fast.&lt;/p&gt;
&lt;p&gt;And because Dremio is built on open standards—Iceberg, Arrow, and Polaris—you can trust that your data architecture will remain portable, interoperable, and vendor-neutral.&lt;/p&gt;
&lt;h2&gt;Real-World Results&lt;/h2&gt;
&lt;p&gt;Dremio has already demonstrated the power of this approach internally. After deploying clustering and autonomous reflections across its own internal lakehouse, Dremio saw:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;80% of dashboards accelerated automatically&lt;/li&gt;
&lt;li&gt;10x reduction in 90th percentile query times&lt;/li&gt;
&lt;li&gt;30x improvement in CPU efficiency per query&lt;/li&gt;
&lt;li&gt;Substantial infrastructure savings by right-sizing compute resources&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These improvements weren’t the result of hand-tuning or custom engineering. They were achieved through intelligent automation—something every team can now access.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Data lakehouses offer unmatched flexibility, but performance and manageability have long remained pain points. With features like Iceberg Clustering, Autonomous Reflections, and Polaris Catalog, Dremio turns the lakehouse into a high-performance, governed, and self-optimizing platform.&lt;/p&gt;
&lt;p&gt;For data engineers, this means fewer manual interventions, faster time-to-insight, and greater confidence in how data is delivered. For analysts and AI teams, it means sub-second queries and easy access to the data they need—no pipeline delays, no tuning required.&lt;/p&gt;
&lt;p&gt;As the final stop in this series, Dremio represents the culmination of modern data engineering principles: openness, automation, and efficiency. If you&apos;re building on Iceberg and want to unlock its full potential, Dremio offers a platform designed not just to support your architecture, but to elevate it.&lt;/p&gt;
&lt;p&gt;To see it in action, try Dremio for free or explore the latest launch to learn how these capabilities can help your team build a faster, smarter lakehouse.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 10 - Sampling and Prompts in MCP — Making Agent Workflows Smarter and Safer</title><link>https://iceberglakehouse.com/posts/2025-04-sampling-and-prompts-in-mcp/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-sampling-and-prompts-in-mcp/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Mon, 14 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve now seen how the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; allows LLMs to read resources and call tools—giving them access to both data and action.&lt;/p&gt;
&lt;p&gt;But what if your &lt;strong&gt;MCP server&lt;/strong&gt; needs the LLM to make a decision?&lt;/p&gt;
&lt;p&gt;What if it needs to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Analyze a file before running a tool?&lt;/li&gt;
&lt;li&gt;Draft a message for approval?&lt;/li&gt;
&lt;li&gt;Ask the model to choose between options?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s where &lt;strong&gt;Sampling&lt;/strong&gt; comes in.&lt;/p&gt;
&lt;p&gt;And what if you want to give the user—or the LLM—reusable, structured prompt templates for common workflows?&lt;/p&gt;
&lt;p&gt;That’s where &lt;strong&gt;Prompts&lt;/strong&gt; come in.&lt;/p&gt;
&lt;p&gt;In this final post of the series, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How &lt;strong&gt;sampling&lt;/strong&gt; allows servers to request completions from LLMs&lt;/li&gt;
&lt;li&gt;How &lt;strong&gt;prompts&lt;/strong&gt; enable reusable, guided AI interactions&lt;/li&gt;
&lt;li&gt;Best practices for both features&lt;/li&gt;
&lt;li&gt;Real-world use cases that combine everything we’ve covered so far&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Is Sampling in MCP?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Sampling&lt;/strong&gt; is the ability for an MCP server to ask the host to run an LLM completion—on behalf of a tool, prompt, or workflow.&lt;/p&gt;
&lt;p&gt;It lets your server say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Hey, LLM, here’s a prompt and some context. Please respond.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Why is this useful?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You can &lt;strong&gt;generate intermediate reasoning steps&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Let the model &lt;strong&gt;propose actions&lt;/strong&gt; before executing them&lt;/li&gt;
&lt;li&gt;Create more natural &lt;strong&gt;multi-turn agent workflows&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Maintain human-in-the-loop &lt;strong&gt;approval and visibility&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Sampling Flow&lt;/h2&gt;
&lt;p&gt;Here’s the typical lifecycle:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The server sends a &lt;code&gt;sampling/createMessage&lt;/code&gt; request&lt;/li&gt;
&lt;li&gt;The host (Claude Desktop, etc.) can &lt;strong&gt;review or modify&lt;/strong&gt; the prompt&lt;/li&gt;
&lt;li&gt;The host runs the LLM completion&lt;/li&gt;
&lt;li&gt;The result is sent back to the server&lt;/li&gt;
&lt;/ol&gt;
&lt;blockquote&gt;
&lt;p&gt;This architecture puts &lt;strong&gt;control and visibility in the hands of the user&lt;/strong&gt;, even when the agent logic runs server-side.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;✉️ Message Format&lt;/h2&gt;
&lt;p&gt;Here’s an example &lt;code&gt;sampling/createMessage&lt;/code&gt; request:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;messages&amp;quot;: [
    {
      &amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
      &amp;quot;content&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
        &amp;quot;text&amp;quot;: &amp;quot;Please summarize this log file.&amp;quot;
      }
    }
  ],
  &amp;quot;systemPrompt&amp;quot;: &amp;quot;You are a helpful developer assistant.&amp;quot;,
  &amp;quot;includeContext&amp;quot;: &amp;quot;thisServer&amp;quot;,
  &amp;quot;maxTokens&amp;quot;: 300
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The host chooses which model to use, what context to include, and whether to show the prompt to the user for confirmation.&lt;/p&gt;
&lt;p&gt;Response:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;model&amp;quot;: &amp;quot;claude-3-sonnet&amp;quot;,
  &amp;quot;role&amp;quot;: &amp;quot;assistant&amp;quot;,
  &amp;quot;content&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
    &amp;quot;text&amp;quot;: &amp;quot;The log file contains several timeout errors and warnings related to database connections.&amp;quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the server can act on that response—log it, return it as tool output, or chain it into another step.&lt;/p&gt;
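On the server side, issuing that request amounts to assembling the payload shown above and handing it to the host. A schematic sketch of building the message body in Python (a real server would go through an MCP SDK rather than hand-constructing JSON-RPC):

```python
import json

def build_sampling_request(prompt_text, system_prompt, max_tokens=300):
    """Assemble a sampling/createMessage payload like the example above."""
    return {
        "method": "sampling/createMessage",
        "params": {
            "messages": [
                {"role": "user",
                 "content": {"type": "text", "text": prompt_text}}
            ],
            "systemPrompt": system_prompt,
            "includeContext": "thisServer",
            "maxTokens": max_tokens,
        },
    }

req = build_sampling_request("Please summarize this log file.",
                             "You are a helpful developer assistant.")
print(json.dumps(req, indent=2))
```

Keeping the construction in one helper makes it easy to enforce the best practices below, such as token limits and minimal context, in a single place.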
&lt;h3&gt;Best Practices for Sampling&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Best Practice&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Use clear system prompts&lt;/td&gt;
&lt;td&gt;Guides model behavior contextually&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Limit tokens&lt;/td&gt;
&lt;td&gt;Prevents runaway completions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structure responses&lt;/td&gt;
&lt;td&gt;Enables downstream parsing (e.g. JSON, bullets)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Include only relevant context&lt;/td&gt;
&lt;td&gt;Keeps prompts focused and cost-effective&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Respect user control&lt;/td&gt;
&lt;td&gt;The host mediates the actual LLM call&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;What Are Prompts in MCP?&lt;/h2&gt;
&lt;p&gt;Prompts are reusable, structured templates that servers can expose to clients.&lt;/p&gt;
&lt;p&gt;Think of them like slash commands or predefined workflows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Pre-filled with helpful defaults&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Accept arguments (e.g. &amp;quot;project name&amp;quot;, &amp;quot;file path&amp;quot;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Optionally include embedded resources&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Surface in the client UI&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prompts help users and LLMs collaborate efficiently by standardizing useful tasks.&lt;/p&gt;
&lt;h3&gt;✨ Prompt Structure&lt;/h3&gt;
&lt;p&gt;Prompts have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A name (identifier)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A description (for discovery)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A list of arguments (optional)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A template for generating messages&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;explain-code&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Explain how this code works&amp;quot;,
  &amp;quot;arguments&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: &amp;quot;language&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Programming language&amp;quot;,
      &amp;quot;required&amp;quot;: true
    },
    {
      &amp;quot;name&amp;quot;: &amp;quot;code&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;The code to analyze&amp;quot;,
      &amp;quot;required&amp;quot;: true
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clients use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;prompts/list&lt;/code&gt; to discover prompts&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;prompts/get&lt;/code&gt; to resolve a prompt and arguments into messages&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
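Server-side, handling these two calls can be as simple as a registry plus a resolver that turns a prompt definition and its arguments into concrete messages. A minimal sketch with an invented template format (an MCP SDK would supply the actual wire handling):

```python
# Hypothetical in-memory prompt registry mirroring the definition above.
PROMPTS = {
    "explain-code": {
        "description": "Explain how this code works",
        "arguments": ["language", "code"],
        "template": "Explain this {language} code:\n\n{code}",
    }
}

def list_prompts():
    """Back prompts/list: advertise available prompt names."""
    return sorted(PROMPTS)

def get_prompt(name, args):
    """Back prompts/get: resolve a template and arguments into messages."""
    prompt = PROMPTS[name]
    missing = [a for a in prompt["arguments"] if a not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    text = prompt["template"].format(**args)
    return [{"role": "user", "content": {"type": "text", "text": text}}]

msgs = get_prompt("explain-code", {"language": "python", "code": "print(1)"})
print(msgs[0]["content"]["text"])
```

Because resolution happens on the server, the same prompt definition can be surfaced in any client UI while the server controls how arguments become messages.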
&lt;h3&gt;Dynamic Prompt Example&lt;/h3&gt;
&lt;p&gt;A server might expose:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;analyze-logs&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Summarize recent logs and detect anomalies&amp;quot;,
  &amp;quot;arguments&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: &amp;quot;timeframe&amp;quot;,
      &amp;quot;required&amp;quot;: true
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the user (or LLM) runs it with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;timeframe&amp;quot;: &amp;quot;1h&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resolved prompt could include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A message like: &lt;code&gt;“Please summarize the following logs from the past hour.”&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An embedded resource (e.g. &lt;code&gt;logs://recent?timeframe=1h&lt;/code&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Output ready for sampling&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
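&lt;p&gt;Put together, the resolved messages could carry the log data as an embedded resource (a sketch; the &lt;code&gt;logs://&lt;/code&gt; URI and the log line are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;messages&amp;quot;: [
    {
      &amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
      &amp;quot;content&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
        &amp;quot;text&amp;quot;: &amp;quot;Please summarize the following logs from the past hour.&amp;quot;
      }
    },
    {
      &amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
      &amp;quot;content&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;resource&amp;quot;,
        &amp;quot;resource&amp;quot;: {
          &amp;quot;uri&amp;quot;: &amp;quot;logs://recent?timeframe=1h&amp;quot;,
          &amp;quot;mimeType&amp;quot;: &amp;quot;text/plain&amp;quot;,
          &amp;quot;text&amp;quot;: &amp;quot;12:01:33 ERROR auth token rejected...&amp;quot;
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;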
&lt;h3&gt;Sampling + Prompts = Dynamic Workflows&lt;/h3&gt;
&lt;p&gt;When you combine prompts + sampling + tools, you unlock real agent behavior.&lt;/p&gt;
&lt;p&gt;Example Workflow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;User selects prompt: &amp;quot;Analyze logs and suggest next steps&amp;quot;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Server resolves the prompt and calls &lt;code&gt;sampling/createMessage&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LLM returns: “The logs show repeated auth failures. Suggest checking OAuth config.”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Server calls &lt;code&gt;tools/call&lt;/code&gt; to run &lt;code&gt;check_auth_config&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;LLM reviews the result and writes a summary&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All controlled via:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Standardized MCP messages&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;User-visible approvals&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modular server logic&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
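&lt;p&gt;The sampling step in that workflow could be a &lt;code&gt;sampling/createMessage&lt;/code&gt; request whose params look roughly like this (a minimal sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;messages&amp;quot;: [
    {
      &amp;quot;role&amp;quot;: &amp;quot;user&amp;quot;,
      &amp;quot;content&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
        &amp;quot;text&amp;quot;: &amp;quot;Summarize these logs and suggest next steps.&amp;quot;
      }
    }
  ],
  &amp;quot;systemPrompt&amp;quot;: &amp;quot;You are a careful log analyst.&amp;quot;,
  &amp;quot;maxTokens&amp;quot;: 500
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The host can surface this request for user approval before any tokens are spent.&lt;/p&gt;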
&lt;h3&gt;🔐 Security and Control&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;How It&apos;s Handled&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt visibility&lt;/td&gt;
&lt;td&gt;Clients decide which prompts to expose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sampling review&lt;/td&gt;
&lt;td&gt;Hosts can show/reject sampling requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input validation&lt;/td&gt;
&lt;td&gt;Servers validate prompt arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model usage control&lt;/td&gt;
&lt;td&gt;Hosts select models and limit token costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt injection risks&lt;/td&gt;
&lt;td&gt;Validate user inputs, escape content if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h3&gt;🧠 Why These Matter for AI Agents&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Sampling Provides&lt;/th&gt;
&lt;th&gt;Prompts Provide&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decision-making&lt;/td&gt;
&lt;td&gt;Dynamic LLM completions&lt;/td&gt;
&lt;td&gt;Guided, structured input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;td&gt;Server can request help anytime&lt;/td&gt;
&lt;td&gt;Users can run reusable workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactivity&lt;/td&gt;
&lt;td&gt;Chain actions with feedback&lt;/td&gt;
&lt;td&gt;Improve LLM collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Composability&lt;/td&gt;
&lt;td&gt;Mix prompts + tools + resources&lt;/td&gt;
&lt;td&gt;Enable custom interfaces&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h3&gt;🧩 Wrapping It All Together&lt;/h3&gt;
&lt;p&gt;Over this 10-part series, we’ve explored the full landscape of AI agent development using &lt;strong&gt;MCP&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;✅ LLMs and how they work&lt;br&gt;
✅ Fine-tuning, prompting, and RAG&lt;br&gt;
✅ Agent frameworks and limitations&lt;br&gt;
✅ MCP’s architecture and interoperability&lt;br&gt;
✅ Resources and tools&lt;br&gt;
✅ Prompts and sampling&lt;/p&gt;
&lt;p&gt;MCP gives us standardized, modular building blocks for creating AI agents that are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Portable across environments&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decoupled from model providers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secure, observable, and controlled&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 9 - Tools in MCP — Giving LLMs the Power to Act</title><link>https://iceberglakehouse.com/posts/2025-04-tools-in-mcp/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-tools-in-mcp/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Sun, 13 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the previous post, we looked at &lt;strong&gt;Resources&lt;/strong&gt; in the Model Context Protocol (MCP): how LLMs can securely access real-world data to ground their understanding. But sometimes, &lt;em&gt;reading&lt;/em&gt; isn’t enough.&lt;/p&gt;
&lt;p&gt;Sometimes, you want the model to &lt;strong&gt;do something&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;That’s where &lt;strong&gt;Tools&lt;/strong&gt; in MCP come in.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What tools are in MCP&lt;/li&gt;
&lt;li&gt;How tools are discovered and invoked&lt;/li&gt;
&lt;li&gt;How LLMs can use tools (with user control)&lt;/li&gt;
&lt;li&gt;Common tool patterns and security practices&lt;/li&gt;
&lt;li&gt;Real-world examples: from file system commands to API wrappers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s dive in.&lt;/p&gt;
&lt;h2&gt;What Are Tools in MCP?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; are executable functions that an LLM (or the user) can call via the MCP client. Unlike resources—which are passive data—&lt;strong&gt;tools are active operations&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Running a shell command&lt;/li&gt;
&lt;li&gt;Calling a REST API&lt;/li&gt;
&lt;li&gt;Summarizing a document&lt;/li&gt;
&lt;li&gt;Posting a GitHub issue&lt;/li&gt;
&lt;li&gt;Triggering a build process&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each tool includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;name&lt;/strong&gt; (unique identifier)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;description&lt;/strong&gt; (for UI/model understanding)&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;input schema&lt;/strong&gt; (JSON schema describing expected parameters)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;Tools allow models to interact with the world beyond natural language—under user oversight.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Discovering Tools&lt;/h2&gt;
&lt;p&gt;Clients can list available tools via:
&lt;code&gt;tools/list&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Example response:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;tools&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: &amp;quot;calculate_sum&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Add two numbers together&amp;quot;,
      &amp;quot;inputSchema&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
        &amp;quot;properties&amp;quot;: {
          &amp;quot;a&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;number&amp;quot; },
          &amp;quot;b&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;number&amp;quot; }
        },
        &amp;quot;required&amp;quot;: [&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;]
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows clients (and LLMs) to decide which tools are available and how to call them properly.&lt;/p&gt;
&lt;h2&gt;⚙️ Calling a Tool&lt;/h2&gt;
&lt;p&gt;To execute a tool, the client sends:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;tools/call
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this payload:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;calculate_sum&amp;quot;,
  &amp;quot;arguments&amp;quot;: {
    &amp;quot;a&amp;quot;: 3,
    &amp;quot;b&amp;quot;: 5
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server responds with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;content&amp;quot;: [
    {
      &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
      &amp;quot;text&amp;quot;: &amp;quot;8&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That’s it! The LLM can now use this output in a multi-step reasoning chain.&lt;/p&gt;
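&lt;p&gt;On the wire, this is an ordinary JSON-RPC 2.0 request (MCP&apos;s message format), so the full payload looks roughly like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;jsonrpc&amp;quot;: &amp;quot;2.0&amp;quot;,
  &amp;quot;id&amp;quot;: 1,
  &amp;quot;method&amp;quot;: &amp;quot;tools/call&amp;quot;,
  &amp;quot;params&amp;quot;: {
    &amp;quot;name&amp;quot;: &amp;quot;calculate_sum&amp;quot;,
    &amp;quot;arguments&amp;quot;: { &amp;quot;a&amp;quot;: 3, &amp;quot;b&amp;quot;: 5 }
  }
}
&lt;/code&gt;&lt;/pre&gt;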
&lt;h3&gt;Model-Controlled Tool Use&lt;/h3&gt;
&lt;p&gt;Tools are designed to be invoked by models automatically. The host mediates this interaction with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Approval flows (user-in-the-loop)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Permission gating&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Logging and auditing&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is what enables “agentic behavior.” For example:&lt;/p&gt;
&lt;p&gt;Claude sees a CSV file and decides to call &lt;code&gt;analyze_csv&lt;/code&gt; to compute averages—without a user explicitly requesting it.&lt;/p&gt;
&lt;h3&gt;Tool Design Patterns&lt;/h3&gt;
&lt;p&gt;Let’s look at some common and powerful tool types:&lt;/p&gt;
&lt;h4&gt;System Tools&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;run_command&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Execute a shell command&amp;quot;,
  &amp;quot;inputSchema&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
    &amp;quot;properties&amp;quot;: {
      &amp;quot;command&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
      &amp;quot;args&amp;quot;: {
        &amp;quot;type&amp;quot;: &amp;quot;array&amp;quot;,
        &amp;quot;items&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; }
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use case: Let the LLM grep a log file, or check system uptime.&lt;/p&gt;
&lt;h4&gt;API Integrations&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;create_github_issue&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Open a new issue on GitHub&amp;quot;,
  &amp;quot;inputSchema&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
    &amp;quot;properties&amp;quot;: {
      &amp;quot;repo&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
      &amp;quot;title&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; },
      &amp;quot;body&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use case: Let an AI dev assistant file bugs or suggest changes.&lt;/p&gt;
&lt;h4&gt;Data Analysis&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;name&amp;quot;: &amp;quot;summarize_csv&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Summarize a CSV file&amp;quot;,
  &amp;quot;inputSchema&amp;quot;: {
    &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
    &amp;quot;properties&amp;quot;: {
      &amp;quot;filepath&amp;quot;: { &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot; }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use case: Let the LLM analyze performance metrics or user data.&lt;/p&gt;
&lt;h4&gt;Security Best Practices&lt;/h4&gt;
&lt;p&gt;Giving LLMs the ability to take action means security is critical. Here’s how to stay safe:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Validate all input&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use detailed JSON Schemas&lt;/li&gt;
&lt;li&gt;Sanitize input (e.g., file paths, commands)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Use access controls&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gate sensitive tools behind roles&lt;/li&gt;
&lt;li&gt;Allow user opt-in or approval&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Log and monitor usage&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Track which tools are used, with what arguments&lt;/li&gt;
&lt;li&gt;Log errors and output for audit trails&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Handle errors gracefully&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Return structured errors inside the result, not just raw exceptions. This helps the LLM adapt.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;isError&amp;quot;: true,
  &amp;quot;content&amp;quot;: [
    {
      &amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;,
      &amp;quot;text&amp;quot;: &amp;quot;Error: File not found.&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
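&lt;p&gt;A server can apply these checks before execution. The sketch below validates arguments against the &lt;code&gt;calculate_sum&lt;/code&gt; schema in plain Python—a real server would more likely hand this to a JSON Schema library, and &lt;code&gt;validate_args&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def validate_args(schema, args):
    &amp;quot;&amp;quot;&amp;quot;Return an MCP-style structured error result, or None if args are valid.&amp;quot;&amp;quot;&amp;quot;
    # Check that all required arguments are present
    for name in schema.get(&amp;quot;required&amp;quot;, []):
        if name not in args:
            return {&amp;quot;isError&amp;quot;: True, &amp;quot;content&amp;quot;: [
                {&amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;, &amp;quot;text&amp;quot;: f&amp;quot;Error: missing required argument {name!r}.&amp;quot;}]}
    # Check declared number types
    for name, spec in schema.get(&amp;quot;properties&amp;quot;, {}).items():
        if name in args and spec.get(&amp;quot;type&amp;quot;) == &amp;quot;number&amp;quot; and not isinstance(args[name], (int, float)):
            return {&amp;quot;isError&amp;quot;: True, &amp;quot;content&amp;quot;: [
                {&amp;quot;type&amp;quot;: &amp;quot;text&amp;quot;, &amp;quot;text&amp;quot;: f&amp;quot;Error: {name!r} must be a number.&amp;quot;}]}
    return None

schema = {
    &amp;quot;type&amp;quot;: &amp;quot;object&amp;quot;,
    &amp;quot;properties&amp;quot;: {&amp;quot;a&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;number&amp;quot;}, &amp;quot;b&amp;quot;: {&amp;quot;type&amp;quot;: &amp;quot;number&amp;quot;}},
    &amp;quot;required&amp;quot;: [&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;],
}

assert validate_args(schema, {&amp;quot;a&amp;quot;: 3, &amp;quot;b&amp;quot;: 5}) is None
assert validate_args(schema, {&amp;quot;a&amp;quot;: 3})[&amp;quot;isError&amp;quot;] is True
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Returning the error as a result (rather than raising) lets the LLM read it and retry with corrected arguments.&lt;/p&gt;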
&lt;h4&gt;Example: Implementing a Tool Server in Python&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from mcp.server.fastmcp import FastMCP

mcp = FastMCP(&amp;quot;weather&amp;quot;)

@mcp.tool()
async def get_weather(city: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;Return current weather for a city.&amp;quot;&amp;quot;&amp;quot;
    data = await fetch_weather(city)  # your own weather-API helper
    return f&amp;quot;The temperature in {city} is {data[&apos;temp&apos;]}°C.&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tool will automatically appear in the tools/list response and can be invoked by the LLM or user.&lt;/p&gt;
&lt;h3&gt;Why Tools Matter for Agents&lt;/h3&gt;
&lt;p&gt;Agents aren’t just chatbots—they&apos;re interactive systems. Tools give them the ability to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Take real-world actions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Build dynamic workflows&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Chain reasoning across multiple steps&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Drive automation in safe, auditable ways&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Combined with resources, prompts, and sampling, tools make LLMs feel like collaborative assistants, not just text predictors.&lt;/p&gt;
&lt;h3&gt;Recap: Tools in MCP&lt;/h3&gt;
&lt;ul&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tool definition&lt;/td&gt;
&lt;td&gt;Name, description, and input schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invocation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;tools/call&lt;/code&gt; with arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;Text or structured response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case examples&lt;/td&gt;
&lt;td&gt;Shell commands, API calls, code generation, analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security guidelines&lt;/td&gt;
&lt;td&gt;Validate input, log usage, gate sensitive actions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Coming Up Next: Sampling and Prompts — Letting the Server Ask the Model for Help&lt;/h3&gt;
&lt;p&gt;In the final two posts of this series, we’ll explore:&lt;/p&gt;
&lt;p&gt;✅ Sampling — How servers can request completions from the LLM during workflows&lt;br&gt;
✅ Prompts — Reusable templates for user-driven or model-driven actions&lt;/p&gt;
&lt;p&gt;Tools give LLMs the power to act. With proper controls and schemas, they become safe, composable building blocks for real-world automation.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 8 - Resources in MCP — Serving Relevant Data Securely to LLMs</title><link>https://iceberglakehouse.com/posts/2025-04-resources-in-mcp/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-resources-in-mcp/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Sat, 12 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the previous post, we explored the architecture of the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;—a flexible, standardized way to connect LLMs to tools, data, and workflows. One of MCP’s most powerful capabilities is its ability to expose &lt;strong&gt;resources&lt;/strong&gt; to language models in a structured, secure, and controllable way.&lt;/p&gt;
&lt;p&gt;In this post, we’ll dive into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What MCP resources are&lt;/li&gt;
&lt;li&gt;How they’re discovered and accessed&lt;/li&gt;
&lt;li&gt;Text vs binary resources&lt;/li&gt;
&lt;li&gt;Dynamic templates and subscriptions&lt;/li&gt;
&lt;li&gt;Best practices for implementation and security&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want to give LLMs real, relevant context from your systems—without compromising safety or control—&lt;strong&gt;resources&lt;/strong&gt; are the foundation.&lt;/p&gt;
&lt;h2&gt;What Are Resources in MCP?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt; represent data that a model or client can read.&lt;/p&gt;
&lt;p&gt;This might include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Local files (e.g. &lt;code&gt;file:///logs/server.log&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Database records (e.g. &lt;code&gt;postgres://db/customers&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Web content (e.g. &lt;code&gt;https://api.example.com/data&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Images or screenshots (e.g. &lt;code&gt;screen://localhost/monitor1&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Structured system data (e.g. logs, metrics, config files)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each resource is identified by a &lt;strong&gt;URI&lt;/strong&gt;, and can be &lt;strong&gt;read&lt;/strong&gt;, &lt;strong&gt;discovered&lt;/strong&gt;, and optionally &lt;strong&gt;subscribed to&lt;/strong&gt; for updates.&lt;/p&gt;
&lt;h2&gt;Resource Discovery&lt;/h2&gt;
&lt;p&gt;Clients can ask a server to list available resources using:
&lt;code&gt;resources/list&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;The server responds with an array of structured metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;resources&amp;quot;: [
    {
      &amp;quot;uri&amp;quot;: &amp;quot;file:///logs/app.log&amp;quot;,
      &amp;quot;name&amp;quot;: &amp;quot;Application Logs&amp;quot;,
      &amp;quot;description&amp;quot;: &amp;quot;Recent server logs&amp;quot;,
      &amp;quot;mimeType&amp;quot;: &amp;quot;text/plain&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Clients (or users) can browse these like a menu, selecting what context to send to the model.&lt;/p&gt;
&lt;h3&gt;Resource Templates&lt;/h3&gt;
&lt;p&gt;In addition to static lists, servers can expose URI templates using RFC 6570 syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;uriTemplate&amp;quot;: &amp;quot;file:///logs/{date}.log&amp;quot;,
  &amp;quot;name&amp;quot;: &amp;quot;Log by Date&amp;quot;,
  &amp;quot;description&amp;quot;: &amp;quot;Access logs by date (e.g., 2024-04-01)&amp;quot;,
  &amp;quot;mimeType&amp;quot;: &amp;quot;text/plain&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows dynamic access to parameterized content—great for APIs, time-based logs, or file hierarchies.&lt;/p&gt;
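&lt;p&gt;Simple templates like the one above can be expanded in a few lines. This sketch handles only RFC 6570 level 1 (simple string expansion with percent-encoding), not the full spec:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import re
from urllib.parse import quote

def expand(template, values):
    &amp;quot;&amp;quot;&amp;quot;Substitute {name} variables, percent-encoding the values.&amp;quot;&amp;quot;&amp;quot;
    return re.sub(r&amp;quot;\{(\w+)\}&amp;quot;,
                  lambda m: quote(str(values[m.group(1)]), safe=&amp;quot;&amp;quot;),
                  template)

expand(&amp;quot;file:///logs/{date}.log&amp;quot;, {&amp;quot;date&amp;quot;: &amp;quot;2024-04-01&amp;quot;})
# -&amp;gt; &amp;quot;file:///logs/2024-04-01.log&amp;quot;
&lt;/code&gt;&lt;/pre&gt;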
&lt;h3&gt;Reading a Resource&lt;/h3&gt;
&lt;p&gt;To retrieve the content of a resource, clients use:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;resources/read&lt;/code&gt; with a payload like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;uri&amp;quot;: &amp;quot;file:///logs/app.log&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The server responds with the content in one of two formats:&lt;/p&gt;
&lt;h4&gt;Text Resource&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;contents&amp;quot;: [
    {
      &amp;quot;uri&amp;quot;: &amp;quot;file:///logs/app.log&amp;quot;,
      &amp;quot;mimeType&amp;quot;: &amp;quot;text/plain&amp;quot;,
      &amp;quot;text&amp;quot;: &amp;quot;Error: Timeout on request...\n&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Binary Resource (e.g. image, PDF)&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;contents&amp;quot;: [
    {
      &amp;quot;uri&amp;quot;: &amp;quot;screen://localhost/display1&amp;quot;,
      &amp;quot;mimeType&amp;quot;: &amp;quot;image/png&amp;quot;,
      &amp;quot;blob&amp;quot;: &amp;quot;iVBORw0KGgoAAAANSUhEUgAAA...&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
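&lt;p&gt;The &lt;code&gt;blob&lt;/code&gt; field is the binary content base64-encoded, which a server can produce in one line (the PNG header bytes below stand in for real image data):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import base64

png_bytes = b&amp;quot;\x89PNG\r\n\x1a\n&amp;quot;  # PNG magic bytes as a stand-in for a real image
blob = base64.b64encode(png_bytes).decode(&amp;quot;ascii&amp;quot;)
# blob == &amp;quot;iVBORw0KGgo=&amp;quot;, safe to embed in the JSON response
&lt;/code&gt;&lt;/pre&gt;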
&lt;p&gt;Clients can choose how and when to inject these into the model’s prompt, depending on MIME type and length.&lt;/p&gt;
&lt;h3&gt;Real-Time Updates&lt;/h3&gt;
&lt;p&gt;Resources aren’t static—they can change. MCP supports subscriptions to keep context fresh.&lt;/p&gt;
&lt;h4&gt;List Updates&lt;/h4&gt;
&lt;p&gt;If the list of resources changes, the server can notify the client with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;notifications/resources/list_changed
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Useful when new logs, files, or endpoints become available.&lt;/p&gt;
&lt;h4&gt;Content Updates&lt;/h4&gt;
&lt;p&gt;Clients can subscribe to specific resource URIs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;resources/subscribe
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the resource changes, the server sends:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;notifications/resources/updated
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is ideal for live logs, dashboards, or real-time documents.&lt;/p&gt;
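&lt;p&gt;Both the subscribe request and the update notification carry just the resource URI in their params (a sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;uri&amp;quot;: &amp;quot;file:///logs/app.log&amp;quot;
}
&lt;/code&gt;&lt;/pre&gt;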
&lt;h3&gt;Security Best Practices&lt;/h3&gt;
&lt;p&gt;Exposing resources to models requires careful control. MCP includes flexible patterns for securing access:&lt;/p&gt;
&lt;h4&gt;Best Practices for Server Developers&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Validate all URIs: No open file reads!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Whitelist paths or endpoints for file access&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use descriptive names and MIME types to help clients filter content&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Provide helpful descriptions for the LLM and user&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Support URI templates for scalable access&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Audit access and subscriptions&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Avoid leaking secrets in content or metadata&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Safe Log Server&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-ts&quot;&gt;server.setRequestHandler(ListResourcesRequestSchema, async () =&amp;gt; {
  return {
    resources: [
      {
        uri: &amp;quot;file:///logs/app.log&amp;quot;,
        name: &amp;quot;App Logs&amp;quot;,
        mimeType: &amp;quot;text/plain&amp;quot;
      }
    ]
  };
});

server.setRequestHandler(ReadResourceRequestSchema, async (request) =&amp;gt; {
  const uri = request.params.uri;

  if (!uri.startsWith(&amp;quot;file:///logs/&amp;quot;)) {
    throw new Error(&amp;quot;Access denied&amp;quot;);
  }

  const content = await readFile(new URL(uri), &amp;quot;utf-8&amp;quot;); // URL converts the file:// URI to a readable path; add sanitization here
  return {
    contents: [{
      uri,
      mimeType: &amp;quot;text/plain&amp;quot;,
      text: content
    }]
  };
});
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Why Resources Matter for AI Agents&lt;/h3&gt;
&lt;p&gt;LLMs are context-hungry. They reason better when they have:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Real-time logs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Source code&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;System metrics&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;API responses&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By serving these as resources, MCP gives agents the data they need—on demand, with full user control, and without bloating prompt templates.&lt;/p&gt;
&lt;h3&gt;Recap: Resources at a Glance&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;URI-based identifiers&lt;/td&gt;
&lt;td&gt;Unique path to each piece of content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text &amp;amp; binary support&lt;/td&gt;
&lt;td&gt;Suitable for logs, images, PDFs, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dynamic templates&lt;/td&gt;
&lt;td&gt;Construct URIs on the fly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time updates&lt;/td&gt;
&lt;td&gt;Subscriptions for changing content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secure access patterns&lt;/td&gt;
&lt;td&gt;URI validation, MIME filtering, whitelisting&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Coming Up Next: Tools in MCP — Giving LLMs the Power to Act&lt;/h3&gt;
&lt;p&gt;So far, we’ve shown how MCP feeds models with data. But what if we want the model to take action?&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore tools in MCP:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;How LLMs call functions safely&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tool schemas and invocation patterns&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Real-world examples: shell commands, API calls, and more&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 7 - Under the Hood — The Architecture of MCP and Its Core Components</title><link>https://iceberglakehouse.com/posts/2025-04-under-the-hood-of-mcp/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-under-the-hood-of-mcp/</guid><description>
# A Journey from AI to LLMs and MCP - 7 - Under the Hood — The Architecture of MCP and Its Core Components

## Free Resources  
- **[Free Apache Iceb...</description><pubDate>Fri, 11 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;A Journey from AI to LLMs and MCP - 7 - Under the Hood — The Architecture of MCP and Its Core Components&lt;/h1&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our last post, we introduced the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; as a standard way to connect AI models and agents to tools, data, and workflows—much like how the Apache Iceberg REST protocol brings interoperability to data engines.&lt;/p&gt;
&lt;p&gt;Now it’s time to open the black box.&lt;/p&gt;
&lt;p&gt;In this post, we’ll break down:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The architecture of MCP&lt;/li&gt;
&lt;li&gt;The responsibilities of hosts, clients, and servers&lt;/li&gt;
&lt;li&gt;The message lifecycle and transport layers&lt;/li&gt;
&lt;li&gt;How tools, resources, and prompts plug into the system&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By the end, you’ll understand &lt;strong&gt;how MCP enables secure, modular communication between LLMs and the systems they need to work with.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Big Picture: How MCP Fits Together&lt;/h2&gt;
&lt;p&gt;MCP follows a &lt;strong&gt;client-server architecture&lt;/strong&gt; that enables many-to-many connections between models and systems.&lt;/p&gt;
&lt;p&gt;Here’s the high-level setup:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+------------------------+      +--------------------+
|    Claude Desktop      |      |      Web IDE       |
| (Host + MCP Client)    |      | (Host + MCP Client)|
+------------------------+      +--------------------+
             |                         |
             |     MCP Protocol        |
             |                         |
             v                         v
+--------------------------+  +---------------------------+
|    Local Tool Server     |  |     Cloud API Server      |
| (Exposes tools/resources)|  |  (Exposes prompts/tools)  |
+--------------------------+  +---------------------------+

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each &lt;strong&gt;host&lt;/strong&gt; runs one or more &lt;strong&gt;clients&lt;/strong&gt;, which connect to independent &lt;strong&gt;MCP servers&lt;/strong&gt; exposing functionality in a standardized format.&lt;/p&gt;
&lt;h2&gt;Key Concepts&lt;/h2&gt;
&lt;p&gt;Let’s look at the core components that make this work.&lt;/p&gt;
&lt;h3&gt;1. Hosts&lt;/h3&gt;
&lt;p&gt;Hosts are the applications that run the LLM (e.g. Claude Desktop, VS Code extension, custom browser app). They manage:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model interaction (LLM prompts and completions)&lt;/li&gt;
&lt;li&gt;UI and user input&lt;/li&gt;
&lt;li&gt;A registry of connected clients&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;A host might display tools in a sidebar, allow users to pick files (resources), or visualize prompts in a command palette.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;2. Clients&lt;/h3&gt;
&lt;p&gt;An &lt;strong&gt;MCP client&lt;/strong&gt; lives inside a host and connects to a single MCP server. It handles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Transport layer setup (e.g. stdio or HTTP/SSE)&lt;/li&gt;
&lt;li&gt;Message exchange (requests, notifications, etc.)&lt;/li&gt;
&lt;li&gt;Proxying server capabilities to the host/model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each client maintains a &lt;strong&gt;1:1 connection with one server&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;3. Servers&lt;/h3&gt;
&lt;p&gt;Servers expose real-world capabilities using the MCP spec. They can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Serve &lt;strong&gt;resources&lt;/strong&gt; (files, logs, database records)&lt;/li&gt;
&lt;li&gt;Define and execute &lt;strong&gt;tools&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Offer reusable &lt;strong&gt;prompts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Request &lt;strong&gt;sampling&lt;/strong&gt; (LLM completions)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Servers can run locally (e.g. on your machine) or remotely (e.g. in a cloud API gateway), and can be implemented in any language (Python, TypeScript, C#, etc.).&lt;/p&gt;
&lt;h2&gt;Message Lifecycle in MCP&lt;/h2&gt;
&lt;p&gt;MCP uses a &lt;strong&gt;JSON-RPC 2.0 message format&lt;/strong&gt; to communicate between clients and servers. All communication flows through a structured lifecycle:&lt;/p&gt;
&lt;h3&gt;1. Initialization&lt;/h3&gt;
&lt;p&gt;Before communication starts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Client sends an &lt;code&gt;initialize&lt;/code&gt; request&lt;/li&gt;
&lt;li&gt;Server responds with capabilities&lt;/li&gt;
&lt;li&gt;Client sends an &lt;code&gt;initialized&lt;/code&gt; notification&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;This sets up feature negotiation and version compatibility.&lt;/p&gt;
&lt;/blockquote&gt;
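&lt;p&gt;To make this concrete, here is the handshake as raw JSON-RPC 2.0 messages. This is an illustrative sketch: the field names follow the spec as I understand it, and the protocol version string is a placeholder.&lt;/p&gt;

```python
import json

# Client to server: initialize request (the id marks it as a request).
initialize_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",  # placeholder version string
        "capabilities": {},               # features the client supports
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}

# Server to client: response advertising the server's capabilities.
initialize_response = {
    "jsonrpc": "2.0",
    "id": 1,  # matches the request id
    "result": {
        "protocolVersion": "2024-11-05",
        "capabilities": {"tools": {}, "resources": {}},
        "serverInfo": {"name": "example-server", "version": "0.1.0"},
    },
}

# Client to server: initialized notification (no id, no response expected).
initialized_notification = {"jsonrpc": "2.0", "method": "notifications/initialized"}

for message in (initialize_request, initialize_response, initialized_notification):
    print(json.dumps(message))
```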
&lt;h3&gt;2. Message Types&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Request&lt;/td&gt;
&lt;td&gt;A message expecting a response (e.g. &lt;code&gt;tools/call&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response&lt;/td&gt;
&lt;td&gt;Result from a request (e.g. tool output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notification&lt;/td&gt;
&lt;td&gt;One-way message with no response expected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error&lt;/td&gt;
&lt;td&gt;Sent when a request fails or is invalid&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each message is wrapped in a &lt;strong&gt;transport layer&lt;/strong&gt; (more on that next).&lt;/p&gt;
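&lt;p&gt;The four message types differ only in their fields: requests carry an &lt;code&gt;id&lt;/code&gt;, notifications omit it, and errors replace &lt;code&gt;result&lt;/code&gt; with an &lt;code&gt;error&lt;/code&gt; object. A sketch (the tool name, arguments, and notification method are illustrative):&lt;/p&gt;

```python
import json

# Request: has an id, expects a response.
request = {"jsonrpc": "2.0", "id": 7, "method": "tools/call",
           "params": {"name": "tail_logs", "arguments": {"lines": 50}}}

# Response: echoes the request id and carries a result.
response = {"jsonrpc": "2.0", "id": 7,
            "result": {"content": [{"type": "text", "text": "...log output..."}]}}

# Notification: no id, so no response is expected.
notification = {"jsonrpc": "2.0", "method": "notifications/resources/updated",
                "params": {"uri": "file:///var/log/app.log"}}

# Error: echoes the id, carries a code and message instead of a result.
error = {"jsonrpc": "2.0", "id": 7,
         "error": {"code": -32602, "message": "Unknown tool: tail_logs"}}

for msg in (request, response, notification, error):
    print(json.dumps(msg))
```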
&lt;h2&gt;Transport Layer — How Messages Move&lt;/h2&gt;
&lt;p&gt;MCP supports multiple transport mechanisms:&lt;/p&gt;
&lt;h3&gt;Stdio Transport&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Uses standard input/output&lt;/li&gt;
&lt;li&gt;Ideal for local tools and scripts&lt;/li&gt;
&lt;li&gt;Simple, reliable, and works well with command-line tools&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;HTTP + SSE Transport&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Uses HTTP POST for client-to-server messages&lt;/li&gt;
&lt;li&gt;Uses &lt;strong&gt;Server-Sent Events (SSE)&lt;/strong&gt; for real-time server-to-client updates&lt;/li&gt;
&lt;li&gt;Useful for remote or cloud-based servers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All transports carry JSON-RPC messages and follow the same protocol semantics.&lt;/p&gt;
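&lt;p&gt;For the stdio transport, a simple framing is one JSON-RPC message per line. A minimal, dependency-free sketch of that framing (an illustration, not a normative implementation):&lt;/p&gt;

```python
import json

def encode_line(message: dict) -> bytes:
    """Serialize a JSON-RPC message as one newline-terminated line."""
    return (json.dumps(message) + "\n").encode("utf-8")

def decode_lines(data: bytes):
    """Split a byte stream back into individual JSON-RPC messages."""
    for line in data.decode("utf-8").splitlines():
        if line.strip():
            yield json.loads(line)

ping = {"jsonrpc": "2.0", "id": 1, "method": "ping"}
pong = {"jsonrpc": "2.0", "id": 1, "result": {}}
stream = encode_line(ping) + encode_line(pong)
print([m["id"] for m in decode_lines(stream)])  # [1, 1]
```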
&lt;h2&gt;MCP Capabilities&lt;/h2&gt;
&lt;p&gt;MCP defines a small number of &lt;strong&gt;core capabilities&lt;/strong&gt;, each with its own request/response patterns.&lt;/p&gt;
&lt;h3&gt;Resources&lt;/h3&gt;
&lt;p&gt;Servers can expose structured data like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Files&lt;/li&gt;
&lt;li&gt;Logs&lt;/li&gt;
&lt;li&gt;API responses&lt;/li&gt;
&lt;li&gt;Screenshots or binary data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Clients can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;List available resources&lt;/li&gt;
&lt;li&gt;Read their contents&lt;/li&gt;
&lt;li&gt;Subscribe to updates (e.g. file changes)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Tools&lt;/h3&gt;
&lt;p&gt;Servers define &lt;strong&gt;callable functions&lt;/strong&gt; that agents can invoke. Each tool has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A name&lt;/li&gt;
&lt;li&gt;Description&lt;/li&gt;
&lt;li&gt;JSON schema for inputs&lt;/li&gt;
&lt;li&gt;Output format (text or structured)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tools are &lt;strong&gt;model-controlled&lt;/strong&gt;, meaning the LLM can decide which tool to use based on context.&lt;/p&gt;
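&lt;p&gt;Putting that together, a tool definition bundles a name, description, and JSON schema for inputs. The sketch below defines a hypothetical &lt;code&gt;tail_logs&lt;/code&gt; tool and performs a minimal required-field check (hand-rolled here to stay dependency-free; a real server would use a full JSON Schema validator):&lt;/p&gt;

```python
tail_logs_tool = {
    "name": "tail_logs",
    "description": "Return the last N lines of a log file",
    "inputSchema": {
        "type": "object",
        "properties": {
            "path":  {"type": "string"},
            "lines": {"type": "integer"},
        },
        "required": ["path"],
    },
}

def check_required(tool: dict, arguments: dict) -> list:
    """Return the names of required arguments that are missing."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]

print(check_required(tail_logs_tool, {"lines": 50}))        # ['path']
print(check_required(tail_logs_tool, {"path": "app.log"}))  # []
```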
&lt;h3&gt;Prompts&lt;/h3&gt;
&lt;p&gt;Servers can expose &lt;strong&gt;reusable prompt templates&lt;/strong&gt; with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Named arguments&lt;/li&gt;
&lt;li&gt;Context bindings (e.g. resources)&lt;/li&gt;
&lt;li&gt;Multi-step workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Prompts are &lt;strong&gt;user-controlled&lt;/strong&gt;, meaning users select when to run them.&lt;/p&gt;
&lt;h3&gt;Sampling&lt;/h3&gt;
&lt;p&gt;Servers can &lt;strong&gt;ask&lt;/strong&gt; the host model for completions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Specify conversation history and preferences&lt;/li&gt;
&lt;li&gt;Include system prompt and context&lt;/li&gt;
&lt;li&gt;Receive structured completions (text, image, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This allows &lt;strong&gt;server-side workflows&lt;/strong&gt; to request natural language responses from the model in real time.&lt;/p&gt;
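&lt;p&gt;A sampling request flows in the opposite direction from a tool call: the server asks the client for a completion via &lt;code&gt;sampling/createMessage&lt;/code&gt;. The shape below is illustrative; the field names follow the spec as I understand it:&lt;/p&gt;

```python
sampling_request = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "sampling/createMessage",
    "params": {
        "messages": [
            {"role": "user",
             "content": {"type": "text", "text": "Summarize these log lines."}}
        ],
        "systemPrompt": "You are a log-analysis assistant.",
        "maxTokens": 200,
    },
}
```

&lt;p&gt;The host stays in the loop: it can show the request to the user, pick the model, and review the completion before returning it to the server.&lt;/p&gt;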
&lt;h2&gt;Security and Isolation&lt;/h2&gt;
&lt;p&gt;MCP provides strong boundaries between components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hosts control what clients and models can see&lt;/li&gt;
&lt;li&gt;Servers expose only the capabilities they choose&lt;/li&gt;
&lt;li&gt;Clients can sandbox or restrict tool access&lt;/li&gt;
&lt;li&gt;Sampling keeps users in control of what prompts and completions occur&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;This makes MCP suitable for sensitive environments like IDEs, enterprise apps, and privacy-conscious tools.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Why This Architecture Matters&lt;/h2&gt;
&lt;p&gt;By standardizing communication between LLMs and tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can plug a new tool into your environment without modifying your agent&lt;/li&gt;
&lt;li&gt;You can build servers once and use them across different LLM clients (Claude, custom, etc.)&lt;/li&gt;
&lt;li&gt;You get &lt;strong&gt;clear separation of concerns&lt;/strong&gt;: tools, data, and models are independently managed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;🔮 Coming Up Next: Resources in MCP — Serving Relevant Data Securely&lt;/h2&gt;
&lt;p&gt;In the next post, we’ll zoom in on the &lt;strong&gt;Resources&lt;/strong&gt; capability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How to structure resources&lt;/li&gt;
&lt;li&gt;How models use them&lt;/li&gt;
&lt;li&gt;Real-world use cases: logs, code, documents, screenshots&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Journey from AI to LLMs and MCP - 6 - Enter the Model Context Protocol (MCP) — The Interoperability Layer for AI Agents</title><link>https://iceberglakehouse.com/posts/2025-04-model-context-protocol/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-model-context-protocol/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Thu, 10 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve spent the last few posts exploring the growing power of AI agents—how they can reason, plan, and take actions across complex tasks. And we’ve looked at the frameworks that help us build these agents. But if you’ve worked with them, you’ve likely hit a wall:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hardcoded toolchains&lt;/li&gt;
&lt;li&gt;Limited to a specific LLM provider&lt;/li&gt;
&lt;li&gt;No easy way to share tools or data between agents&lt;/li&gt;
&lt;li&gt;No consistent interface across clients&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What if we had a &lt;strong&gt;standard&lt;/strong&gt; that let &lt;strong&gt;any agent talk to any data source or tool&lt;/strong&gt;, regardless of where it lives or what it’s built with?&lt;/p&gt;
&lt;p&gt;That’s exactly what the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; brings to the table.&lt;/p&gt;
&lt;p&gt;And if you’re from the data engineering world, MCP is to AI agents what the &lt;strong&gt;Apache Iceberg REST protocol&lt;/strong&gt; is to analytics:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A universal, pluggable interface that enables many clients to interact with many servers—without tight coupling.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;What Is the Model Context Protocol (MCP)?&lt;/h2&gt;
&lt;p&gt;MCP is an &lt;strong&gt;open protocol&lt;/strong&gt; that defines how LLM-powered applications (like agents, IDEs, or copilots) can access &lt;strong&gt;context, tools, and actions&lt;/strong&gt; in a standardized way.&lt;/p&gt;
&lt;p&gt;Think of it as the &amp;quot;interface layer&amp;quot; between:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Clients&lt;/strong&gt;: LLMs or AI agents that need context and capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Servers&lt;/strong&gt;: Local or remote services that expose data, tools, or prompts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hosts&lt;/strong&gt;: The environment where the LLM runs (e.g., Claude Desktop, a browser extension, or an IDE plugin)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It defines a &lt;strong&gt;common language&lt;/strong&gt; for exchanging:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Resources&lt;/strong&gt; (data the model can read)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tools&lt;/strong&gt; (functions the model can invoke)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompts&lt;/strong&gt; (templates the user or model can reuse)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sampling&lt;/strong&gt; (ways servers can request completions from the model)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This allows you to &lt;strong&gt;plug in new capabilities without rearchitecting your agent or retraining your model&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;🧱 How MCP Mirrors Apache Iceberg’s REST Protocol&lt;/h2&gt;
&lt;p&gt;Let’s draw the parallel:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Apache Iceberg REST&lt;/th&gt;
&lt;th&gt;Model Context Protocol (MCP)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Standardized API&lt;/td&gt;
&lt;td&gt;REST endpoints for table ops&lt;/td&gt;
&lt;td&gt;JSON-RPC messages for context/tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decouples client/server&lt;/td&gt;
&lt;td&gt;Any engine ↔ any Iceberg catalog&lt;/td&gt;
&lt;td&gt;Any LLM/agent ↔ any tool or data backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-client support&lt;/td&gt;
&lt;td&gt;Spark, Trino, Flink, Dremio&lt;/td&gt;
&lt;td&gt;Claude, custom agents, IDEs, terminals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pluggable backends&lt;/td&gt;
&lt;td&gt;S3, HDFS, MinIO, Pure Storage, GCS&lt;/td&gt;
&lt;td&gt;Filesystem, APIs, databases, web services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interoperable tooling&lt;/td&gt;
&lt;td&gt;REST = portable across ecosystems&lt;/td&gt;
&lt;td&gt;MCP = portable across LLM environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Just as Iceberg REST made it possible for &lt;strong&gt;Dremio&lt;/strong&gt; to talk to a table created in &lt;strong&gt;Snowflake&lt;/strong&gt;, MCP allows a tool exposed in &lt;strong&gt;Python on your laptop&lt;/strong&gt; to be used by an LLM in &lt;strong&gt;Claude Desktop&lt;/strong&gt;, a VS Code agent, or even a web-based chatbot.&lt;/p&gt;
&lt;h2&gt;🔁 MCP in Action — A Real-World Use Case&lt;/h2&gt;
&lt;p&gt;Imagine this workflow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You’re coding in an IDE powered by an AI assistant&lt;/li&gt;
&lt;li&gt;The model wants to read your logs and run some shell scripts&lt;/li&gt;
&lt;li&gt;Your data lives locally, and your tools are custom-built in Python&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With MCP:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The IDE (host) runs an &lt;strong&gt;MCP client&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Your Python tool is exposed via an &lt;strong&gt;MCP server&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The AI assistant (client) calls your custom “tail logs” tool&lt;/li&gt;
&lt;li&gt;The results are streamed back, all through the &lt;strong&gt;standardized protocol&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And tomorrow, you could replace that assistant with a different model or switch to a browser-based environment—and everything would still work.&lt;/p&gt;
&lt;h2&gt;The Core Components of MCP&lt;/h2&gt;
&lt;p&gt;Let’s break down the architecture:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Hosts&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;These are environments where the LLM application lives (e.g., Claude Desktop, your IDE). They manage connections to MCP clients.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Clients&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Embedded in the host, each client maintains a connection to a specific server. It speaks MCP’s message protocol and exposes capabilities upstream to the model.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Servers&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Programs that expose capabilities like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;resources/list&lt;/code&gt; and &lt;code&gt;resources/read&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tools/list&lt;/code&gt; and &lt;code&gt;tools/call&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prompts/list&lt;/code&gt; and &lt;code&gt;prompts/get&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sampling/createMessage&lt;/code&gt; (to request completions from the model)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Servers can live anywhere: locally on your machine, behind an API, or running in a cloud environment.&lt;/p&gt;
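&lt;p&gt;Conceptually, a server is a dispatcher from those method names to handlers. A toy in-memory sketch (not the official SDK; payload shapes are simplified):&lt;/p&gt;

```python
import json

# Toy in-memory resource store keyed by URI.
RESOURCES = {"file:///var/log/app.log": "line1\nline2\nline3"}

def handle(request: dict) -> dict:
    """Route a JSON-RPC request to the matching capability handler."""
    method, params = request["method"], request.get("params", {})
    if method == "resources/list":
        result = {"resources": [{"uri": uri} for uri in RESOURCES]}
    elif method == "resources/read":
        result = {"contents": [{"uri": params["uri"],
                                "text": RESOURCES[params["uri"]]}]}
    else:
        return {"jsonrpc": "2.0", "id": request["id"],
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

reply = handle({"jsonrpc": "2.0", "id": 1, "method": "resources/list"})
print(json.dumps(reply))
```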
&lt;h2&gt;What Can MCP Servers Do?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Expose local or remote files&lt;/strong&gt; (logs, documents, screenshots, live data)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Define tools&lt;/strong&gt; for executing business logic, running commands, or calling APIs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provide reusable prompt templates&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Request completions from the host model&lt;/strong&gt; (sampling)&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;And all of this is done in a protocol-agnostic, secure, pluggable format.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Why This Matters&lt;/h2&gt;
&lt;p&gt;With MCP, we finally get &lt;strong&gt;interoperability in the AI stack&lt;/strong&gt;—a shared interface layer between:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;LLMs and tools&lt;/li&gt;
&lt;li&gt;Agents and environments&lt;/li&gt;
&lt;li&gt;Models and real-world data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It gives us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Modularity&lt;/strong&gt;: Swap out components without breaking workflows&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reusability&lt;/strong&gt;: Build once, use everywhere&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;: Limit what models can see and do through capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Observability&lt;/strong&gt;: Track how tools are used and what context is passed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Language-agnostic integration&lt;/strong&gt;: Servers can be written in Python, JavaScript, C#, and more&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In short, MCP helps you go from &lt;strong&gt;monolithic, tangled agents&lt;/strong&gt; to &lt;strong&gt;modular, composable AI systems&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;What’s Next: Diving Deeper into MCP Internals&lt;/h2&gt;
&lt;p&gt;In the next few posts, we’ll dig into each part of MCP:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Message formats and lifecycle&lt;/li&gt;
&lt;li&gt;How resources and tools are structured&lt;/li&gt;
&lt;li&gt;Sampling, prompts, and real-time feedback loops&lt;/li&gt;
&lt;li&gt;Best practices for building your own MCP server&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 5 - AI Agent Frameworks — Benefits and Limitations</title><link>https://iceberglakehouse.com/posts/2025-04-ai-agent-frameworks/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-ai-agent-frameworks/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Wed, 09 Apr 2025 09:00:00 GMT</pubDate><content:encoded>
&lt;p&gt;In our last post, we explored what makes an &lt;strong&gt;AI agent&lt;/strong&gt; different from a traditional LLM—memory, tools, reasoning, and autonomy. These agents are the foundation of a new generation of intelligent applications.&lt;/p&gt;
&lt;p&gt;But how are these agents built today?&lt;/p&gt;
&lt;p&gt;Enter &lt;strong&gt;agent frameworks&lt;/strong&gt;—open-source libraries and developer toolkits that let you create goal-driven AI systems by wiring together models, memory, tools, and logic. These frameworks are enabling some of the most exciting innovations in the AI space... but they also come with trade-offs.&lt;/p&gt;
&lt;p&gt;In this post, we’ll dive into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What AI agent frameworks are&lt;/li&gt;
&lt;li&gt;The most popular frameworks available today&lt;/li&gt;
&lt;li&gt;The benefits they offer&lt;/li&gt;
&lt;li&gt;Where they fall short&lt;/li&gt;
&lt;li&gt;Why we need something more modular and flexible (spoiler: MCP)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Is an AI Agent Framework?&lt;/h2&gt;
&lt;p&gt;An AI agent framework is a development toolkit that simplifies the process of building &lt;strong&gt;LLM-powered systems&lt;/strong&gt; capable of reasoning, acting, and learning in real time. These frameworks abstract away much of the complexity involved in working with large language models (LLMs) by bundling together key components like memory, tools, task planning, and context management.&lt;/p&gt;
&lt;p&gt;Agent frameworks shift the focus from &amp;quot;generating text&amp;quot; to &amp;quot;completing goals.&amp;quot; They let developers orchestrate multi-step workflows where an LLM isn&apos;t just answering questions but taking action, executing logic, and retrieving relevant data.&lt;/p&gt;
&lt;h3&gt;Memory&lt;/h3&gt;
&lt;p&gt;Memory in AI agents refers to how information from past interactions is stored, retrieved, and reused. This can be split into two primary types:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Short-term memory&lt;/strong&gt;: Keeps track of the current conversation or task state. Usually implemented as a conversation history buffer or rolling context window.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Long-term memory&lt;/strong&gt;: Stores past interactions, facts, or discoveries for reuse across sessions. Typically backed by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;vector database&lt;/strong&gt; (e.g., Pinecone, FAISS, Weaviate)&lt;/li&gt;
&lt;li&gt;Embedding models that turn text into numerical vectors&lt;/li&gt;
&lt;li&gt;A retrieval layer that finds the most relevant memories using similarity search&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Text is embedded into a vector representation (via models like OpenAI’s &lt;code&gt;text-embedding-ada-002&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;These vectors are stored in a database&lt;/li&gt;
&lt;li&gt;When new input arrives, it’s embedded and compared to stored vectors&lt;/li&gt;
&lt;li&gt;Top matches are fetched and injected into the LLM prompt as background context&lt;/li&gt;
&lt;/ul&gt;
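&lt;p&gt;That retrieval step is just nearest-neighbor search over vectors. A dependency-free sketch with toy hand-made 3-dimensional embeddings (a real system would use an embedding model and a vector database):&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "memory store": text mapped to hand-made 3-d embeddings.
memory = {
    "Q1 sales rose 12%":        [0.9, 0.1, 0.0],
    "Server outage on Tuesday": [0.0, 0.2, 0.9],
    "Q2 sales forecast":        [0.8, 0.3, 0.1],
}

def top_matches(query_vec, k=2):
    """Return the k memories most similar to the query embedding."""
    ranked = sorted(memory, key=lambda text: cosine(query_vec, memory[text]),
                    reverse=True)
    return ranked[:k]

print(top_matches([1.0, 0.2, 0.0]))  # sales-related memories rank first
```

&lt;p&gt;The top matches would then be injected into the prompt as background context.&lt;/p&gt;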
&lt;h3&gt;Tools&lt;/h3&gt;
&lt;p&gt;Tools are external functions that the agent can invoke to perform actions or retrieve live information. These can include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Calling an API (e.g., weather, GitHub, SQL query)&lt;/li&gt;
&lt;li&gt;Executing a shell command or script&lt;/li&gt;
&lt;li&gt;Reading a file or database&lt;/li&gt;
&lt;li&gt;Sending a message or triggering an automation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Frameworks like &lt;strong&gt;LangChain&lt;/strong&gt;, &lt;strong&gt;AutoGPT&lt;/strong&gt;, and &lt;strong&gt;Semantic Kernel&lt;/strong&gt; often use JSON schemas to define tool inputs and outputs. LLMs &amp;quot;see&amp;quot; tool descriptions and decide when and how to invoke them.&lt;/p&gt;
&lt;p&gt;Under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each tool is registered with a name, description, and parameter schema&lt;/li&gt;
&lt;li&gt;The LLM is given a list of available tools and their specs&lt;/li&gt;
&lt;li&gt;When the LLM &amp;quot;decides&amp;quot; to use a tool, it returns a structured tool call (e.g., &lt;code&gt;{&amp;quot;name&amp;quot;: &amp;quot;search_docs&amp;quot;, &amp;quot;args&amp;quot;: {&amp;quot;query&amp;quot;: &amp;quot;sales trends&amp;quot;}}&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The framework intercepts the call, executes the corresponding function, and feeds the result back to the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This allows the agent to &amp;quot;act&amp;quot; on the world, not just describe it.&lt;/p&gt;
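&lt;p&gt;That intercept-execute-feed-back loop can be sketched in a few lines. The fake model below is a stand-in for a real LLM, and the tool is made up for illustration:&lt;/p&gt;

```python
import json

def search_docs(query: str) -> str:
    """Toy tool: pretend to search documentation."""
    return "3 results for: " + query

TOOLS = {"search_docs": search_docs}

def fake_model(prompt: str) -> str:
    """Stand-in for an LLM that decides to call a tool."""
    return json.dumps({"name": "search_docs", "args": {"query": "sales trends"}})

def run_step(prompt: str) -> str:
    """One agent step: get a tool call from the model, execute it, return output."""
    call = json.loads(fake_model(prompt))
    tool = TOOLS[call["name"]]          # intercept the structured call
    observation = tool(**call["args"])  # execute the real function
    return observation                  # fed back to the model as context

print(run_step("What are our sales trends?"))  # 3 results for: sales trends
```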
&lt;h3&gt;🧠 Reasoning and Planning&lt;/h3&gt;
&lt;p&gt;Reasoning is what enables agents to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Decompose goals into steps&lt;/li&gt;
&lt;li&gt;Decide what tools or memory to use&lt;/li&gt;
&lt;li&gt;Track intermediate results&lt;/li&gt;
&lt;li&gt;Adjust their strategy based on feedback&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Frameworks often support:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ReAct-style loops&lt;/strong&gt;: Reasoning + action → observation → repeat&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Planner-executor separation&lt;/strong&gt;: One model plans, another carries out steps&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Task graphs&lt;/strong&gt;: Nodes (LLM calls, tools, decisions) arranged in a DAG&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The LLM is prompted to plan tasks using a scratchpad (e.g., &amp;quot;Thought → Action → Observation&amp;quot;)&lt;/li&gt;
&lt;li&gt;The agent parses the output to decide the next step&lt;/li&gt;
&lt;li&gt;Control flow logic (loops, retries, branches) is often implemented in code, not by the model&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This turns the agent into a &lt;strong&gt;semi-autonomous problem-solver&lt;/strong&gt;, not just a one-shot prompt engine.&lt;/p&gt;
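&lt;p&gt;Parsing the scratchpad is plain string work: the framework scans the model’s output for the next action line. A minimal sketch (the &lt;code&gt;Action: tool[input]&lt;/code&gt; convention is one common format, not a standard):&lt;/p&gt;

```python
import re

SCRATCHPAD = """Thought: I should look up recent sales data.
Action: search_docs[sales trends]
Observation: 3 results found.
Thought: I have enough to answer."""

def next_action(text: str):
    """Extract (tool, argument) from the first Action line, if any."""
    match = re.search(r"^Action: (\w+)\[(.*)\]", text, flags=re.MULTILINE)
    if match:
        return match.group(1), match.group(2)
    return None  # no action requested; the model may be done

print(next_action(SCRATCHPAD))  # ('search_docs', 'sales trends')
```

&lt;p&gt;The surrounding control flow (run the tool, append an Observation line, prompt again) lives in framework code, not in the model.&lt;/p&gt;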
&lt;h3&gt;🧾 Context Management&lt;/h3&gt;
&lt;p&gt;Context management is about deciding &lt;strong&gt;what information gets passed into the LLM prompt&lt;/strong&gt; at any given time. This is critical because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Token limits constrain how much data can be included&lt;/li&gt;
&lt;li&gt;Irrelevant information can degrade model performance&lt;/li&gt;
&lt;li&gt;Sensitive data must be filtered for security and compliance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Frameworks handle context by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Selecting relevant memory or documents via vector search&lt;/li&gt;
&lt;li&gt;Condensing history into summaries&lt;/li&gt;
&lt;li&gt;Prioritizing inputs (e.g., task instructions, user preferences, retrieved data)&lt;/li&gt;
&lt;li&gt;Inserting only high-signal content into the prompt&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Under the hood:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Context is assembled as structured messages (usually in OpenAI or Anthropic chat formats)&lt;/li&gt;
&lt;li&gt;Some frameworks dynamically prune, summarize, or chunk data to fit within model limits&lt;/li&gt;
&lt;li&gt;Smart caching or pagination may be used to maintain continuity across long sessions&lt;/li&gt;
&lt;/ul&gt;
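&lt;p&gt;A crude version of that pruning: keep the system message plus as many recent turns as fit a token budget, dropping the oldest first. Token counting is approximated by word count here; real frameworks use a proper tokenizer:&lt;/p&gt;

```python
def trim_to_budget(messages, budget):
    """Keep the system message plus the newest messages that fit the budget."""
    def cost(msg):
        return len(msg["content"].split())  # crude token estimate

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    kept, used = [], sum(cost(m) for m in system)
    for msg in reversed(rest):          # walk newest-first
        if used + cost(msg) > budget:
            break                       # oldest remaining turns are dropped
        kept.append(msg)
        used += cost(msg)
    return system + list(reversed(kept))

history = [
    {"role": "system", "content": "You are a helpful analyst."},
    {"role": "user", "content": "Summarize last quarter in detail please"},
    {"role": "assistant", "content": "Revenue grew twelve percent overall"},
    {"role": "user", "content": "And this quarter?"},
]
trimmed = trim_to_budget(history, budget=15)
print([m["content"] for m in trimmed])  # the oldest user turn is dropped
```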
&lt;p&gt;Agent frameworks abstract complex functionality into composable components:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;How It Works Under the Hood&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Recalls past interactions and facts&lt;/td&gt;
&lt;td&gt;Vector embeddings, similarity search, context injection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;Executes real-world actions&lt;/td&gt;
&lt;td&gt;Function schemas, LLM tool calls, output feedback loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;Plans steps, decides next action&lt;/td&gt;
&lt;td&gt;Thought-action-observation loops, scratchpads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Mgmt&lt;/td&gt;
&lt;td&gt;Curates what the model sees&lt;/td&gt;
&lt;td&gt;Dynamic prompt construction, summarization, filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Together, these allow developers to build &lt;strong&gt;goal-seeking agents&lt;/strong&gt; that work across domains—analytics, support, operations, creative work, and more.&lt;/p&gt;
&lt;p&gt;Agent frameworks provide the scaffolding. LLMs provide the intelligence.&lt;/p&gt;
&lt;h2&gt;Popular AI Agent Frameworks&lt;/h2&gt;
&lt;p&gt;Let’s look at some of the leading options:&lt;/p&gt;
&lt;h3&gt;LangChain&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;: Python, JavaScript&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Large ecosystem of components&lt;/li&gt;
&lt;li&gt;Support for chains, tools, memory, agents&lt;/li&gt;
&lt;li&gt;Integrates with most major LLMs, vector DBs, and APIs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limitations&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Can become overly complex&lt;/li&gt;
&lt;li&gt;Boilerplate-heavy for simple tasks&lt;/li&gt;
&lt;li&gt;Hard to reason about internal agent state&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;AutoGPT / BabyAGI&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;: Python&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Fully autonomous task execution loops&lt;/li&gt;
&lt;li&gt;Goal-first architecture (recursive reasoning)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limitations&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Unpredictable behavior (&amp;quot;runaway agents&amp;quot;)&lt;/li&gt;
&lt;li&gt;Tooling and error handling are immature&lt;/li&gt;
&lt;li&gt;Not production-grade (yet)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Semantic Kernel (Microsoft)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;: C#, Python&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Enterprise-ready tooling&lt;/li&gt;
&lt;li&gt;Strong integration with Microsoft ecosystems&lt;/li&gt;
&lt;li&gt;Planner APIs and plugin system&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limitations&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Steeper learning curve&lt;/li&gt;
&lt;li&gt;Limited community and examples&lt;/li&gt;
&lt;li&gt;More opinionated structure&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;CrewAI / MetaGPT&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;: Python&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Strengths&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Multi-agent collaboration&lt;/li&gt;
&lt;li&gt;Role-based task assignment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limitations&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Heavy on orchestration&lt;/li&gt;
&lt;li&gt;Still early in maturity&lt;/li&gt;
&lt;li&gt;Debugging agent interactions is hard&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Benefits of Using an Agent Framework&lt;/h2&gt;
&lt;p&gt;These tools have unlocked new possibilities for developers building AI-powered workflows. Let’s summarize the major benefits:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Abstractions for Tools&lt;/td&gt;
&lt;td&gt;Call APIs or local functions directly from within agent flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in Memory&lt;/td&gt;
&lt;td&gt;Manage short-term context and long-term recall without manual prompt engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modular Design&lt;/td&gt;
&lt;td&gt;Compose systems using interchangeable components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planning + Looping&lt;/td&gt;
&lt;td&gt;Support multi-step task execution with feedback loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rapid Prototyping&lt;/td&gt;
&lt;td&gt;Build functional AI assistants quickly with reusable components&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In short: &lt;strong&gt;agent frameworks supercharge developer productivity&lt;/strong&gt; when working with LLMs.&lt;/p&gt;
&lt;h2&gt;Where Agent Frameworks Fall Short&lt;/h2&gt;
&lt;p&gt;Despite all their strengths, modern agent frameworks share some core limitations:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Tight Coupling to Models and Providers&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Most frameworks are tightly bound to OpenAI, Anthropic, or Hugging Face models. Switching providers—or supporting multiple—is complex and risky.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Want to try Claude instead of GPT-4? You might need to refactor your entire chain.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;2. &lt;strong&gt;Context Management Is Manual and Error-Prone&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Choosing what context to pass to the LLM (memory, docs, prior results) is often left to the developer. It’s:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hard to debug&lt;/li&gt;
&lt;li&gt;Easy to overrun token limits&lt;/li&gt;
&lt;li&gt;Non-standardized&lt;/li&gt;
&lt;/ul&gt;
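&lt;p&gt;To make the problem concrete, here is a minimal sketch of the kind of hand-rolled context assembly these frameworks leave to you. Every name here (the 4-characters-per-token heuristic, the priority order, the function names) is an illustrative assumption, not any framework’s actual API:&lt;/p&gt;

```python
# Hypothetical sketch of manual context assembly under a rough token budget.
# The 4-chars-per-token estimate and the priority scheme are assumptions.

def rough_token_count(text):
    # Crude heuristic: roughly 4 characters per token for English text.
    return len(text) // 4

def build_context(system, memory, docs, query, budget=1000):
    """Concatenate context pieces, skipping lower-priority ones that overflow."""
    parts = [system, query]          # always keep the system prompt and the query
    optional = memory + docs         # these get dropped first when over budget
    for piece in optional:
        used = sum(rough_token_count(p) for p in parts)
        # max(0, ...) is zero only when the piece still fits within the budget
        if max(0, used + rough_token_count(piece) - budget) == 0:
            parts.append(piece)      # silently skipped otherwise: a classic bug source
    return "\n\n".join(parts)

prompt = build_context(
    system="You are a helpful assistant.",
    memory=["User prefers concise answers."],
    docs=["(long retrieved document...)" * 200],   # too big, gets dropped silently
    query="Summarize our last conversation.",
)
```

&lt;p&gt;Note that the oversized document is dropped without any signal to the caller, which is exactly the kind of failure that is hard to debug and easy to miss until a token-limit error surfaces.&lt;/p&gt;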
&lt;h3&gt;3. &lt;strong&gt;Lack of Interoperability&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Most frameworks don’t play well together. Tools, memory stores, and prompt logic often live in their own silos.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can’t easily plug a LangChain tool into a Semantic Kernel workflow.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;4. &lt;strong&gt;Hard to Secure and Monitor&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Giving agents tool access (e.g., shell commands, APIs) is powerful but risky:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;No standard for input validation&lt;/li&gt;
&lt;li&gt;No logging/auditing for tool usage&lt;/li&gt;
&lt;li&gt;Few controls for human-in-the-loop approvals&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;Opaque Agent Logic&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Agents often make decisions that are hard to trace or debug. Why did the agent call that tool? Why did it loop forever?&lt;/p&gt;
&lt;h2&gt;The Missing Layer: Standardized Context + Tool Protocols&lt;/h2&gt;
&lt;p&gt;We need a better abstraction layer—something that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Decouples LLMs from the tools and data they use&lt;/li&gt;
&lt;li&gt;Allows agents to access secure, structured resources&lt;/li&gt;
&lt;li&gt;Enables modular, composable agents across languages and platforms&lt;/li&gt;
&lt;li&gt;Works with any client, model, or provider&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s where the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; comes in.&lt;/p&gt;
&lt;h2&gt;What’s Next: Introducing the Model Context Protocol (MCP)&lt;/h2&gt;
&lt;p&gt;In the next post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What MCP is&lt;/li&gt;
&lt;li&gt;How it enables secure, flexible agent architectures&lt;/li&gt;
&lt;li&gt;Why it&apos;s the “USB-C port” for LLMs and tools&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ll walk through the architecture and show how MCP solves many of the problems outlined in this post.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 4 - What Are AI Agents — And Why They&apos;re the Future of LLM Applications</title><link>https://iceberglakehouse.com/posts/2025-04-what-are-ai-agents/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-what-are-ai-agents/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Tue, 08 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve explored how Large Language Models (LLMs) work, and how we can improve their performance with fine-tuning, prompt engineering, and retrieval-augmented generation (RAG). These enhancements are powerful—but they’re still fundamentally &lt;em&gt;stateless&lt;/em&gt; and reactive.&lt;/p&gt;
&lt;p&gt;To build systems that act with purpose, adapt over time, and accomplish multi-step goals, we need something more.&lt;/p&gt;
&lt;p&gt;That “something” is the &lt;strong&gt;AI Agent&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In this post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What AI agents are&lt;/li&gt;
&lt;li&gt;How they differ from LLMs&lt;/li&gt;
&lt;li&gt;What components make up an agent&lt;/li&gt;
&lt;li&gt;Real-world examples of agent use&lt;/li&gt;
&lt;li&gt;Why agents are a crucial next step for AI&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What Is an AI Agent?&lt;/h2&gt;
&lt;p&gt;At a high level, an &lt;strong&gt;AI agent&lt;/strong&gt; is an autonomous or semi-autonomous system built around an LLM, capable of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Observing its environment (inputs, tools, data)&lt;/li&gt;
&lt;li&gt;Reasoning or planning&lt;/li&gt;
&lt;li&gt;Taking actions&lt;/li&gt;
&lt;li&gt;Learning or adapting over time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;LLMs generate responses, but &lt;strong&gt;agents make decisions&lt;/strong&gt;. They don’t just answer; they &lt;em&gt;think, decide, and act&lt;/em&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Think of the difference between a calculator and a virtual assistant. One gives answers. The other &lt;em&gt;gets things done&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;The Core Ingredients of an AI Agent&lt;/h2&gt;
&lt;p&gt;Let’s break down what typically makes up an agentic system:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;LLM Core&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The brain of the operation. Handles natural language understanding and generation.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Tools / Actions&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Agents can execute external commands, like calling APIs, querying databases, or running code.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Memory&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Persistent memory lets agents recall previous interactions, facts, or task states.&lt;/p&gt;
&lt;h3&gt;4. &lt;strong&gt;Planner / Executor Logic&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;This is where agents shine. They can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Break down complex goals into subtasks&lt;/li&gt;
&lt;li&gt;Decide which tools or steps to take&lt;/li&gt;
&lt;li&gt;Loop, retry, or adapt based on results&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;Context Manager&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Decides what information (memory, documents, tool results) gets included in each LLM prompt.&lt;/p&gt;
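&lt;p&gt;The five ingredients above can be tied together in a single loop. This is a toy sketch, not a real framework: the model call is faked so it “plans” by requesting a tool once and then answering, and all names (call_llm, TOOLS, run_agent) are illustrative assumptions:&lt;/p&gt;

```python
# Minimal agent loop sketch combining the five ingredients.
# call_llm is a stand-in for a real model call: it requests a tool once,
# then emits a final answer once a result appears in the prompt.

def call_llm(prompt):
    if "RESULT:" in prompt:
        return "FINAL: The weather in Paris is sunny."
    return "TOOL: get_weather Paris"

TOOLS = {"get_weather": lambda city: f"sunny in {city}"}   # Tools / Actions
memory = []                                                # Memory

def run_agent(goal, max_steps=5):
    for _ in range(max_steps):                             # Planner / Executor loop
        # Context Manager: decide what the model sees on each turn
        prompt = f"Goal: {goal}\n" + "\n".join(memory)
        reply = call_llm(prompt)                           # LLM Core
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL: ")
        _, tool_name, arg = reply.split(" ", 2)
        result = TOOLS[tool_name](arg)                     # take an action
        memory.append(f"RESULT: {result}")                 # remember the outcome
    return "gave up"

print(run_agent("What is the weather in Paris?"))
```

&lt;p&gt;Even in this toy form, the loop shows the key difference from a bare LLM call: the model’s output drives actions, and the results of those actions flow back into the next prompt.&lt;/p&gt;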
&lt;h2&gt;LLM vs AI Agent — Key Differences&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;LLM&lt;/th&gt;
&lt;th&gt;AI Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Input&lt;/td&gt;
&lt;td&gt;Prompt&lt;/td&gt;
&lt;td&gt;Prompt + tools + state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Ephemeral (context)&lt;/td&gt;
&lt;td&gt;Persistent (via external memory)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;Single-shot&lt;/td&gt;
&lt;td&gt;Multi-step planning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action-taking&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (tools, APIs, workflows)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Autonomy&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Optional (user- or goal-directed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adaptability&lt;/td&gt;
&lt;td&gt;Static behavior&lt;/td&gt;
&lt;td&gt;Dynamic, can learn from feedback&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;LLMs are the engine. Agents are the vehicle.&lt;/p&gt;
&lt;h2&gt;Examples of AI Agents in the Wild&lt;/h2&gt;
&lt;p&gt;Let’s explore how AI agents are already showing up in real-world applications:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Developer Copilots&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Tools like GitHub Copilot or Cursor act as coding assistants, not just autocomplete engines. They:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read your project files&lt;/li&gt;
&lt;li&gt;Ask clarifying questions&lt;/li&gt;
&lt;li&gt;Suggest multi-line changes&lt;/li&gt;
&lt;li&gt;Run code against test cases&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Document Q&amp;amp;A Assistants&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Instead of just answering questions, agents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Search relevant documents&lt;/li&gt;
&lt;li&gt;Summarize findings&lt;/li&gt;
&lt;li&gt;Ask follow-up questions&lt;/li&gt;
&lt;li&gt;Offer next actions (e.g., generate reports)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Research Agents&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Given a broad prompt like &lt;em&gt;“summarize recent news on AI regulation,”&lt;/em&gt; agents:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Plan a research strategy&lt;/li&gt;
&lt;li&gt;Browse the web or internal data&lt;/li&gt;
&lt;li&gt;Synthesize and refine results&lt;/li&gt;
&lt;li&gt;Ask for confirmation before continuing&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Agents Enable Autonomy and Feedback Loops&lt;/h2&gt;
&lt;p&gt;Unlike plain LLMs, agents can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;tools&lt;/strong&gt; to gather more info&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Loop&lt;/strong&gt; on tasks until a condition is met&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Store and recall&lt;/strong&gt; what they’ve seen&lt;/li&gt;
&lt;li&gt;Chain multiple steps together&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;For example:&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; Schedule a meeting with Alice&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Agent:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Search calendar availability&lt;/li&gt;
&lt;li&gt;Find Alice’s preferred times&lt;/li&gt;
&lt;li&gt;Draft an email proposal&lt;/li&gt;
&lt;li&gt;Wait for response&lt;/li&gt;
&lt;li&gt;Reschedule if needed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s not a single LLM prompt—that’s an intelligent system managing an evolving task.&lt;/p&gt;
&lt;h2&gt;How Are Agents Built Today?&lt;/h2&gt;
&lt;p&gt;A number of popular &lt;strong&gt;AI agent frameworks&lt;/strong&gt; have emerged:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;LangChain&lt;/strong&gt;: Modular orchestration of LLMs, tools, and memory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AutoGPT&lt;/strong&gt;: Autonomous task completion with iterative planning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic Kernel&lt;/strong&gt;: Microsoft’s framework for embedding LLMs into software&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CrewAI / MetaGPT&lt;/strong&gt;: Multi-agent systems with defined roles&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These frameworks let developers prototype powerful workflows, but they come with challenges—especially around complexity, tool integration, and portability.&lt;/p&gt;
&lt;p&gt;We’ll explore those challenges in the next post.&lt;/p&gt;
&lt;h2&gt;Limitations of Today’s Agent Implementations&lt;/h2&gt;
&lt;p&gt;While agents are promising, current frameworks have some limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tight coupling&lt;/strong&gt; to specific models or tools&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Difficult interoperability&lt;/strong&gt; between agent components&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context juggling&lt;/strong&gt;: hard to manage what the model sees&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security and control&lt;/strong&gt;: risk of unsafe tool access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hard to debug&lt;/strong&gt;: agents can go rogue or get stuck in loops&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To address these, we need &lt;strong&gt;standardization&lt;/strong&gt;—a modular way to plug in data, tools, and models securely and flexibly.&lt;/p&gt;
&lt;p&gt;That’s where the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; enters the picture.&lt;/p&gt;
&lt;h2&gt;Coming Up Next: AI Agent Frameworks — Benefits and Limitations&lt;/h2&gt;
&lt;p&gt;In our next post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How modern agent frameworks work&lt;/li&gt;
&lt;li&gt;What they enable (and where they fall short)&lt;/li&gt;
&lt;li&gt;The missing layer that MCP provides&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 3 - Boosting LLM Performance — Fine-Tuning, Prompt Engineering, and RAG</title><link>https://iceberglakehouse.com/posts/2025-04-boosting-llm-performance/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-boosting-llm-performance/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Mon, 07 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our last post, we explored how LLMs process text using embeddings and vector spaces within limited context windows. While LLMs are powerful out-of-the-box, they aren’t perfect—and in many real-world scenarios, we need to push them further.&lt;/p&gt;
&lt;p&gt;That’s where enhancement techniques come in.&lt;/p&gt;
&lt;p&gt;In this post, we’ll walk through the three most popular and practical ways to &lt;strong&gt;boost the performance of Large Language Models (LLMs)&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fine-tuning&lt;/li&gt;
&lt;li&gt;Prompt engineering&lt;/li&gt;
&lt;li&gt;Retrieval-Augmented Generation (RAG)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each approach has its strengths, trade-offs, and ideal use cases. By the end, you’ll know when to use each—and how they work under the hood.&lt;/p&gt;
&lt;h2&gt;1. Fine-Tuning — Teaching the Model New Tricks&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt; is the process of training an existing LLM on custom datasets to improve its behavior on specific tasks.&lt;/p&gt;
&lt;h3&gt;How it works:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You take a pre-trained model (like GPT or LLaMA).&lt;/li&gt;
&lt;li&gt;You feed it new examples in a structured format (instructions + completions).&lt;/li&gt;
&lt;li&gt;The model updates its internal weights based on this new data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Think of it like giving the model a focused education after it’s graduated from a general AI university.&lt;/p&gt;
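&lt;p&gt;The “instructions + completions” format typically looks like one JSON object per line (JSONL). The field names below are illustrative; providers vary in the exact schema they expect:&lt;/p&gt;

```python
# Hedged sketch of the structured "instructions + completions" data shape
# commonly used for fine-tuning. Field names here are assumptions, not any
# specific provider's required schema.
import json

examples = [
    {"instruction": "Summarize: Our Q3 revenue grew 12%.",
     "completion": "Q3 revenue was up 12%."},
    {"instruction": "Classify sentiment: I love this product!",
     "completion": "positive"},
]

# One JSON object per line (JSONL), the usual interchange format for tuning data.
jsonl = "\n".join(json.dumps(e) for e in examples)
print(jsonl)
```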
&lt;h3&gt;When to use it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You want a custom assistant that uses your company’s voice&lt;/li&gt;
&lt;li&gt;You need the model to perform a specialized task (e.g., legal analysis, medical diagnostics)&lt;/li&gt;
&lt;li&gt;You have recurring, structured inputs that aren’t handled well with prompting alone&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Trade-offs:&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Highly accurate for specific tasks&lt;/td&gt;
&lt;td&gt;Expensive (compute + time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduces prompt complexity&lt;/td&gt;
&lt;td&gt;Risk of overfitting or forgetting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works well offline or locally&lt;/td&gt;
&lt;td&gt;Not ideal for frequently changing data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;Fine-tuning is powerful, but it’s not always the first choice—especially when you need flexibility or real-time knowledge.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;2. Prompt Engineering — Speaking the Model’s Language&lt;/h2&gt;
&lt;p&gt;Sometimes, you don’t need to retrain the model—you just need to &lt;em&gt;talk to it better&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt engineering&lt;/strong&gt; is the art of crafting inputs that guide the model to behave the way you want. It’s fast, flexible, and doesn’t require model access.&lt;/p&gt;
&lt;h3&gt;Prompting patterns:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Zero-shot prompting&lt;/strong&gt;: Just ask a question
&lt;blockquote&gt;
&lt;p&gt;“Summarize this article.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Few-shot prompting&lt;/strong&gt;: Show examples
&lt;blockquote&gt;
&lt;p&gt;“Here’s how I want you to respond…”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chain-of-Thought (CoT)&lt;/strong&gt;: Encourage reasoning
&lt;blockquote&gt;
&lt;p&gt;“Let’s think step by step…”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Tools and techniques:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Templates: Reusable format strings with variables&lt;/li&gt;
&lt;li&gt;Constraints: “Answer in JSON” or “Limit to 100 words”&lt;/li&gt;
&lt;li&gt;Personas: “You are a helpful legal assistant...”&lt;/li&gt;
&lt;li&gt;System prompts (where supported): Define role and tone&lt;/li&gt;
&lt;/ul&gt;
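&lt;p&gt;A reusable template often combines several of these techniques at once. The sketch below mixes a persona, a length constraint, and a few-shot example; the wording and variable names are assumptions about a typical setup, not any library’s prompt format:&lt;/p&gt;

```python
# Illustrative prompt template: persona + constraint + one few-shot example.
# The content and field names are hypothetical.

TEMPLATE = """You are a helpful legal assistant.
Answer in at most {word_limit} words.

Example:
Q: {example_q}
A: {example_a}

Q: {question}
A:"""

def render(question):
    return TEMPLATE.format(
        word_limit=100,
        example_q="Can I break a lease early?",
        example_a="Usually only under conditions named in the lease itself.",
        question=question,
    )

print(render("What is a liquidated damages clause?"))
```

&lt;p&gt;Keeping the persona, constraints, and examples in one template makes iteration fast: you change a variable rather than hand-editing every prompt.&lt;/p&gt;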
&lt;h3&gt;When to use it:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;You’re working with a hosted LLM (OpenAI, Anthropic, etc.)&lt;/li&gt;
&lt;li&gt;You want to avoid infrastructure and cost overhead&lt;/li&gt;
&lt;li&gt;You need to quickly iterate and improve outcomes&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Trade-offs:&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fast to test and implement&lt;/td&gt;
&lt;td&gt;Sensitive to wording&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Doesn’t require model access&lt;/td&gt;
&lt;td&gt;Can be brittle or unpredictable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Great for prototyping&lt;/td&gt;
&lt;td&gt;Doesn’t scale well for complex logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;Prompt engineering is like UX for AI—small changes in input can completely change the output.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;3. Retrieval-Augmented Generation (RAG) — Give the Model Real-Time Knowledge&lt;/h2&gt;
&lt;p&gt;RAG is a game-changer for context-aware applications.&lt;/p&gt;
&lt;p&gt;Instead of cramming all your knowledge into a model, &lt;strong&gt;RAG retrieves relevant information at runtime&lt;/strong&gt; and includes it in the prompt.&lt;/p&gt;
&lt;h3&gt;How it works:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;User sends a query&lt;/li&gt;
&lt;li&gt;System runs a &lt;strong&gt;semantic search&lt;/strong&gt; over a vector database&lt;/li&gt;
&lt;li&gt;Top-matching documents are inserted into the prompt&lt;/li&gt;
&lt;li&gt;The LLM generates a response using both query + retrieved context&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This gives you &lt;strong&gt;dynamic, real-time access&lt;/strong&gt; to external knowledge—without retraining.&lt;/p&gt;
&lt;h3&gt;Typical RAG architecture:&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;User → Query → Vector Search (Embeddings) → Top K Documents → LLM Prompt → Response
&lt;/code&gt;&lt;/pre&gt;
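&lt;p&gt;The pipeline above can be sketched end to end in a few lines. The bag-of-words “embedding” here is a deliberate toy stand-in for a real embedding model and vector database, but the flow (embed, rank by similarity, build the prompt) is the same:&lt;/p&gt;

```python
# Toy end-to-end RAG sketch: embed, retrieve top-k by cosine similarity,
# then assemble the prompt. Real systems use learned embeddings and a
# vector database instead of this bag-of-words stand-in.
from collections import Counter
import math
import re

def embed(text):
    # Toy "embedding": word counts, ignoring punctuation and case.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

docs = [
    "Reset your password from the account settings page.",
    "Our refund policy allows returns within 30 days.",
    "Contact support via the in-app chat widget.",
]

def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]                      # Top K documents

def build_prompt(query):
    context = "\n".join(retrieve(query))   # retrieved context goes into the prompt
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_prompt("How do I reset my password?"))
```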
&lt;h3&gt;Use case examples:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Chatbots that answer questions from company docs&lt;/li&gt;
&lt;li&gt;Developer copilots that can search codebases&lt;/li&gt;
&lt;li&gt;LLMs that read log files, support tickets, or PDFs&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Trade-offs:&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pros&lt;/th&gt;
&lt;th&gt;Cons&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Real-time access to changing data&lt;/td&gt;
&lt;td&gt;Adds latency due to search layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No need to retrain the model&lt;/td&gt;
&lt;td&gt;Requires infrastructure (DB + search)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keeps context windows lean&lt;/td&gt;
&lt;td&gt;Needs good chunking &amp;amp; ranking logic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;blockquote&gt;
&lt;p&gt;With RAG, your LLM becomes a smart interface to &lt;em&gt;your&lt;/em&gt; data—not just the internet.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Choosing the Right Enhancement Technique&lt;/h2&gt;
&lt;p&gt;Here’s a quick cheat sheet to help you choose:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Best Technique&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Specialize a model on internal tasks&lt;/td&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guide output or behavior flexibly&lt;/td&gt;
&lt;td&gt;Prompt engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inject dynamic, real-time knowledge&lt;/td&gt;
&lt;td&gt;Retrieval-Augmented Gen&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Often, the best systems &lt;strong&gt;combine&lt;/strong&gt; these techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fine-tuned base model&lt;/li&gt;
&lt;li&gt;With prompt templates&lt;/li&gt;
&lt;li&gt;And external knowledge via RAG&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is exactly what advanced AI agent systems are starting to do—and it’s where we’re heading next.&lt;/p&gt;
&lt;h2&gt;Recap: Boosting LLMs Is All About Context and Control&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Ideal For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fine-Tuning&lt;/td&gt;
&lt;td&gt;Teaches model new behavior&lt;/td&gt;
&lt;td&gt;Repetitive, specialized tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt Engineering&lt;/td&gt;
&lt;td&gt;Crafts effective inputs&lt;/td&gt;
&lt;td&gt;Fast prototyping, hosted models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Adds knowledge dynamically at runtime&lt;/td&gt;
&lt;td&gt;Large, evolving, external datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2&gt;Up Next: What Are AI Agents — And Why They’re the Future&lt;/h2&gt;
&lt;p&gt;Now that we’ve learned how to enhance individual LLMs, the next evolution is combining them with tools, memory, and logic to create &lt;strong&gt;AI Agents&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What makes something an AI agent&lt;/li&gt;
&lt;li&gt;How agents orchestrate LLMs + tools&lt;/li&gt;
&lt;li&gt;Why they’re essential for real-world use&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 2 - How LLMs Work — Embeddings, Vectors, and Context Windows</title><link>https://iceberglakehouse.com/posts/2025-04-how-llms-work/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-how-llms-work/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Sun, 06 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our last post, we explored the evolution of AI—from rule-based systems to deep learning—and how &lt;strong&gt;Large Language Models (LLMs)&lt;/strong&gt; like GPT-4 and Claude represent a transformative leap in capability.&lt;/p&gt;
&lt;p&gt;But how do these models &lt;em&gt;actually&lt;/em&gt; work?&lt;/p&gt;
&lt;p&gt;In this post, we’ll peel back the curtain on the inner workings of LLMs. We’ll explore the fundamental concepts that make these models tick: &lt;strong&gt;embeddings&lt;/strong&gt;, &lt;strong&gt;vector spaces&lt;/strong&gt;, and &lt;strong&gt;context windows&lt;/strong&gt;. You’ll walk away with a clearer understanding of how LLMs “understand” language—and what their limits are.&lt;/p&gt;
&lt;h2&gt;How LLMs Think: It’s All Math Underneath&lt;/h2&gt;
&lt;p&gt;Despite their fluent text output, LLMs don’t truly &amp;quot;understand&amp;quot; language in the human sense. Instead, they operate on numerical representations of text, using vast networks of mathematical weights to predict the next word in a sequence.&lt;/p&gt;
&lt;p&gt;The key mechanism behind this: &lt;strong&gt;transformers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Transformers revolutionized NLP by letting models weigh the relevance of every word in a sentence simultaneously, through &lt;strong&gt;attention mechanisms&lt;/strong&gt;, instead of processing words one at a time as RNNs do.&lt;/p&gt;
&lt;p&gt;Here’s the simplified flow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Text is &lt;strong&gt;tokenized&lt;/strong&gt; (split into chunks)&lt;/li&gt;
&lt;li&gt;Tokens are converted into &lt;strong&gt;embeddings&lt;/strong&gt; (vectors)&lt;/li&gt;
&lt;li&gt;Those vectors pass through &lt;strong&gt;layers of attention&lt;/strong&gt; to capture meaning&lt;/li&gt;
&lt;li&gt;The model generates the next token based on probability&lt;/li&gt;
&lt;/ol&gt;
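&lt;p&gt;The last step of that flow can be made concrete with a toy example: turning the model’s raw scores over a tiny vocabulary into probabilities (softmax) and picking the most likely next token. Real models do this over vocabularies of tens of thousands of tokens:&lt;/p&gt;

```python
# Toy next-token step: softmax over raw scores, then pick the most likely token.
# The vocabulary and scores are made up for illustration.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["mat", "moon", "banana"]
logits = [3.1, 0.4, -2.0]   # hypothetical scores for "The cat sat on the ..."
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]
print(next_token)  # "mat"
```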
&lt;p&gt;But what are these &lt;strong&gt;embeddings&lt;/strong&gt; and why do they matter?&lt;/p&gt;
&lt;h2&gt;Embeddings: From Words to Numbers&lt;/h2&gt;
&lt;p&gt;Before an LLM can do anything with language, it must convert words into numbers it can operate on.&lt;/p&gt;
&lt;p&gt;That’s where &lt;strong&gt;embeddings&lt;/strong&gt; come in.&lt;/p&gt;
&lt;h3&gt;What is an embedding?&lt;/h3&gt;
&lt;p&gt;An embedding is a &lt;strong&gt;high-dimensional vector&lt;/strong&gt; (think: a long list of numbers) that represents the meaning of a word or phrase.&lt;/p&gt;
&lt;p&gt;Words with similar meanings have &lt;strong&gt;similar embeddings&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Embedding(&amp;quot;dog&amp;quot;)   ≈ Embedding(&amp;quot;puppy&amp;quot;)
Embedding(&amp;quot;Paris&amp;quot;) ≈ Embedding(&amp;quot;London&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These vectors live in an abstract &lt;strong&gt;vector space&lt;/strong&gt;, where distance encodes similarity.&lt;/p&gt;
&lt;p&gt;LLMs use embeddings not just for input, but throughout every layer of their neural network to understand relationships, context, and meaning.&lt;/p&gt;
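&lt;p&gt;You can see “similar meanings, similar vectors” with hand-made 3-dimensional embeddings and cosine similarity. Real models use hundreds or thousands of dimensions, and these particular vectors are invented purely for illustration:&lt;/p&gt;

```python
# Toy illustration of embedding similarity with invented 3-D vectors.
import math

vectors = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.0],
    "paris": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine(vectors["dog"], vectors["puppy"]))  # close to 1: similar meaning
print(cosine(vectors["dog"], vectors["paris"]))  # near 0: unrelated
```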
&lt;h2&gt;Vector Search and Semantic Understanding&lt;/h2&gt;
&lt;p&gt;Because embeddings encode meaning, they’re also incredibly useful for &lt;strong&gt;semantic search&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Instead of matching exact words (like keyword search), vector search compares embeddings to find text that’s &lt;em&gt;conceptually&lt;/em&gt; similar.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Query: &amp;quot;How do I fix a leaking pipe?&amp;quot;&lt;/li&gt;
&lt;li&gt;Match: &amp;quot;Plumbing repair for minor water leaks&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even though the words don’t overlap, the &lt;strong&gt;meaning&lt;/strong&gt; does—and that’s what embeddings capture.&lt;/p&gt;
&lt;p&gt;This is the foundation for many powerful AI techniques like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Document similarity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt; (more on this in Blog 3)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context injection from external data sources&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
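A brute-force version of vector search is simply "rank every document by cosine similarity to the query." In this sketch the vectors are invented by hand; a real system would produce them with an embedding model and use a vector index at scale:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 4-dimensional stand-ins, keyed by the text they represent.
documents = {
    "Plumbing repair for minor water leaks": np.array([0.9, 0.1, 0.2, 0.1]),
    "Best hiking trails near Denver":        np.array([0.1, 0.9, 0.1, 0.2]),
    "Annual report on water utility rates":  np.array([0.5, 0.1, 0.7, 0.1]),
}
# Pretend embedding of the query "How do I fix a leaking pipe?"
query = np.array([0.85, 0.15, 0.25, 0.1])

# Rank documents by similarity to the query and take the best match.
ranked = sorted(documents, key=lambda d: cosine(query, documents[d]), reverse=True)
print(ranked[0])  # prints "Plumbing repair for minor water leaks"
```

Note that the top result shares no keywords with the query; the match comes entirely from vector proximity.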
&lt;h2&gt;Context Windows: The Model’s Working Memory&lt;/h2&gt;
&lt;p&gt;Another crucial concept in LLMs is the &lt;strong&gt;context window&lt;/strong&gt;—the maximum number of tokens the model can “see” at once.&lt;/p&gt;
&lt;p&gt;Every input to an LLM gets broken into &lt;strong&gt;tokens&lt;/strong&gt;, and the model has a limited capacity for how many tokens it can process per request.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Max Context Window&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-3.5&lt;/td&gt;
&lt;td&gt;4,096 tokens (~3,000 words)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;Up to 128,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3 Opus&lt;/td&gt;
&lt;td&gt;Up to 200,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you go over the limit, you’ll need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Truncate input (losing information)&lt;/li&gt;
&lt;li&gt;Summarize&lt;/li&gt;
&lt;li&gt;Use techniques like RAG or memory management&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: The larger the context window, the more the model can “remember” during a conversation or task.&lt;/p&gt;
&lt;/blockquote&gt;
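The truncation option above can be sketched with a naive budget check. The 4-characters-per-token estimate is a rough rule of thumb for English text; production code would count tokens with the model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit in the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):      # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > max_tokens:
            break                        # budget exhausted: drop older messages
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = ["a" * 400, "b" * 400, "c" * 400]     # roughly 100 tokens each
trimmed = fit_to_window(history, max_tokens=250)  # keeps only the two newest
```

This "drop the oldest turns" policy is the simplest strategy; summarization and RAG (covered later in the series) preserve more information for the same budget.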
&lt;h2&gt;Limitations of Embeddings and Context Windows&lt;/h2&gt;
&lt;p&gt;Even though LLMs are powerful, they come with trade-offs:&lt;/p&gt;
&lt;h3&gt;Embedding limitations:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Don’t always reflect &lt;strong&gt;nuanced context&lt;/strong&gt; (e.g., sarcasm, tone)&lt;/li&gt;
&lt;li&gt;Fixed dimensionality: can’t represent &lt;em&gt;everything&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Require separate handling for different modalities (text vs images)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Context window limitations:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Long documents may get truncated or ignored&lt;/li&gt;
&lt;li&gt;Memory is &lt;em&gt;not&lt;/em&gt; persistent—everything resets after a session unless you manually re-include previous context&lt;/li&gt;
&lt;li&gt;More tokens = higher latency and cost&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These limits are precisely why so much effort goes into &lt;strong&gt;enhancing&lt;/strong&gt; LLMs through fine-tuning, retrieval systems, and smarter prompt engineering.&lt;/p&gt;
&lt;p&gt;We’ll dive into that next.&lt;/p&gt;
&lt;h2&gt;Recap: Key Concepts from This Post&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;Vector representations of tokens/text&lt;/td&gt;
&lt;td&gt;Enable semantic understanding &amp;amp; search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector Space&lt;/td&gt;
&lt;td&gt;Mathematical space where embeddings live&lt;/td&gt;
&lt;td&gt;Allows similarity comparison &amp;amp; clustering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Window&lt;/td&gt;
&lt;td&gt;Max token size per LLM input&lt;/td&gt;
&lt;td&gt;Defines how much the model can “see”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Attention&lt;/td&gt;
&lt;td&gt;Weighs token relationships dynamically&lt;/td&gt;
&lt;td&gt;Enables context awareness in LLMs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;🔮 Up Next: Making LLMs Smarter with Fine-Tuning, Prompt Engineering, and RAG&lt;/h2&gt;
&lt;p&gt;In our next post, we’ll show how to &lt;strong&gt;enhance LLM performance&lt;/strong&gt; using proven techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fine-tuning&lt;/li&gt;
&lt;li&gt;Prompt engineering&lt;/li&gt;
&lt;li&gt;Retrieval-Augmented Generation (RAG)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These strategies help you move beyond limitations—and get the most out of your models.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Journey from AI to LLMs and MCP - 1 - What Is AI and How It Evolved Into LLMs</title><link>https://iceberglakehouse.com/posts/2025-04-What-is-AI-and-How-It-Evolved-Into-LLMs/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-What-is-AI-and-How-It-Evolved-Into-LLMs/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Sat, 05 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=AItoLLMS&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Artificial Intelligence (AI) has become the defining technology of the decade. From chatbots to code generators, from self-driving cars to predictive text—AI systems are everywhere. But before we dive into the cutting-edge world of large language models (LLMs), let’s rewind and understand where this all began.&lt;/p&gt;
&lt;p&gt;This post kicks off our 10-part series exploring how AI evolved into LLMs, how to enhance their capabilities, and how the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; is shaping the future of intelligent, modular agents.&lt;/p&gt;
&lt;h2&gt;🧠 A Brief History of AI&lt;/h2&gt;
&lt;p&gt;The term &amp;quot;Artificial Intelligence&amp;quot; was coined in 1956, but the idea has been around even longer—think mechanical automatons and Alan Turing’s famous question: &lt;em&gt;&amp;quot;Can machines think?&amp;quot;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;AI development has gone through several distinct waves:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Symbolic AI (1950s–1980s)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Also known as &amp;quot;Good Old-Fashioned AI,&amp;quot; symbolic systems were rule-based. Think expert systems, logic programming, and hand-coded decision trees. These systems could play chess or diagnose medical conditions—if you wrote enough rules.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;: Rigid, brittle, and poor at handling ambiguity.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Machine Learning (1990s–2010s)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Instead of coding rules manually, we trained models to recognize patterns from data. Algorithms like decision trees, support vector machines, and early neural networks emerged.&lt;/p&gt;
&lt;p&gt;This era gave us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Spam filters&lt;/li&gt;
&lt;li&gt;Fraud detection&lt;/li&gt;
&lt;li&gt;Recommendation engines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But while powerful, these models still had a hard time with natural language and context.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Deep Learning (2010s–Now)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;With more data, better algorithms, and stronger GPUs, neural networks started outperforming traditional methods. Deep learning led to breakthroughs in:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Image recognition (CNNs)&lt;/li&gt;
&lt;li&gt;Speech recognition (RNNs, LSTMs)&lt;/li&gt;
&lt;li&gt;Language understanding (Transformers)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And that brings us to the latest evolution...&lt;/p&gt;
&lt;h2&gt;🧬 Enter LLMs: The Rise of Language-First AI&lt;/h2&gt;
&lt;p&gt;Large Language Models (LLMs) like GPT-4, Claude, and Gemini aren’t just another step in AI—they represent a leap. Trained on massive text corpora using &lt;strong&gt;transformer architectures&lt;/strong&gt;, these models can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Write essays and poems&lt;/li&gt;
&lt;li&gt;Generate and debug code&lt;/li&gt;
&lt;li&gt;Translate between languages&lt;/li&gt;
&lt;li&gt;Answer complex questions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All by predicting the next word in a sentence.&lt;/p&gt;
&lt;p&gt;But what makes LLMs so powerful?&lt;/p&gt;
&lt;h2&gt;🏗️ LLMs Are More Than Just Big Neural Nets&lt;/h2&gt;
&lt;p&gt;At their core, LLMs are massive deep learning models that turn &lt;strong&gt;tokens (words/pieces of words)&lt;/strong&gt; into &lt;strong&gt;vectors (mathematical representations)&lt;/strong&gt;. Through billions of parameters, they learn the structure of language and the latent meaning within it.&lt;/p&gt;
&lt;p&gt;Key components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tokenization&lt;/strong&gt;: Breaking input into chunks the model can process&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Embeddings&lt;/strong&gt;: Mapping tokens to vector space&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Attention Mechanisms&lt;/strong&gt;: Letting the model focus on relevant parts of the input&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context Window&lt;/strong&gt;: A memory buffer for how much input the model can “see”&lt;/li&gt;
&lt;/ul&gt;
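To make the first component concrete, here is a deliberately simplified tokenizer that splits text into words and punctuation. Real tokenizers (BPE, WordPiece) split into subword units instead, so rare words become several tokens:

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Split into runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("LLMs turn text into tokens!")
print(tokens)  # prints ['LLMs', 'turn', 'text', 'into', 'tokens', '!']
```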
&lt;p&gt;Popular LLMs:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;th&gt;Notable Feature&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Up to 128k&lt;/td&gt;
&lt;td&gt;Code + natural language synergy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Up to 200k&lt;/td&gt;
&lt;td&gt;Strong at instruction following&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;Google DeepMind&lt;/td&gt;
&lt;td&gt;~32k+&lt;/td&gt;
&lt;td&gt;Multimodal capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;🧩 What LLMs Can (and Can’t) Do&lt;/h2&gt;
&lt;p&gt;LLMs are versatile and impressive—but they&apos;re not magic. Their strengths come with real limitations:&lt;/p&gt;
&lt;h3&gt;✅ What they’re great at:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Text generation and summarization&lt;/li&gt;
&lt;li&gt;Conversational interfaces&lt;/li&gt;
&lt;li&gt;Programming assistance&lt;/li&gt;
&lt;li&gt;Knowledge retrieval from training data&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;❌ What they struggle with:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt;: No persistent memory across sessions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context limits&lt;/strong&gt;: Can only “see” a fixed number of tokens&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reasoning&lt;/strong&gt;: Struggles with complex multi-step logic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-time data&lt;/strong&gt;: Can’t access up-to-date or private information&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Action-taking&lt;/strong&gt;: Can&apos;t interact with tools or APIs by default&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where the next evolution comes in: &lt;strong&gt;augmenting LLMs&lt;/strong&gt; with context, tools, and workflows.&lt;/p&gt;
&lt;h2&gt;🔮 The Road Ahead: From Models to Modular AI Agents&lt;/h2&gt;
&lt;p&gt;We’ve gone from rules to learning, from deep learning to LLMs—but we’re not done yet. The future of AI lies in making LLMs &lt;em&gt;do more than just talk&lt;/em&gt;. We need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Give them memory&lt;/li&gt;
&lt;li&gt;Let them interact with data&lt;/li&gt;
&lt;li&gt;Enable them to call tools, services, and APIs&lt;/li&gt;
&lt;li&gt;Help them make decisions and reason through complex tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This brings us to the idea of &lt;strong&gt;AI Agents&lt;/strong&gt;—autonomous systems built on LLMs that can perceive, decide, and act.&lt;/p&gt;
&lt;h3&gt;🧭 Coming Up Next&lt;/h3&gt;
&lt;p&gt;In our next post, we’ll explore &lt;strong&gt;how LLMs actually work&lt;/strong&gt; under the hood—digging into embeddings, vector spaces, and how models “understand” language.&lt;/p&gt;
&lt;p&gt;Stay tuned.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Building a Basic MCP Server with Python</title><link>https://iceberglakehouse.com/posts/2025-04-basics-of-making-mcp-server/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-04-basics-of-making-mcp-server/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 04 Apr 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=mcp_basic&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=mcp_basic&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you’ve ever wished you could ask an AI model like Claude to interact with your local files or run custom code—good news: &lt;strong&gt;you can.&lt;/strong&gt; That’s exactly what the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; makes possible.&lt;/p&gt;
&lt;p&gt;In this tutorial, we’ll walk you through building a beginner-friendly &lt;strong&gt;MCP server&lt;/strong&gt; that acts as a simple template for future projects. You don’t need to be an expert in AI or server development—we’ll explain each part as we go.&lt;/p&gt;
&lt;p&gt;Here’s what we’ll build:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A small server using Python and the &lt;strong&gt;MCP SDK&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Two useful &lt;strong&gt;tools&lt;/strong&gt; that read data from:
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;CSV file&lt;/strong&gt; (great for spreadsheets and tabular data)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Parquet file&lt;/strong&gt; (a format often used in data engineering and analytics)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A clean folder structure that makes it easy to add new tools or features later&lt;/li&gt;
&lt;li&gt;A working connection to &lt;strong&gt;Claude for Desktop&lt;/strong&gt;, so you can ask things like:
&lt;blockquote&gt;
&lt;p&gt;“Summarize the contents of my data file”&lt;br&gt;
“How many rows and columns are in this CSV?”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why Start Here?&lt;/h3&gt;
&lt;p&gt;This blog is perfect for you if:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’ve heard about Claude and want to connect it to your own tools or data&lt;/li&gt;
&lt;li&gt;You’re curious about MCP and want to see how it works in practice&lt;/li&gt;
&lt;li&gt;You’d like a solid starting point for building more advanced tool servers later&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ll use plain Python and some common libraries like &lt;code&gt;pandas&lt;/code&gt;, with no web frameworks or deployment complexity. Everything will run locally on your machine.&lt;/p&gt;
&lt;p&gt;By the end, you’ll have a fully working &lt;strong&gt;local MCP server&lt;/strong&gt; and a better understanding of how to make AI tools that go beyond text prediction—and actually do useful work.&lt;/p&gt;
&lt;p&gt;Let’s get started!&lt;/p&gt;
&lt;h2&gt;What Is MCP (and Why Should You Care)?&lt;/h2&gt;
&lt;p&gt;Let’s break this down before we start writing code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; stands for &lt;strong&gt;Model Context Protocol&lt;/strong&gt;. It’s a way to let apps like Claude for Desktop securely interact with &lt;strong&gt;external data&lt;/strong&gt; and &lt;strong&gt;custom tools&lt;/strong&gt; that you define.&lt;/p&gt;
&lt;p&gt;Think of it like building your own mini API—but instead of exposing it to the whole internet, you’re exposing it to an AI assistant on your machine.&lt;/p&gt;
&lt;p&gt;With MCP, you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Let Claude read a file or query a database&lt;/li&gt;
&lt;li&gt;Create tools that do useful things (like summarize a dataset or fetch an API)&lt;/li&gt;
&lt;li&gt;Add reusable prompts to guide how Claude behaves in certain tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For this project, we’re focusing on &lt;strong&gt;tools&lt;/strong&gt;—the part of MCP that lets you write small Python functions the AI can call.&lt;/p&gt;
&lt;h3&gt;What We’re Building&lt;/h3&gt;
&lt;p&gt;Here’s a quick preview of what you’ll end up with:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A local MCP server called &lt;code&gt;mix_server&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Two tools: one that reads a CSV file, and one that reads a Parquet file&lt;/li&gt;
&lt;li&gt;A clean, modular folder layout so you can keep adding more tools later&lt;/li&gt;
&lt;li&gt;A working connection to Claude for Desktop so you can talk to your tools through natural language&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s start by setting up your project.&lt;/p&gt;
&lt;h2&gt;Project Setup (Step-by-Step)&lt;/h2&gt;
&lt;p&gt;We’ll use &lt;a href=&quot;https://github.com/astral-sh/uv&quot;&gt;&lt;strong&gt;uv&lt;/strong&gt;&lt;/a&gt;—a fast, modern Python project manager—to create and manage our environment. It handles dependencies, virtual environments, and script execution, all in one place.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you’ve used &lt;code&gt;pip&lt;/code&gt; or &lt;code&gt;virtualenv&lt;/code&gt; before, uv is like both of those combined—but much faster and more ergonomic.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Step 1: Install &lt;code&gt;uv&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;To install &lt;code&gt;uv&lt;/code&gt;, run this in your terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -LsSf https://astral.sh/uv/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then restart your terminal so the uv command is available.&lt;/p&gt;
&lt;p&gt;You can check that it&apos;s working with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv --version
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Create the Project&lt;/h3&gt;
&lt;p&gt;Let’s make a new folder for our MCP server:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv init mix_server
cd mix_server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a basic Python project with a pyproject.toml file to manage dependencies.&lt;/p&gt;
&lt;h3&gt;Step 3: Set Up a Virtual Environment&lt;/h3&gt;
&lt;p&gt;We’ll now create a virtual environment for our project and activate it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv venv
source .venv/bin/activate
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This keeps your dependencies isolated from the rest of your system.&lt;/p&gt;
&lt;h3&gt;Step 4: Add Required Dependencies&lt;/h3&gt;
&lt;p&gt;We’re going to install three key packages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;mcp[cli]&lt;/code&gt;: The official MCP SDK and command-line tools&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pandas&lt;/code&gt;: For reading CSV and Parquet files&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pyarrow&lt;/code&gt;: Adds support for reading Parquet files via Pandas&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Install them using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv add &amp;quot;mcp[cli]&amp;quot; pandas pyarrow
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This updates your pyproject.toml and installs the packages into your environment.&lt;/p&gt;
&lt;h3&gt;Step 5: Create a Clean Folder Structure&lt;/h3&gt;
&lt;p&gt;We’ll use the following layout to stay organized:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mix_server/
│
├── data/                 # Sample CSV and Parquet files
│
├── tools/                # MCP tool definitions
│
├── utils/                # Reusable file reading logic
│
├── server.py             # Creates the Server
├── main.py               # Entry point for the MCP server
└── README.md             # Optional documentation
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create the folders:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mkdir data tools utils
touch server.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Your environment is now ready. In the next section, we’ll create a couple of small data files to work with—a CSV and a Parquet file—and use them to power our tools.&lt;/p&gt;
&lt;h2&gt;Creating Sample Data Files&lt;/h2&gt;
&lt;p&gt;To build our first tools, we need something for them to work with. In this section, we’ll create two simple files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;CSV file&lt;/strong&gt; (great for spreadsheets and tabular data)&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Parquet file&lt;/strong&gt; (a more efficient format used in data engineering)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both files will contain the same mock dataset—a short list of users. You’ll use these files later when building tools that summarize their contents.&lt;/p&gt;
&lt;h3&gt;Step 1: Create the &lt;code&gt;data/&lt;/code&gt; Folder&lt;/h3&gt;
&lt;p&gt;If you haven’t already created the folder for our data, do it now from your project root:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mkdir data
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Create a Sample CSV File&lt;/h3&gt;
&lt;p&gt;Now let’s add a sample CSV file with some fake user data.&lt;/p&gt;
&lt;p&gt;Create a new file called sample.csv inside the data/ folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;data/sample.csv
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And paste the following into it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-csv&quot;&gt;id,name,email,signup_date
1,Alice Johnson,alice@example.com,2023-01-15
2,Bob Smith,bob@example.com,2023-02-22
3,Carol Lee,carol@example.com,2023-03-10
4,David Wu,david@example.com,2023-04-18
5,Eva Brown,eva@example.com,2023-05-30
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This file gives us structured, readable data—perfect for a tool to analyze.&lt;/p&gt;
&lt;h3&gt;Step 3: Convert the CSV to Parquet&lt;/h3&gt;
&lt;p&gt;We’ll now create a Parquet version of the same data using Python. This shows how easily you can support both file types in your tools.&lt;/p&gt;
&lt;p&gt;Create a short script in the root of your project called generate_parquet.py:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# generate_parquet.py

import pandas as pd

# Read the CSV
df = pd.read_csv(&amp;quot;data/sample.csv&amp;quot;)

# Save as Parquet
df.to_parquet(&amp;quot;data/sample.parquet&amp;quot;, index=False)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv run generate_parquet.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After this, your data/ folder should look like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;data/
├── sample.csv
└── sample.parquet
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;What’s the Difference Between CSV and Parquet?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CSV:&lt;/strong&gt; Simple, human-readable text file. Great for small datasets and quick inspection.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Parquet:&lt;/strong&gt; A binary, column-based format. Much faster for large datasets and common in analytics pipelines (e.g. with Apache Spark or Dremio).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Supporting both formats makes your tools more flexible, and this example shows how little extra effort it takes.&lt;/p&gt;
&lt;p&gt;Next, we’ll write some reusable utility functions that can read these files and return a quick summary of their contents—ready to be wrapped as MCP tools.&lt;/p&gt;
&lt;h2&gt;Writing Utility Functions to Read CSV and Parquet Files&lt;/h2&gt;
&lt;p&gt;Now that we have some data to work with, let’s write the core logic to read those files and return a basic summary.&lt;/p&gt;
&lt;p&gt;We’re going to put this logic in a separate Python file under a folder called &lt;code&gt;utils/&lt;/code&gt;. This makes it easy to reuse across different tools without duplicating code.&lt;/p&gt;
&lt;h3&gt;Step 1: Create the Utility Module&lt;/h3&gt;
&lt;p&gt;If you haven’t already created the &lt;code&gt;utils/&lt;/code&gt; folder, do it now:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;mkdir utils
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now create a new Python file inside it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;touch utils/file_reader.py
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Add File Reading Functions&lt;/h3&gt;
&lt;p&gt;Open utils/file_reader.py and paste in the following code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# utils/file_reader.py

import pandas as pd
from pathlib import Path

# Base directory where our data lives
DATA_DIR = Path(__file__).resolve().parent.parent / &amp;quot;data&amp;quot;

def read_csv_summary(filename: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;
    Read a CSV file and return a simple summary.

    Args:
        filename: Name of the CSV file (e.g. &apos;sample.csv&apos;)

    Returns:
        A string describing the file&apos;s contents.
    &amp;quot;&amp;quot;&amp;quot;
    file_path = DATA_DIR / filename
    df = pd.read_csv(file_path)
    return f&amp;quot;CSV file &apos;{filename}&apos; has {len(df)} rows and {len(df.columns)} columns.&amp;quot;

def read_parquet_summary(filename: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;
    Read a Parquet file and return a simple summary.

    Args:
        filename: Name of the Parquet file (e.g. &apos;sample.parquet&apos;)

    Returns:
        A string describing the file&apos;s contents.
    &amp;quot;&amp;quot;&amp;quot;
    file_path = DATA_DIR / filename
    df = pd.read_parquet(file_path)
    return f&amp;quot;Parquet file &apos;{filename}&apos; has {len(df)} rows and {len(df.columns)} columns.&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;How This Works&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We’re using &lt;code&gt;pandas&lt;/code&gt; to read both &lt;code&gt;CSV&lt;/code&gt; and &lt;code&gt;Parquet&lt;/code&gt; files. It’s a well-known data analysis library in Python.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;pathlib.Path&lt;/code&gt; helps us safely construct file paths across operating systems.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both functions return a simple string like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;CSV file &apos;sample.csv&apos; has 5 rows and 4 columns.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is all the logic our tools will need to start with. Later, if you want to add more advanced summaries—like listing column names or detecting null values—you can expand these functions.&lt;/p&gt;
&lt;p&gt;With our utilities ready, we can now expose them as MCP tools—so Claude can actually use them!&lt;/p&gt;
&lt;h2&gt;Wrapping File Readers as MCP Tools&lt;/h2&gt;
&lt;p&gt;Now that we’ve written the logic to read and summarize our data files, it’s time to make those functions available to Claude through &lt;strong&gt;MCP tools&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;What’s an MCP Tool?&lt;/h3&gt;
&lt;p&gt;An &lt;strong&gt;MCP tool&lt;/strong&gt; is a Python function you register with your MCP server that the AI can call when it needs to take action—like reading a file, querying an API, or performing a calculation.&lt;/p&gt;
&lt;p&gt;To register a tool, you decorate the function with &lt;code&gt;@mcp.tool()&lt;/code&gt;. Behind the scenes, MCP generates a definition that the AI can see and interact with.&lt;/p&gt;
&lt;p&gt;But before we do that, let’s follow a best practice: &lt;strong&gt;we’ll define our MCP server instance in one central place&lt;/strong&gt;, then import it into each file that defines tools. This ensures everything stays clean and consistent.&lt;/p&gt;
&lt;h3&gt;Step 1: Define the MCP Server Instance&lt;/h3&gt;
&lt;p&gt;Open your &lt;code&gt;server.py&lt;/code&gt; and &lt;code&gt;main.py&lt;/code&gt; files (or create them if you haven’t already), and add the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# server.py

from mcp.server.fastmcp import FastMCP

# This is the shared MCP server instance
mcp = FastMCP(&amp;quot;mix_server&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# main.py

from server import mcp

# Entry point to run the server
if __name__ == &amp;quot;__main__&amp;quot;:
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a named server called &amp;quot;mix_server&amp;quot; and exposes a simple run command.&lt;/p&gt;
&lt;h3&gt;Step 2: Create the CSV Tool&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Let’s now define our first tool:&lt;/strong&gt; one that summarizes a CSV file.&lt;/p&gt;
&lt;p&gt;Create a new file called &lt;code&gt;csv_tools.py&lt;/code&gt; inside the &lt;code&gt;tools/&lt;/code&gt; folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;touch tools/csv_tools.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then add the following:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# tools/csv_tools.py

from server import mcp
from utils.file_reader import read_csv_summary

@mcp.tool()
def summarize_csv_file(filename: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;
    Summarize a CSV file by reporting its number of rows and columns.

    Args:
        filename: Name of the CSV file in the /data directory (e.g., &apos;sample.csv&apos;)

    Returns:
        A string describing the file&apos;s dimensions.
    &amp;quot;&amp;quot;&amp;quot;
    return read_csv_summary(filename)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Create the Parquet Tool&lt;/h3&gt;
&lt;p&gt;Now let’s do the same for a Parquet file.&lt;/p&gt;
&lt;p&gt;Create a file called &lt;code&gt;parquet_tools.py&lt;/code&gt; inside the &lt;code&gt;tools/&lt;/code&gt; folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;touch tools/parquet_tools.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And add:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# tools/parquet_tools.py

from server import mcp
from utils.file_reader import read_parquet_summary

@mcp.tool()
def summarize_parquet_file(filename: str) -&amp;gt; str:
    &amp;quot;&amp;quot;&amp;quot;
    Summarize a Parquet file by reporting its number of rows and columns.

    Args:
        filename: Name of the Parquet file in the /data directory (e.g., &apos;sample.parquet&apos;)

    Returns:
        A string describing the file&apos;s dimensions.
    &amp;quot;&amp;quot;&amp;quot;
    return read_parquet_summary(filename)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Register the Tools&lt;/h3&gt;
&lt;p&gt;Since the tools are registered via decorators at import time, we just need to make sure &lt;code&gt;main.py&lt;/code&gt; imports the tool modules. Update &lt;code&gt;main.py&lt;/code&gt; as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# main.py

from server import mcp

# Import tools so they get registered via decorators
import tools.csv_tools
import tools.parquet_tools

# Entry point to run the server
if __name__ == &amp;quot;__main__&amp;quot;:
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, whenever the server runs, it automatically registers all tools via the @mcp.tool() decorators.&lt;/p&gt;
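&lt;p&gt;If you&apos;re curious why simply importing a module is enough to register its tools, here is a minimal, self-contained illustration of the decorator-registration pattern (a simplified stand-in for this mechanism, &lt;strong&gt;not&lt;/strong&gt; the actual &lt;code&gt;mcp&lt;/code&gt; SDK):&lt;/p&gt;

```python
# Minimal illustration of decorator-based registration
# (a simplified stand-in, NOT the actual mcp SDK).

class ToolRegistry:
    """Collects functions registered with the @tool() decorator."""

    def __init__(self, name: str):
        self.name = name
        self.tools: dict = {}

    def tool(self):
        def decorator(func):
            # This runs at import time: merely importing a module that
            # uses @mcp.tool() adds its functions to the registry.
            self.tools[func.__name__] = func
            return func
        return decorator


mcp = ToolRegistry("mix_server")


@mcp.tool()
def summarize_csv_file(filename: str) -> str:
    return f"Summary of {filename}"


print(sorted(mcp.tools))  # ['summarize_csv_file']
```

&lt;p&gt;This is why the import lines in &lt;code&gt;main.py&lt;/code&gt; have a side effect even though nothing from those modules is referenced directly.&lt;/p&gt;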
&lt;p&gt;Your tools are now live! In the next section, we’ll walk through how to run the server and connect it to Claude for Desktop so you can test them out in natural language.&lt;/p&gt;
&lt;h2&gt;Running and Testing Your MCP Server with Claude for Desktop&lt;/h2&gt;
&lt;p&gt;At this point, you’ve built a functional MCP server with two tools: one for reading CSV files and another for Parquet. Now it’s time to bring it to life and connect it to &lt;strong&gt;Claude for Desktop&lt;/strong&gt;, so you can start running your tools using plain English.&lt;/p&gt;
&lt;h3&gt;Step 1: Run the Server&lt;/h3&gt;
&lt;p&gt;Let’s start your server locally.&lt;/p&gt;
&lt;p&gt;In your project root (where &lt;code&gt;server.py&lt;/code&gt; lives), run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;uv run main.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This starts your MCP server using the tools you defined. You won’t see much output in the terminal just yet—that’s normal. Your server is now waiting for a connection from a client like Claude.&lt;/p&gt;
&lt;h3&gt;Step 2: Install Claude for Desktop (If You Haven’t Already)&lt;/h3&gt;
&lt;p&gt;You’ll need Claude for Desktop installed to connect to your server.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Download it here:&lt;/strong&gt; https://www.anthropic.com/claude&lt;/p&gt;
&lt;p&gt;Follow the installation instructions for your operating system.&lt;/p&gt;
&lt;p&gt;Note: As of now, Claude for Desktop is not available on Linux. If you’re on Linux, skip ahead to the section on building your own MCP client.&lt;/p&gt;
&lt;h3&gt;Step 3: Configure Claude to Use Your Server&lt;/h3&gt;
&lt;p&gt;Claude needs to know where to find your MCP server. You’ll do this by editing a small config file on your system.&lt;/p&gt;
&lt;h4&gt;macOS / Linux:&lt;/h4&gt;
&lt;p&gt;Open this file in your code editor (create it if it doesn’t exist):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;code ~/Library/Application\ Support/Claude/claude_desktop_config.json
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Windows:&lt;/h4&gt;
&lt;p&gt;The config file is located here:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-shell&quot;&gt;%APPDATA%\Claude\claude_desktop_config.json
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Add Your Server to the Config&lt;/h3&gt;
&lt;p&gt;Paste the following JSON into the file, replacing the &amp;quot;/ABSOLUTE/PATH/...&amp;quot; with the actual full path to your mix_server project folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;mcpServers&amp;quot;: {
    &amp;quot;mix_server&amp;quot;: {
      &amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;,
      &amp;quot;args&amp;quot;: [
        &amp;quot;--directory&amp;quot;,
        &amp;quot;/ABSOLUTE/PATH/TO/mix_server&amp;quot;,
        &amp;quot;run&amp;quot;,
        &amp;quot;main.py&amp;quot;
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tip: To find the absolute path:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On Mac/Linux:&lt;/strong&gt; Run &lt;code&gt;pwd&lt;/code&gt; in your terminal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;On Windows:&lt;/strong&gt; Run &lt;code&gt;cd&lt;/code&gt; with no arguments, or copy the full path from File Explorer.&lt;/p&gt;
&lt;p&gt;Make sure &lt;code&gt;uv&lt;/code&gt; is on your system PATH, or replace &lt;code&gt;&amp;quot;command&amp;quot;: &amp;quot;uv&amp;quot;&lt;/code&gt; with the full path to the &lt;code&gt;uv&lt;/code&gt; executable.&lt;/p&gt;
&lt;h3&gt;Step 5: Restart Claude for Desktop&lt;/h3&gt;
&lt;p&gt;Restart the app, and you should see a new tool icon (hammer) appear in the interface. Click it, and you’ll see your registered tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;summarize_csv_file&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;summarize_parquet_file&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These can now be called directly by the AI!&lt;/p&gt;
&lt;h3&gt;Step 6: Try It Out&lt;/h3&gt;
&lt;p&gt;Now try asking Claude something like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;quot;Summarize the CSV file named sample.csv.&amp;quot;&lt;/li&gt;
&lt;li&gt;&amp;quot;How many rows are in sample.parquet?&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Claude will detect the appropriate tool, call your server, and respond with the results—powered by the very Python code you wrote.&lt;/p&gt;
&lt;h3&gt;Troubleshooting Tips&lt;/h3&gt;
&lt;p&gt;If things don’t work right away, here are a few things to check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Make sure your &lt;code&gt;uv run main.py&lt;/code&gt; process is running and hasn&apos;t crashed&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ensure the file paths in your config JSON are correct&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Confirm that your data files (&lt;code&gt;sample.csv&lt;/code&gt;, &lt;code&gt;sample.parquet&lt;/code&gt;) exist in the &lt;code&gt;/data&lt;/code&gt; directory&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Check the Claude UI for error messages or tool-loading indicators&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You now have a working local AI toolchain powered by MCP! In the final section, we’ll do a quick recap and show how you can build on this template for more powerful tools.&lt;/p&gt;
&lt;h2&gt;Recap and Next Steps&lt;/h2&gt;
&lt;p&gt;Congratulations—you just built your first MCP server!&lt;/p&gt;
&lt;p&gt;Let’s take a moment to review what you’ve accomplished.&lt;/p&gt;
&lt;h3&gt;What You Built&lt;/h3&gt;
&lt;p&gt;By following this guide, you now have a fully working &lt;strong&gt;MCP server&lt;/strong&gt; that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Uses Python and the official &lt;code&gt;mcp&lt;/code&gt; SDK&lt;/li&gt;
&lt;li&gt;Reads real data from both &lt;strong&gt;CSV&lt;/strong&gt; and &lt;strong&gt;Parquet&lt;/strong&gt; files&lt;/li&gt;
&lt;li&gt;Exposes two custom &lt;strong&gt;MCP tools&lt;/strong&gt; that Claude for Desktop can call:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;summarize_csv_file&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;summarize_parquet_file&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Follows a clean, modular folder structure&lt;/li&gt;
&lt;li&gt;Runs locally using &lt;code&gt;uv&lt;/code&gt; and connects seamlessly to Claude for natural language interaction&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;You also learned how to:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Set up your Python project with &lt;code&gt;uv&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Manage dependencies cleanly&lt;/li&gt;
&lt;li&gt;Register and expose tools using the &lt;code&gt;@mcp.tool()&lt;/code&gt; decorator&lt;/li&gt;
&lt;li&gt;Wire everything together with Claude through a simple config file&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Where to Go From Here&lt;/h3&gt;
&lt;p&gt;This project was intentionally simple so you could focus on learning the structure and flow of an MCP server. But this is just the beginning.&lt;/p&gt;
&lt;p&gt;Here are a few ideas for extending this template:&lt;/p&gt;
&lt;h4&gt;1. &lt;strong&gt;Add More Advanced Tools&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Try building tools that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Filter rows based on a column value&lt;/li&gt;
&lt;li&gt;Return column names or data types&lt;/li&gt;
&lt;li&gt;Calculate statistics (mean, median, etc.)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;2. &lt;strong&gt;Use Resources&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Use &lt;code&gt;@mcp.resource()&lt;/code&gt; to expose static or dynamic data that Claude can pull into its context before making a decision.&lt;/p&gt;
&lt;h4&gt;3. &lt;strong&gt;Explore Prompts&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Create reusable interaction templates with &lt;code&gt;@mcp.prompt()&lt;/code&gt; to guide how Claude asks or responds.&lt;/p&gt;
&lt;h4&gt;4. &lt;strong&gt;Add Async Logic&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;If you’re pulling data from APIs or databases, consider making your tools async using &lt;code&gt;async def&lt;/code&gt;—fully supported by FastMCP.&lt;/p&gt;
&lt;h4&gt;5. &lt;strong&gt;Build Your Own Client&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Not using Claude? You can write your own MCP-compatible client using the SDK’s &lt;code&gt;ClientSession&lt;/code&gt; interface.&lt;/p&gt;
&lt;h3&gt;Share and Reuse&lt;/h3&gt;
&lt;p&gt;You now have a &lt;strong&gt;template&lt;/strong&gt; you can reuse for future projects. If you publish it on GitHub, others can fork it, extend it, and learn from it too.&lt;/p&gt;
&lt;p&gt;This isn’t just a demo—it’s the foundation of a toolchain where you can define your own AI-powered workflows and expose them to LLMs in a controlled, modular way.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Using Helm with Kubernetes - A Guide to Helm Charts and Their Implementation</title><link>https://iceberglakehouse.com/posts/2025-02-using-helm-with-kubernetes/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-02-using-helm-with-kubernetes/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Wed, 19 Feb 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=using_helm_charts&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Managing applications in Kubernetes can be complex, requiring multiple YAML files to define resources such as Deployments, Services, ConfigMaps, and Secrets. As applications scale, maintaining and updating these configurations manually becomes cumbersome and error-prone. This is where &lt;strong&gt;Helm&lt;/strong&gt; comes in.&lt;/p&gt;
&lt;p&gt;Helm is a &lt;strong&gt;package manager for Kubernetes&lt;/strong&gt; that simplifies deployment by bundling application configurations into reusable, version-controlled &lt;strong&gt;Helm charts&lt;/strong&gt;. With Helm, you can deploy applications with a single command, manage updates seamlessly, and roll back to previous versions if needed.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Why Use Helm?&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplifies Deployments&lt;/strong&gt; – Deploy complex applications with a single command instead of managing multiple YAML files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parameterization &amp;amp; Reusability&lt;/strong&gt; – Configure deployments dynamically using &lt;code&gt;values.yaml&lt;/code&gt;, making it easy to manage multiple environments (dev, staging, prod).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control &amp;amp; Rollbacks&lt;/strong&gt; – Helm tracks deployments, allowing you to roll back to previous versions in case of failures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependency Management&lt;/strong&gt; – Install and manage application dependencies effortlessly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integration with CI/CD &amp;amp; GitOps&lt;/strong&gt; – Automate deployments with tools like &lt;strong&gt;ArgoCD&lt;/strong&gt;, &lt;strong&gt;FluxCD&lt;/strong&gt;, and &lt;strong&gt;GitHub Actions&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;What You&apos;ll Learn in This Guide&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In this blog, we’ll cover:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What Helm is and how it works&lt;/strong&gt; – Understanding its architecture and components.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Installing and configuring Helm&lt;/strong&gt; – Setting up Helm for your Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Understanding Helm charts&lt;/strong&gt; – Exploring chart structure, templates, and values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing your own Helm chart&lt;/strong&gt; – Step-by-step guide to creating a custom chart.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploying applications with Helm&lt;/strong&gt; – Installing, upgrading, and rolling back releases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Best practices for Helm in production&lt;/strong&gt; – Security, GitOps integration, and monitoring.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By the end of this guide, you&apos;ll have a strong foundation in Helm and be able to deploy, manage, and scale Kubernetes applications efficiently.&lt;/p&gt;
&lt;h2&gt;Understanding Helm: The Package Manager for Kubernetes&lt;/h2&gt;
&lt;h3&gt;What is Helm?&lt;/h3&gt;
&lt;p&gt;Helm is a &lt;strong&gt;package manager for Kubernetes&lt;/strong&gt; that helps deploy, configure, and manage applications in a Kubernetes cluster. Instead of manually writing and applying multiple Kubernetes YAML manifests, Helm allows you to package them into reusable &lt;strong&gt;Helm Charts&lt;/strong&gt;, simplifying deployment and maintenance.&lt;/p&gt;
&lt;h3&gt;Why Use Helm?&lt;/h3&gt;
&lt;p&gt;Managing Kubernetes resources can become complex, especially when deploying applications with multiple components (Deployments, Services, ConfigMaps, Secrets, etc.). Helm provides several advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplifies Deployments&lt;/strong&gt; – Automates the process of applying multiple YAML files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioning &amp;amp; Rollbacks&lt;/strong&gt; – Tracks different versions of deployments and allows rollback if necessary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parameterization &amp;amp; Reusability&lt;/strong&gt; – Uses a templating system (&lt;code&gt;values.yaml&lt;/code&gt;) to customize deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependency Management&lt;/strong&gt; – Simplifies installing and upgrading application dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistent Configuration Across Environments&lt;/strong&gt; – Makes it easy to manage different configurations for dev, staging, and production.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How Does Helm Compare to Traditional Kubernetes Manifests?&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Kubernetes YAML Manifests&lt;/th&gt;
&lt;th&gt;Helm Charts&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Management&lt;/td&gt;
&lt;td&gt;Requires manually applying multiple YAML files&lt;/td&gt;
&lt;td&gt;Uses a single Helm command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration&lt;/td&gt;
&lt;td&gt;Static YAML definitions&lt;/td&gt;
&lt;td&gt;Dynamic templating via &lt;code&gt;values.yaml&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Version Control&lt;/td&gt;
&lt;td&gt;Difficult to track changes manually&lt;/td&gt;
&lt;td&gt;Built-in versioning &amp;amp; rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reusability&lt;/td&gt;
&lt;td&gt;Limited; each deployment needs its own YAML&lt;/td&gt;
&lt;td&gt;Reusable and configurable charts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;Managed manually&lt;/td&gt;
&lt;td&gt;Handled via &lt;code&gt;requirements.yaml&lt;/code&gt; (deprecated) or &lt;code&gt;Chart.yaml&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;How Helm Works&lt;/h2&gt;
&lt;h3&gt;Helm Components and Architecture&lt;/h3&gt;
&lt;p&gt;Helm follows a client-only architecture in &lt;strong&gt;Helm v3&lt;/strong&gt;, where it directly interacts with the Kubernetes API server without requiring a backend component like Tiller (which was used in Helm v2). Below are the core components of Helm:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Helm CLI&lt;/strong&gt; – The command-line interface used to manage Helm charts, releases, and repositories.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Charts&lt;/strong&gt; – Packaged Kubernetes applications that define resources like Deployments, Services, ConfigMaps, and Secrets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Repository&lt;/strong&gt; – A collection of Helm charts stored in a remote or local location (e.g., &lt;a href=&quot;https://artifacthub.io/&quot;&gt;Artifact Hub&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Release&lt;/strong&gt; – A deployed instance of a Helm chart, stored as metadata inside the Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes API Server&lt;/strong&gt; – Helm interacts with the Kubernetes API to apply resources as defined in the chart.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Helm Workflow: How Helm Manages Deployments&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Fetching Charts&lt;/strong&gt; – Helm can pull pre-built charts from repositories using &lt;code&gt;helm repo add&lt;/code&gt; and &lt;code&gt;helm search repo&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Templating and Rendering&lt;/strong&gt; – Helm dynamically replaces values in the YAML templates using the &lt;code&gt;values.yaml&lt;/code&gt; file before applying them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Creating a Release&lt;/strong&gt; – When a Helm chart is installed, Helm assigns it a unique &lt;strong&gt;release name&lt;/strong&gt; and applies the rendered templates to the Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Versioning and Rollbacks&lt;/strong&gt; – Helm maintains a history of releases, allowing easy upgrades (&lt;code&gt;helm upgrade&lt;/code&gt;) and rollbacks (&lt;code&gt;helm rollback&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Uninstalling Releases&lt;/strong&gt; – Helm can remove all associated Kubernetes resources using &lt;code&gt;helm uninstall&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Helm Command Lifecycle&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm repo add &amp;lt;repo-name&amp;gt; &amp;lt;repo-url&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Adds a Helm chart repository&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm search repo &amp;lt;keyword&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Searches for a chart in repositories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm install &amp;lt;release-name&amp;gt; &amp;lt;chart-name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Installs a Helm chart and creates a release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lists all active Helm releases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm status &amp;lt;release-name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Shows details of a deployed release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm upgrade &amp;lt;release-name&amp;gt; &amp;lt;chart-name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Upgrades an existing release to a new chart version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm rollback &amp;lt;release-name&amp;gt; &amp;lt;revision&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rolls back a release to a previous version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;helm uninstall &amp;lt;release-name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deletes a release and removes associated resources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Helm in Action: A Simple Example&lt;/h3&gt;
&lt;p&gt;Let&apos;s say you want to deploy &lt;strong&gt;NGINX&lt;/strong&gt; using Helm. You can do this with a single command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-nginx bitnami/nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Adds the Bitnami Helm repository.&lt;/li&gt;
&lt;li&gt;Installs the NGINX Helm chart from the Bitnami repository.&lt;/li&gt;
&lt;li&gt;Creates a Helm release named my-nginx in the cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To check the status of the deployment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
helm status my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To uninstall the release:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm uninstall my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Installing and Configuring Helm&lt;/h2&gt;
&lt;p&gt;Before using Helm, you need to install it on your local machine and configure it to work with your Kubernetes cluster. This section will walk through the installation process and initial setup.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Kubernetes cluster&lt;/strong&gt; running locally (e.g., Minikube, Kind) or in the cloud (e.g., AKS, GKE, EKS).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubectl&lt;/code&gt; installed and configured to communicate with your cluster.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Installing Helm&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Helm can be installed on macOS, Linux, and Windows using various package managers.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;macOS (Using Homebrew)&lt;/strong&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;brew install helm
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Linux (Using Script)&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Windows (Using Chocolatey)&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;choco install kubernetes-helm
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verifying the Installation&lt;/h3&gt;
&lt;p&gt;After installation, verify that Helm is installed correctly by running:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see output similar to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;version.BuildInfo{Version:&amp;quot;v3.x.x&amp;quot;, GitCommit:&amp;quot;...&amp;quot;, GitTreeState:&amp;quot;clean&amp;quot;, GoVersion:&amp;quot;...&amp;quot;}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Configuring Helm&lt;/h3&gt;
&lt;h4&gt;Adding a Helm Repository&lt;/h4&gt;
&lt;p&gt;Helm uses repositories to store charts. You can add a popular repository, such as the Bitnami Helm charts, using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add bitnami https://charts.bitnami.com/bitnami
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To confirm the repository has been added:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo list
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Updating Helm Repositories&lt;/h4&gt;
&lt;p&gt;To fetch the latest charts from all added repositories, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo update
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Searching for Helm Charts&lt;/h4&gt;
&lt;p&gt;To search for a specific application within your configured repositories:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm search repo nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Installing a Helm Chart&lt;/h4&gt;
&lt;p&gt;Once Helm is set up, you can deploy an application. For example, to deploy NGINX using the Bitnami Helm chart:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx bitnami/nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download the NGINX chart.&lt;/li&gt;
&lt;li&gt;Deploy the necessary Kubernetes resources.&lt;/li&gt;
&lt;li&gt;Assign the release name my-nginx.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Checking the Installation&lt;/h4&gt;
&lt;p&gt;List all active Helm releases:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Check the status of a specific release:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm status my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Uninstalling a Helm Release&lt;/h4&gt;
&lt;p&gt;To remove the my-nginx release and all associated resources:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm uninstall my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Understanding Helm Charts&lt;/h2&gt;
&lt;h3&gt;What is a Helm Chart?&lt;/h3&gt;
&lt;p&gt;A &lt;strong&gt;Helm chart&lt;/strong&gt; is a packaged application definition that contains Kubernetes resource templates and default configuration values. It allows you to deploy complex applications with a single command while keeping configurations modular and reusable.&lt;/p&gt;
&lt;p&gt;Each chart defines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What Kubernetes resources to deploy&lt;/strong&gt; (e.g., Deployments, Services, ConfigMaps).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How those resources should be configured&lt;/strong&gt; using a parameterized values file (&lt;code&gt;values.yaml&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dependencies and metadata&lt;/strong&gt; required for installation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Structure of a Helm Chart&lt;/h3&gt;
&lt;p&gt;When you create a Helm chart, it follows a specific directory structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mychart/
│── charts/           # Directory for chart dependencies (other charts)
│── templates/        # Contains Kubernetes YAML templates
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── _helpers.tpl  # Contains reusable template functions
│── Chart.yaml        # Metadata about the chart (name, version, description)
│── values.yaml       # Default configuration values for the chart
│── README.md         # Documentation about the chart

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each file in this structure serves a specific purpose:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;Chart.yaml&lt;/code&gt;&lt;/strong&gt; – Contains metadata such as chart name, version, and description.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;values.yaml&lt;/code&gt;&lt;/strong&gt; – Defines default values that can be overridden during installation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;templates/&lt;/code&gt;&lt;/strong&gt; – Holds Kubernetes manifest templates using Helm’s templating syntax.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;charts/&lt;/code&gt;&lt;/strong&gt; – Stores dependencies (other charts required for deployment).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/strong&gt; – Documents how to use the chart.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example: &lt;code&gt;Chart.yaml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;Chart.yaml&lt;/code&gt; file provides information about the chart:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: v2
name: mychart
description: A sample Helm chart for Kubernetes
type: application
version: 1.0.0
appVersion: 1.16.0
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;name&lt;/code&gt;&lt;/strong&gt; – The chart&apos;s name.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;description&lt;/code&gt;&lt;/strong&gt; – A brief description of what the chart does.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;version&lt;/code&gt;&lt;/strong&gt; – The chart version (used for versioning updates).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;appVersion&lt;/code&gt;&lt;/strong&gt; – The application version the chart deploys.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example: &lt;code&gt;values.yaml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The values.yaml file defines default configuration values:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 2

image:
  repository: nginx
  tag: latest
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These values can be overridden when installing the chart using the &lt;code&gt;--set&lt;/code&gt; flag or a custom values file.&lt;/p&gt;
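&lt;p&gt;For example, to run a single replica with a pinned image tag in a dev environment, you might keep a small override file (the filename &lt;code&gt;values-dev.yaml&lt;/code&gt; is illustrative) containing only the keys you want to change:&lt;/p&gt;

```yaml
# values-dev.yaml (illustrative override file)
# Any key not listed here falls back to the chart's values.yaml.
replicaCount: 1

image:
  tag: "1.25"   # pin a specific tag instead of latest
```

&lt;p&gt;Then install with &lt;code&gt;helm install my-nginx ./mychart -f values-dev.yaml&lt;/code&gt;, or override a single key inline with &lt;code&gt;helm install my-nginx ./mychart --set replicaCount=1&lt;/code&gt;.&lt;/p&gt;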
&lt;h3&gt;Example: &lt;code&gt;templates/deployment.yaml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;A sample Kubernetes Deployment template using Helm&apos;s templating syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: &amp;quot;{{ .Values.image.repository }}:{{ .Values.image.tag }}&amp;quot;
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.port }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this template:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{{ .Release.Name }}&lt;/code&gt; dynamically sets the release name.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{{ .Values.replicaCount }}&lt;/code&gt; pulls values from values.yaml.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{{ .Values.image.repository }}:{{ .Values.image.tag }}&lt;/code&gt; sets the container image dynamically.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Rendering Helm Templates&lt;/h3&gt;
&lt;p&gt;Before applying a Helm chart, you can preview how the templates will render using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm template mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Writing Your Own Helm Chart&lt;/h2&gt;
&lt;p&gt;Now that we understand Helm charts and their structure, let’s walk through the process of creating a custom Helm chart from scratch.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Step 1: Create a New Helm Chart&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;To generate a new Helm chart, use the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm create mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command creates a new directory mychart/ with the standard Helm chart structure.&lt;/p&gt;
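&lt;p&gt;The generated layout looks roughly like the following (the exact file set may vary slightly between Helm versions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;mychart/
├── Chart.yaml          # Chart metadata
├── values.yaml         # Default configuration values
├── charts/             # Subchart dependencies
└── templates/          # Kubernetes manifest templates
    ├── deployment.yaml
    ├── service.yaml
    ├── _helpers.tpl    # Reusable template helpers
    └── NOTES.txt       # Usage notes printed after install
&lt;/code&gt;&lt;/pre&gt;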
&lt;h3&gt;&lt;strong&gt;Step 2: Modify values.yaml&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Open &lt;code&gt;values.yaml&lt;/code&gt; and update it with custom values. Let’s modify it to deploy an NGINX web server with a LoadBalancer service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 3

image:
  repository: nginx
  tag: latest
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;replicaCount:&lt;/strong&gt; Defines how many replicas the deployment will create.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;image:&lt;/strong&gt; Configures the container image.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;service:&lt;/strong&gt; Sets the service type and port.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 3: Customize Deployment Template&lt;/h3&gt;
&lt;p&gt;Edit &lt;code&gt;templates/deployment.yaml&lt;/code&gt; to use Helm’s templating syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-nginx
  labels:
    app: nginx
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: &amp;quot;{{ .Values.image.repository }}:{{ .Values.image.tag }}&amp;quot;
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.port }}
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;{{ .Release.Name }}&lt;/code&gt; dynamically assigns the release name.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{{ .Values.replicaCount }}&lt;/code&gt; references values from values.yaml.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;{{ .Values.image.repository }}:{{ .Values.image.tag }}&lt;/code&gt; configures the image dynamically.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 4: Customize the Service Template&lt;/h3&gt;
&lt;p&gt;Edit &lt;code&gt;templates/service.yaml&lt;/code&gt; to configure the service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-nginx
spec:
  type: {{ .Values.service.type }}
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: {{ .Values.service.port }}
      targetPort: {{ .Values.service.port }}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 5: Package the Helm Chart&lt;/h3&gt;
&lt;p&gt;Once you&apos;ve modified the necessary files, package the chart:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm package mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This creates a &lt;code&gt;.tgz&lt;/code&gt; archive of the chart, making it ready for distribution.&lt;/p&gt;
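&lt;p&gt;The archive name is derived from the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;version&lt;/code&gt; fields in &lt;code&gt;Chart.yaml&lt;/code&gt;; with version &lt;code&gt;1.0.0&lt;/code&gt; you would see:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;ls
# mychart-1.0.0.tgz
&lt;/code&gt;&lt;/pre&gt;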
&lt;h3&gt;Step 6: Install the Chart&lt;/h3&gt;
&lt;p&gt;Deploy the chart to your Kubernetes cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx ./mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Parses templates.&lt;/li&gt;
&lt;li&gt;Replaces placeholders with values from values.yaml.&lt;/li&gt;
&lt;li&gt;Applies the resources to Kubernetes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 7: Verify the Deployment&lt;/h3&gt;
&lt;p&gt;Check the deployed resources:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
kubectl get pods
kubectl get svc
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 8: Uninstall the Chart&lt;/h3&gt;
&lt;p&gt;To remove the deployment, use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm uninstall my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Deploying Applications with Helm&lt;/h2&gt;
&lt;p&gt;Once you&apos;ve created or downloaded a Helm chart, you can use Helm to deploy and manage applications in your Kubernetes cluster. This section will walk through the deployment process, including installation, upgrades, rollbacks, and uninstallation.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Step 1: Installing a Helm Chart&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;To deploy an application using Helm, use the &lt;code&gt;helm install&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx ./mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;my-nginx&lt;/code&gt; is the release name (a unique identifier for this deployment).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;./mychart&lt;/code&gt; is the path to the Helm chart.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you are installing a chart from a repository, such as Bitnami, use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-nginx bitnami/nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pulls the nginx chart from the Bitnami repository.&lt;/li&gt;
&lt;li&gt;Deploys NGINX to the Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;Creates a Helm release named my-nginx.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 2: Verifying the Deployment&lt;/h3&gt;
&lt;p&gt;Once the chart is installed, verify that the release is active:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will output something like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;NAME        NAMESPACE   REVISION    UPDATED                  STATUS      CHART        APP VERSION
my-nginx    default     1           2024-02-16 10:00:00     deployed    nginx-1.2.3  1.21.6
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can check the detailed status of a release:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm status my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To view the created Kubernetes resources:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;kubectl get pods
kubectl get svc
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Customizing Helm Releases&lt;/h3&gt;
&lt;p&gt;Helm allows you to override default values using the &lt;code&gt;--set&lt;/code&gt; flag or a custom values file.&lt;/p&gt;
&lt;h4&gt;Using the &lt;code&gt;--set&lt;/code&gt; Flag&lt;/h4&gt;
&lt;p&gt;You can override individual values like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx bitnami/nginx --set replicaCount=3
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Using a Custom values.yaml File&lt;/h4&gt;
&lt;p&gt;To provide multiple custom values, create a &lt;code&gt;my-values.yaml&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 3
service:
  type: LoadBalancer
  port: 8080
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, deploy the chart with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-nginx bitnami/nginx -f my-values.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Upgrading a Helm Release&lt;/h3&gt;
&lt;p&gt;If you need to modify a running deployment, use the helm upgrade command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm upgrade my-nginx bitnami/nginx --set replicaCount=5
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To upgrade using a modified values file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm upgrade my-nginx bitnami/nginx -f my-values.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This updates the deployment while keeping existing resources intact.&lt;/p&gt;
&lt;h3&gt;Step 5: Rolling Back to a Previous Version&lt;/h3&gt;
&lt;p&gt;Helm maintains a history of releases, allowing you to roll back if needed.&lt;/p&gt;
&lt;p&gt;List the release history:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm history my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Roll back to a specific revision:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm rollback my-nginx 1
&lt;/code&gt;&lt;/pre&gt;
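&lt;p&gt;Omitting the revision number rolls back to the immediately previous revision:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm rollback my-nginx
&lt;/code&gt;&lt;/pre&gt;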
&lt;h3&gt;Step 6: Uninstalling a Helm Release&lt;/h3&gt;
&lt;p&gt;To remove a Helm deployment and all its associated resources, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm uninstall my-nginx
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To confirm deletion:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list
kubectl get all
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Helm Best Practices&lt;/h2&gt;
&lt;p&gt;Using Helm effectively requires following best practices to ensure maintainability, security, and scalability of deployments. This section outlines key strategies for optimizing Helm usage in production environments.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Organizing Values in &lt;code&gt;values.yaml&lt;/code&gt; for Clarity&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;A well-structured &lt;code&gt;values.yaml&lt;/code&gt; file improves readability and maintainability.&lt;/p&gt;
&lt;h4&gt;✅ &lt;strong&gt;Good Example: Structured and Documented&lt;/strong&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 3  # Number of replicas for high availability

image:
  repository: nginx
  tag: latest
  pullPolicy: IfNotPresent  # Pull policy to optimize image fetching

service:
  type: LoadBalancer
  port: 80  # Publicly exposed service port

resources:
  limits:
    cpu: 500m
    memory: 256Mi
  requests:
    cpu: 250m
    memory: 128Mi
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;❌ Bad Example: Unstructured and Unclear&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;replicaCount: 3
image: nginx:latest
serviceType: LoadBalancer
port: 80
cpu: 500m
memory: 256Mi
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;No clear nesting.&lt;/li&gt;
&lt;li&gt;Missing descriptions for future maintainers.&lt;/li&gt;
&lt;li&gt;Harder to override values at a granular level.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Using helm dependency for Managing Dependencies&lt;/h3&gt;
&lt;p&gt;If your chart depends on other charts (e.g., a database), declare them in Chart.yaml:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;dependencies:
  - name: postgresql
    version: &amp;quot;12.1.3&amp;quot;
    repository: &amp;quot;https://charts.bitnami.com/bitnami&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, update dependencies before installing:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm dependency update ./mychart
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that all required subcharts are installed and properly versioned.&lt;/p&gt;
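&lt;p&gt;Values for a declared dependency are set under a key matching the subchart&apos;s name in the parent chart&apos;s &lt;code&gt;values.yaml&lt;/code&gt;. A sketch for the PostgreSQL dependency above (the exact keys depend on the subchart&apos;s own &lt;code&gt;values.yaml&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# values.yaml of the parent chart
postgresql:
  auth:
    database: myapp
  primary:
    persistence:
      size: 10Gi
&lt;/code&gt;&lt;/pre&gt;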
&lt;h3&gt;3. Leveraging helm secrets for Sensitive Values&lt;/h3&gt;
&lt;p&gt;Avoid storing credentials in values.yaml. Instead, use Helm Secrets to encrypt sensitive values.&lt;/p&gt;
&lt;p&gt;Install the Helm Secrets plugin:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm plugin install https://github.com/zachomedia/helm-secrets
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Encrypt sensitive values using SOPS:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;sops --encrypt --in-place my-values.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Install a chart using the encrypted values file. The plugin wraps the Helm CLI, so decryption happens transparently when you invoke &lt;code&gt;helm secrets&lt;/code&gt; instead of plain &lt;code&gt;helm&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm secrets install my-app ./mychart -f my-values.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures secrets are not stored in plaintext inside version control.&lt;/p&gt;
&lt;h3&gt;4. Automating Helm Deployments in CI/CD Pipelines&lt;/h3&gt;
&lt;p&gt;Integrate Helm with CI/CD tools like GitHub Actions, GitLab CI/CD, or ArgoCD to automate deployments.&lt;/p&gt;
&lt;h4&gt;Example GitHub Actions Workflow for Helm&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Deploy Helm Chart

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Install Helm
        run: |
          curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

      - name: Deploy to Kubernetes
        run: |
          helm upgrade --install my-app ./mychart --namespace prod
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This automates deployments whenever code is pushed to the main branch.&lt;/p&gt;
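&lt;p&gt;Note that the workflow above assumes the runner can already reach your cluster. In practice you would add a step that writes cluster credentials from a repository secret before deploying (the secret name &lt;code&gt;KUBECONFIG_DATA&lt;/code&gt; below is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;      - name: Configure kubeconfig
        run: |
          mkdir -p ~/.kube
          echo &amp;quot;${{ secrets.KUBECONFIG_DATA }}&amp;quot; | base64 -d &gt; ~/.kube/config
&lt;/code&gt;&lt;/pre&gt;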
&lt;h3&gt;5. Keeping Charts Versioned and Documented&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use semantic versioning in &lt;code&gt;Chart.yaml&lt;/code&gt; (version: 1.2.0).&lt;/li&gt;
&lt;li&gt;Document all available values in &lt;code&gt;README.md&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Maintain a &lt;code&gt;CHANGELOG.md&lt;/code&gt; to track modifications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Managing Multiple Environments (Dev, Staging, Prod)&lt;/h3&gt;
&lt;p&gt;Helm allows environment-specific values with separate values files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm install my-app ./mychart -f values-dev.yaml
helm install my-app ./mychart -f values-prod.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures different configurations for testing and production.&lt;/p&gt;
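&lt;p&gt;The environment files typically differ in only a few keys. A minimal sketch (file contents are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# values-dev.yaml
replicaCount: 1
service:
  type: ClusterIP

# values-prod.yaml
replicaCount: 5
service:
  type: LoadBalancer
&lt;/code&gt;&lt;/pre&gt;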
&lt;h3&gt;7. Helm Security Considerations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Avoid running Helm with cluster-wide privileges.&lt;/li&gt;
&lt;li&gt;Restrict Helm Release Names to prevent namespace conflicts.&lt;/li&gt;
&lt;li&gt;Use RBAC policies to limit Helm access.&lt;/li&gt;
&lt;li&gt;Regularly update Helm and chart dependencies to patch vulnerabilities.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Organize values.yaml clearly for maintainability.&lt;/li&gt;
&lt;li&gt;Use helm dependency to manage subcharts.&lt;/li&gt;
&lt;li&gt;Secure sensitive values with helm secrets and encryption.&lt;/li&gt;
&lt;li&gt;Automate Helm deployments using CI/CD.&lt;/li&gt;
&lt;li&gt;Maintain versioning, documentation, and separate environments.&lt;/li&gt;
&lt;li&gt;Follow security best practices to protect Kubernetes resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next section, we’ll discuss Helm’s role in large-scale production deployments and how to integrate it with GitOps tools like ArgoCD and Flux.&lt;/p&gt;
&lt;h2&gt;Helm in Production: Managing Complexity at Scale&lt;/h2&gt;
&lt;p&gt;As organizations scale their Kubernetes deployments, managing Helm charts effectively in production becomes crucial. This section explores how Helm integrates with GitOps tools, supports multi-environment management, and follows best practices for high availability and security.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Using GitOps with Helm (ArgoCD &amp;amp; Flux)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;GitOps&lt;/strong&gt; enables declarative infrastructure management, where Helm charts are stored in Git repositories and automatically deployed using tools like &lt;strong&gt;ArgoCD&lt;/strong&gt; and &lt;strong&gt;Flux&lt;/strong&gt;.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Deploying Helm Charts with ArgoCD&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;ArgoCD monitors a Git repository and applies changes automatically.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Install ArgoCD&lt;/strong&gt;:&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Deploy a Helm Chart with ArgoCD:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-helm-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/helm-charts.git
    targetRevision: main
    path: mychart
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: prod
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply the application manifest:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;kubectl apply -f my-helm-app.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;ArgoCD will now continuously sync the Helm chart with the Kubernetes cluster.&lt;/p&gt;
&lt;h4&gt;Using FluxCD for Helm Deployments&lt;/h4&gt;
&lt;p&gt;FluxCD can also automate Helm deployments:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;flux create source git my-helm-repo \
  --url=https://github.com/my-org/helm-charts.git \
  --branch=main

flux create helmrelease my-app \
  --source=GitRepository/my-helm-repo \
  --chart=mychart \
  --namespace=prod
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;GitOps&lt;/strong&gt; ensures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automated rollouts &amp;amp; rollbacks when changes are pushed to Git.&lt;/li&gt;
&lt;li&gt;Version-controlled infrastructure for reproducibility.&lt;/li&gt;
&lt;li&gt;Improved collaboration by managing Helm charts as code.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Managing Multi-Cluster Deployments&lt;/h3&gt;
&lt;p&gt;For enterprises running multiple Kubernetes clusters (e.g., dev, staging, prod), Helm enables consistent deployments across environments.&lt;/p&gt;
&lt;h4&gt;Option 1: Context Switching with kubectl&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;kubectl config use-context dev-cluster
helm install my-app ./mychart --namespace dev

kubectl config use-context prod-cluster
helm install my-app ./mychart --namespace prod
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Option 2: Using Helmfile for Multi-Cluster Deployments&lt;/h4&gt;
&lt;p&gt;Helmfile allows managing multiple Helm releases in a declarative format.&lt;/p&gt;
&lt;p&gt;Example helmfile.yaml:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;releases:
  - name: my-app-dev
    namespace: dev
    chart: ./mychart
    values:
      - values-dev.yaml

  - name: my-app-prod
    namespace: prod
    chart: ./mychart
    values:
      - values-prod.yaml
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Deploy all environments at once:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helmfile apply
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Ensuring High Availability and Reliability&lt;/h3&gt;
&lt;p&gt;Use Helm Hooks: Automate pre-install and post-install tasks.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;annotations:
  &amp;quot;helm.sh/hook&amp;quot;: pre-install
&lt;/code&gt;&lt;/pre&gt;
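&lt;p&gt;Hooks are regular manifests that carry the hook annotation. A minimal pre-install Job sketch (the image and command are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-db-migrate
  annotations:
    &amp;quot;helm.sh/hook&amp;quot;: pre-install
    &amp;quot;helm.sh/hook-delete-policy&amp;quot;: hook-succeeded  # clean up the Job on success
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: busybox
          command: [&amp;quot;sh&amp;quot;, &amp;quot;-c&amp;quot;, &amp;quot;echo running migrations&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;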
&lt;p&gt;Enable Readiness and Liveness Probes to ensure application health:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;readinessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 10
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Use Rolling Updates with strategy to prevent downtime:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Helm Security Best Practices for Production&lt;/h3&gt;
&lt;p&gt;Restrict Helm Permissions using Role-Based Access Control (RBAC):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: prod
  name: helm-user
rules:
  - apiGroups: [&amp;quot;&amp;quot;, &amp;quot;apps&amp;quot;]  # core group for services, apps for deployments
    resources: [&amp;quot;deployments&amp;quot;, &amp;quot;services&amp;quot;]
    verbs: [&amp;quot;get&amp;quot;, &amp;quot;list&amp;quot;, &amp;quot;create&amp;quot;, &amp;quot;update&amp;quot;, &amp;quot;delete&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
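&lt;p&gt;A Role by itself grants nothing; bind it to the user or service account that runs Helm (the subject name below is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: prod
  name: helm-user-binding
subjects:
  - kind: ServiceAccount
    name: helm-deployer
    namespace: prod
roleRef:
  kind: Role
  name: helm-user
  apiGroup: rbac.authorization.k8s.io
&lt;/code&gt;&lt;/pre&gt;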
&lt;h4&gt;Avoid Storing Secrets in values.yaml:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Use Kubernetes Secrets and refer to them in Helm templates.&lt;/li&gt;
&lt;li&gt;Encrypt secrets with SOPS or use the External Secrets Operator.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Implement Image Scanning:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Use tools like Trivy or Anchore to scan Helm charts and container images.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Regularly Update Helm and Charts:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Ensure Helm CLI and chart dependencies are up to date.&lt;/li&gt;
&lt;li&gt;Use helm dependency update to pull the latest versions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Monitoring and Logging Helm Deployments&lt;/h3&gt;
&lt;p&gt;Track Helm Releases:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm list --all-namespaces
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Monitor Deployments with Prometheus &amp;amp; Grafana:&lt;/p&gt;
&lt;p&gt;Install Prometheus using Helm:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Integrate with Grafana for dashboard visualization.&lt;/p&gt;
&lt;p&gt;Use Helm Logs to Debug Issues:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm get manifest my-app
helm get values my-app
helm get notes my-app
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;GitOps tools (ArgoCD, Flux) enable automated Helm deployments.&lt;/li&gt;
&lt;li&gt;Multi-cluster management can be streamlined with Helmfile or Helm contexts.&lt;/li&gt;
&lt;li&gt;High availability practices ensure smooth rolling updates and failovers.&lt;/li&gt;
&lt;li&gt;Security best practices include using RBAC, encrypted secrets, and image scanning.&lt;/li&gt;
&lt;li&gt;Monitoring tools like Prometheus and Grafana help track Helm deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion and Next Steps&lt;/h2&gt;
&lt;p&gt;Helm simplifies Kubernetes application deployment, making it easier to manage complex workloads with reusable, version-controlled charts. By leveraging Helm, teams can standardize configurations, automate deployments, and integrate with GitOps workflows to achieve reliable and scalable Kubernetes operations.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Helm is the Kubernetes Package Manager&lt;/strong&gt; – It streamlines application deployments by packaging Kubernetes resources into reusable Helm charts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Charts Provide Flexibility&lt;/strong&gt; – Using &lt;code&gt;values.yaml&lt;/code&gt;, teams can easily override configurations without modifying templates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Supports Versioning &amp;amp; Rollbacks&lt;/strong&gt; – The ability to upgrade and roll back releases ensures stability and rapid recovery.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation &amp;amp; CI/CD Integration&lt;/strong&gt; – Helm works seamlessly with GitOps tools like &lt;strong&gt;ArgoCD&lt;/strong&gt; and &lt;strong&gt;FluxCD&lt;/strong&gt; to automate deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security &amp;amp; Best Practices Matter&lt;/strong&gt; – Implement &lt;strong&gt;RBAC&lt;/strong&gt;, use &lt;strong&gt;secrets management&lt;/strong&gt;, and ensure &lt;strong&gt;chart dependencies&lt;/strong&gt; are up to date to maintain a secure and efficient Helm workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring &amp;amp; Debugging Are Essential&lt;/strong&gt; – Use &lt;strong&gt;Prometheus&lt;/strong&gt;, &lt;strong&gt;Grafana&lt;/strong&gt;, and Helm’s built-in commands (&lt;code&gt;helm list&lt;/code&gt;, &lt;code&gt;helm get&lt;/code&gt;) to track deployments and troubleshoot issues.&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Next Steps: Continue Learning Helm&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Now that you understand Helm’s capabilities, here are some next steps to deepen your knowledge and practical experience:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Explore Official Helm Documentation&lt;/strong&gt;&lt;br&gt;
📌 &lt;a href=&quot;https://helm.sh/docs/&quot;&gt;Helm Docs&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deploy Real-World Applications with Helm&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Try deploying &lt;strong&gt;WordPress&lt;/strong&gt;, &lt;strong&gt;PostgreSQL&lt;/strong&gt;, or &lt;strong&gt;Redis&lt;/strong&gt; with Helm charts from &lt;a href=&quot;https://artifacthub.io/&quot;&gt;Artifact Hub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Example:&lt;pre&gt;&lt;code class=&quot;language-sh&quot;&gt;helm repo add bitnami https://charts.bitnami.com/bitnami
helm install my-wordpress bitnami/wordpress
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Experiment with Custom Helm Charts&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Modify an existing chart or build one from scratch.&lt;/li&gt;
&lt;li&gt;Deploy it to different environments using separate &lt;code&gt;values.yaml&lt;/code&gt; files.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integrate Helm with a CI/CD Pipeline&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set up GitHub Actions, GitLab CI/CD, or Jenkins to automate Helm deployments.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Learn Advanced Helm Features&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Helm Hooks&lt;/strong&gt;: Automate tasks before/after deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Subcharts&lt;/strong&gt;: Manage dependencies efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Secrets&lt;/strong&gt;: Encrypt sensitive configurations.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Follow Helm &amp;amp; Kubernetes Communities&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Join the &lt;strong&gt;CNCF Slack&lt;/strong&gt; (#helm-users channel).&lt;/li&gt;
&lt;li&gt;Follow Kubernetes and Helm GitHub discussions for the latest updates.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Helm is an essential tool for Kubernetes administrators and DevOps teams looking to optimize deployment workflows. Whether you are deploying simple microservices or complex cloud-native applications, Helm provides the flexibility, automation, and reliability needed to scale efficiently.&lt;/p&gt;
&lt;p&gt;Start experimenting with Helm today and take your Kubernetes skills to the next level!&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Additional Resources&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Helm Charts Repository&lt;/strong&gt;: &lt;a href=&quot;https://artifacthub.io/&quot;&gt;Artifact Hub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes Documentation&lt;/strong&gt;: &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ArgoCD for Helm&lt;/strong&gt;: &lt;a href=&quot;https://argo-cd.readthedocs.io/&quot;&gt;ArgoCD Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;FluxCD for Helm&lt;/strong&gt;: &lt;a href=&quot;https://fluxcd.io/&quot;&gt;FluxCD Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Helm Security Best Practices&lt;/strong&gt;: &lt;a href=&quot;https://helm.sh/docs/topics/security/&quot;&gt;Helm Security Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Crash Course on Developing AI Applications with LangChain</title><link>https://iceberglakehouse.com/posts/2025-02-crash-course-on-langchain/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-02-crash-course-on-langchain/</guid><description>
## Free Resources

- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_...</description><pubDate>Sat, 01 Feb 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_langchain&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Large Language Models (LLMs) have revolutionized the way developers build AI-powered applications, from chatbots to intelligent search systems. However, managing LLM interactions effectively—structuring prompts, handling memory, and integrating external tools—can be complex. This is where &lt;strong&gt;LangChain&lt;/strong&gt; comes in.&lt;/p&gt;
&lt;p&gt;LangChain is an open-source framework designed to simplify working with LLMs, enabling developers to create powerful AI applications with ease. By providing a modular approach, LangChain allows you to compose &lt;strong&gt;prompt templates, chains, memory, and agents&lt;/strong&gt; to build flexible and scalable solutions.&lt;/p&gt;
&lt;p&gt;In this guide, we&apos;ll introduce you to &lt;strong&gt;LangChain&lt;/strong&gt; and its companion libraries, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;langchain_community&lt;/code&gt;: A collection of core integrations and utilities.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;langchain_openai&lt;/code&gt;: A dedicated library for working with OpenAI models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We&apos;ll walk you through key LangChain concepts, installation steps, and practical code examples to help you get started. Whether you&apos;re looking to build chatbots, AI-powered search engines, or decision-making agents, this guide will give you the foundation you need to start developing with LangChain.&lt;/p&gt;
&lt;h2&gt;What is LangChain?&lt;/h2&gt;
&lt;p&gt;LangChain is an open-source framework that simplifies building applications powered by Large Language Models (LLMs). Instead of manually handling prompts, API calls, and responses, LangChain provides a structured way to &lt;strong&gt;chain together different components&lt;/strong&gt; such as prompts, memory, and external tools.&lt;/p&gt;
&lt;h3&gt;Why Use LangChain?&lt;/h3&gt;
&lt;p&gt;Without LangChain, interacting with an LLM typically involves:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Formatting a prompt manually.&lt;/li&gt;
&lt;li&gt;Sending the request to an API (e.g., OpenAI, Cohere).&lt;/li&gt;
&lt;li&gt;Parsing the response and deciding the next action.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;LangChain automates and streamlines these steps, making it easier to build complex AI applications with minimal effort.&lt;/p&gt;
&lt;h3&gt;Key Use Cases&lt;/h3&gt;
&lt;p&gt;LangChain is widely used for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Chatbots &amp;amp; Virtual Assistants&lt;/strong&gt; – Retaining conversation context and improving responses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt; – Enhancing LLM responses by fetching external data sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Processing &amp;amp; Summarization&lt;/strong&gt; – Analyzing and summarizing large documents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Agents&lt;/strong&gt; – Creating autonomous agents that interact with external APIs and databases.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging LangChain’s modular architecture, you can integrate various &lt;strong&gt;models, tools, and memory mechanisms&lt;/strong&gt; to build dynamic AI-driven applications.&lt;/p&gt;
&lt;h2&gt;Core Concepts in LangChain&lt;/h2&gt;
&lt;p&gt;LangChain is built around a modular architecture that allows developers to compose different components into a pipeline. Here are some of the key concepts you need to understand when working with LangChain:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Prompt Templates&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Prompt templates help structure the input given to an LLM. Instead of writing static prompts, you can create dynamic templates that format user inputs into well-structured queries.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.prompts import PromptTemplate

template = PromptTemplate(
    input_variables=[&amp;quot;topic&amp;quot;],
    template=&amp;quot;Explain {topic} in simple terms.&amp;quot;
)

formatted_prompt = template.format(topic=&amp;quot;LangChain&amp;quot;)
print(formatted_prompt)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that every input follows a structured format before being passed to the model.&lt;/p&gt;
&lt;h3&gt;2. LLMs and Model Wrappers&lt;/h3&gt;
&lt;p&gt;LangChain provides an easy way to interface with different LLM providers like OpenAI, Hugging Face, and more.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key&amp;quot;)
response = llm.invoke(&amp;quot;What is LangChain?&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows you to seamlessly query the LLM without worrying about API details.&lt;/p&gt;
&lt;h3&gt;3. Chains&lt;/h3&gt;
&lt;p&gt;Chains allow you to combine multiple components (e.g., a prompt template and an LLM) into a single workflow.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.chains import LLMChain

llm_chain = LLMChain(llm=llm, prompt=template)
response = llm_chain.run(&amp;quot;machine learning&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the prompt is formatted and automatically passed to the LLM, reducing boilerplate code.&lt;/p&gt;
&lt;h3&gt;4. Memory&lt;/h3&gt;
&lt;p&gt;Memory allows your application to retain context between interactions, which is crucial for chatbots and multi-turn conversations.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
memory.save_context({&amp;quot;input&amp;quot;: &amp;quot;Hello&amp;quot;}, {&amp;quot;output&amp;quot;: &amp;quot;Hi, how can I help you?&amp;quot;})
print(memory.load_memory_variables({}))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With memory, LangChain can track past interactions and use them to generate more coherent responses.&lt;/p&gt;
&lt;h3&gt;5. Agents and Tools&lt;/h3&gt;
&lt;p&gt;Agents allow an LLM to make decisions dynamically. Instead of following a predefined sequence, an agent determines which tool to call based on the user’s query.&lt;/p&gt;
&lt;h4&gt;Example:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool

# Tools receive a single string input from the agent, so parse the numbers out of it
def add_numbers(expression: str) -&amp;gt; str:
    a, b = (int(x) for x in expression.split(&amp;quot;,&amp;quot;))
    return str(a + b)

tool = Tool(
    name=&amp;quot;Calculator&amp;quot;,
    func=add_numbers,
    description=&amp;quot;Adds two numbers provided as a comma-separated string, e.g. 3, 5.&amp;quot;
)

agent = initialize_agent(
    tools=[tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)

response = agent.run(&amp;quot;What is 3 + 5?&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This enables an LLM to call functions, fetch data, or interact with APIs to generate more intelligent responses.&lt;/p&gt;
&lt;p&gt;By understanding these core concepts, you can start building more structured and powerful AI applications with LangChain. In the next section, we’ll set up LangChain and its companion libraries to start developing real-world applications.&lt;/p&gt;
&lt;h2&gt;Installing LangChain and Companion Libraries&lt;/h2&gt;
&lt;p&gt;Before we start building with LangChain, we need to install the necessary packages. LangChain is modular, meaning that different functionalities are split across separate libraries. The main ones you&apos;ll need are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;langchain&lt;/code&gt;&lt;/strong&gt; – The core LangChain library.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;langchain_community&lt;/code&gt;&lt;/strong&gt; – A collection of integrations for third-party tools and services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;langchain_openai&lt;/code&gt;&lt;/strong&gt; – A dedicated package for working with OpenAI models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;openai&lt;/code&gt;&lt;/strong&gt; – The OpenAI Python SDK for API access.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;1. Installing LangChain and Dependencies&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;You can install the required libraries using &lt;code&gt;pip&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install langchain langchain_community langchain_openai openai
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will install the core LangChain framework along with the OpenAI integration.&lt;/p&gt;
&lt;h3&gt;2. Setting Up an OpenAI API Key&lt;/h3&gt;
&lt;p&gt;If you plan to use OpenAI models, you’ll need an API key. Follow these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sign up at OpenAI.&lt;/li&gt;
&lt;li&gt;Navigate to your API settings and generate an API key.&lt;/li&gt;
&lt;li&gt;Store your API key securely.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can set your API key in an environment variable:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;export OPENAI_API_KEY=&amp;quot;your_api_key_here&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or pass it directly in your code:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os
os.environ[&amp;quot;OPENAI_API_KEY&amp;quot;] = &amp;quot;your_api_key_here&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Verifying the Installation&lt;/h3&gt;
&lt;p&gt;To test if everything is installed correctly, run the following Python script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key_here&amp;quot;)
response = llm.invoke(&amp;quot;Say hello in French.&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you see a French greeting such as &amp;quot;Bonjour!&amp;quot;, your setup is working properly (the exact wording may vary between runs).&lt;/p&gt;
&lt;h3&gt;4. Understanding the Role of Companion Libraries&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;langchain_community&lt;/strong&gt;: Contains integrations for databases, vector stores, and APIs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;langchain_openai&lt;/strong&gt;: A streamlined package for interacting with OpenAI&apos;s models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Other integrations&lt;/strong&gt;: LangChain supports many LLM providers (Cohere, Hugging Face, etc.), which can be installed separately.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With LangChain and its dependencies installed, you&apos;re ready to start building AI-powered applications. In the next section, we&apos;ll explore how to use LangChain with OpenAI models and create structured workflows.&lt;/p&gt;
&lt;h2&gt;Setting Up and Using LangChain&lt;/h2&gt;
&lt;p&gt;Now that we have LangChain installed, let&apos;s explore how to use it for interacting with LLMs, structuring prompts, and building simple AI workflows.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Connecting to an OpenAI Model&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The first step in using LangChain is to connect to an LLM. We&apos;ll start by using OpenAI&apos;s models.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Example: Basic Query to an OpenAI Model&lt;/strong&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key_here&amp;quot;)

response = llm.invoke(&amp;quot;What is LangChain?&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This sends a query to OpenAI and prints the response. The &lt;code&gt;invoke&lt;/code&gt; method is the recommended way to interact with LLMs in LangChain.&lt;/p&gt;
&lt;h3&gt;2. Working with Prompt Templates&lt;/h3&gt;
&lt;p&gt;A prompt template ensures that user input is formatted consistently before being sent to an LLM. This is useful when you need structured responses.&lt;/p&gt;
&lt;h4&gt;Example: Creating and Using a Prompt Template&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.prompts import PromptTemplate

template = PromptTemplate(
    input_variables=[&amp;quot;topic&amp;quot;],
    template=&amp;quot;Explain {topic} in simple terms.&amp;quot;
)

formatted_prompt = template.format(topic=&amp;quot;machine learning&amp;quot;)
print(formatted_prompt)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This generates a properly structured prompt:
&amp;quot;Explain machine learning in simple terms.&amp;quot;&lt;/p&gt;
&lt;p&gt;You can pass this formatted prompt to an LLM for processing.&lt;/p&gt;
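&lt;p&gt;For instance, the formatted string can be passed straight to a model with &lt;code&gt;invoke&lt;/code&gt; (a minimal sketch; the API key placeholder is an assumption):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI
from langchain.prompts import PromptTemplate

llm = OpenAI(api_key=&amp;quot;your_api_key_here&amp;quot;)

template = PromptTemplate(
    input_variables=[&amp;quot;topic&amp;quot;],
    template=&amp;quot;Explain {topic} in simple terms.&amp;quot;
)

# Format the prompt first, then send the resulting string to the model
formatted_prompt = template.format(topic=&amp;quot;machine learning&amp;quot;)
response = llm.invoke(formatted_prompt)
print(response)
&lt;/code&gt;&lt;/pre&gt;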
&lt;h3&gt;3. Building a Basic Chain&lt;/h3&gt;
&lt;p&gt;A chain connects multiple components, such as prompts and LLMs, to automate workflows.&lt;/p&gt;
&lt;h4&gt;Example: Using a Chain to Generate Responses&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.chains import LLMChain

llm_chain = LLMChain(llm=llm, prompt=template)
response = llm_chain.run(&amp;quot;data science&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, LangChain automatically formats the prompt and sends it to the LLM, reducing manual effort.&lt;/p&gt;
&lt;h3&gt;4. Using Memory to Maintain Context&lt;/h3&gt;
&lt;p&gt;By default, LLMs don’t remember past interactions. LangChain provides memory components to store and retrieve conversation history.&lt;/p&gt;
&lt;h4&gt;Example: Storing Conversation History&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()

# Simulating a conversation
memory.save_context({&amp;quot;input&amp;quot;: &amp;quot;Hello&amp;quot;}, {&amp;quot;output&amp;quot;: &amp;quot;Hi, how can I help you?&amp;quot;})
memory.save_context({&amp;quot;input&amp;quot;: &amp;quot;What is LangChain?&amp;quot;}, {&amp;quot;output&amp;quot;: &amp;quot;LangChain is a framework for working with LLMs.&amp;quot;})

# Retrieving stored interactions
print(memory.load_memory_variables({}))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that previous interactions can be referenced in future queries.&lt;/p&gt;
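&lt;p&gt;Memory is most useful when wired into a chain. Here is a minimal sketch using &lt;code&gt;ConversationChain&lt;/code&gt;, which saves each turn automatically (the API key placeholder is an assumption):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key_here&amp;quot;)

# Each call appends the exchange to memory and prepends the history to the next prompt
conversation = ConversationChain(llm=llm, memory=ConversationBufferMemory())

conversation.run(&amp;quot;Hi, my name is Alex.&amp;quot;)
response = conversation.run(&amp;quot;What is my name?&amp;quot;)  # The model can now see the first turn
print(response)
&lt;/code&gt;&lt;/pre&gt;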
&lt;h3&gt;5. Implementing an Agent with Tools&lt;/h3&gt;
&lt;p&gt;An agent allows LLMs to dynamically decide which tool to use for a given query. For example, we can create an agent that uses a calculator tool.&lt;/p&gt;
&lt;h4&gt;Example: Creating an Agent to Perform Calculations&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.agents import initialize_agent, AgentType
from langchain.tools import Tool

# Defining a simple addition tool; the agent passes the tool a single string input
def add_numbers(expression: str) -&amp;gt; str:
    a, b = (int(x) for x in expression.split(&amp;quot;,&amp;quot;))
    return str(a + b)

tool = Tool(
    name=&amp;quot;Calculator&amp;quot;,
    func=add_numbers,
    description=&amp;quot;Adds two numbers provided as a comma-separated string, e.g. 5, 7.&amp;quot;
)

# Creating an agent with the tool
agent = initialize_agent(
    tools=[tool],
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)

# Running the agent
response = agent.run(&amp;quot;What is 5 + 7?&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This enables the LLM to recognize when to use the calculator tool instead of responding based purely on its pre-trained knowledge.&lt;/p&gt;
&lt;h3&gt;What’s Next?&lt;/h3&gt;
&lt;p&gt;Now that we&apos;ve covered basic LangChain functionalities, you can start experimenting with more advanced features like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Retrieval-Augmented Generation (RAG) –&lt;/strong&gt; Enhancing LLMs with external knowledge sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vector Databases –&lt;/strong&gt; Storing and retrieving information efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom Tools and APIs –&lt;/strong&gt; Expanding agents to interact with real-world data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the next section, we&apos;ll discuss best practices for using LangChain efficiently and how to scale applications for production use.&lt;/p&gt;
&lt;h2&gt;Best Practices and Next Steps&lt;/h2&gt;
&lt;p&gt;Now that you understand the basics of LangChain—connecting to LLMs, structuring prompts, using chains, memory, and agents—let’s discuss some best practices for building efficient and scalable applications.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;1. Optimize Prompt Engineering&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;clear and structured prompt templates&lt;/strong&gt; to get better responses from LLMs.&lt;/li&gt;
&lt;li&gt;Experiment with &lt;strong&gt;few-shot learning&lt;/strong&gt; by providing example inputs and outputs.&lt;/li&gt;
&lt;li&gt;Keep prompts &lt;strong&gt;concise&lt;/strong&gt; to reduce token usage and improve performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Example: Few-Shot Prompting&lt;/strong&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.prompts import PromptTemplate

template = PromptTemplate(
    input_variables=[&amp;quot;word&amp;quot;],
    template=&amp;quot;Convert the following word into plural form.\n\nExamples:\n- dog -&amp;gt; dogs\n- cat -&amp;gt; cats\n- {word} -&amp;gt; ?&amp;quot;
)

print(template.format(word=&amp;quot;tree&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Providing examples improves the model&apos;s accuracy.&lt;/p&gt;
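&lt;p&gt;LangChain also provides a dedicated &lt;code&gt;FewShotPromptTemplate&lt;/code&gt; that assembles the examples for you; a sketch of the same pluralization prompt:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.prompts import FewShotPromptTemplate, PromptTemplate

example_prompt = PromptTemplate(
    input_variables=[&amp;quot;singular&amp;quot;, &amp;quot;plural&amp;quot;],
    template=&amp;quot;- {singular} -&amp;gt; {plural}&amp;quot;
)

few_shot = FewShotPromptTemplate(
    examples=[
        {&amp;quot;singular&amp;quot;: &amp;quot;dog&amp;quot;, &amp;quot;plural&amp;quot;: &amp;quot;dogs&amp;quot;},
        {&amp;quot;singular&amp;quot;: &amp;quot;cat&amp;quot;, &amp;quot;plural&amp;quot;: &amp;quot;cats&amp;quot;},
    ],
    example_prompt=example_prompt,
    prefix=&amp;quot;Convert the following word into plural form.\n\nExamples:&amp;quot;,
    suffix=&amp;quot;- {word} -&amp;gt; ?&amp;quot;,
    input_variables=[&amp;quot;word&amp;quot;]
)

print(few_shot.format(word=&amp;quot;tree&amp;quot;))
&lt;/code&gt;&lt;/pre&gt;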
&lt;h3&gt;2. Use Memory Efficiently&lt;/h3&gt;
&lt;p&gt;Only use conversation memory when necessary (e.g., chatbots).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Choose the right memory type:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ConversationBufferMemory –&lt;/strong&gt; Stores all conversation history.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ConversationSummaryMemory –&lt;/strong&gt; Summarizes past interactions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ConversationKGMemory –&lt;/strong&gt; Extracts key facts from a conversation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Using Summary Memory&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.memory import ConversationSummaryMemory
from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key&amp;quot;)
memory = ConversationSummaryMemory(llm=llm)

memory.save_context({&amp;quot;input&amp;quot;: &amp;quot;I love pizza.&amp;quot;}, {&amp;quot;output&amp;quot;: &amp;quot;Pizza is a great choice!&amp;quot;})
summary = memory.load_memory_variables({})
print(summary)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This helps reduce storage while maintaining context.&lt;/p&gt;
&lt;h3&gt;3. Handle API Costs and Rate Limits&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Use token-efficient prompts to reduce API costs.&lt;/li&gt;
&lt;li&gt;Implement batch processing for multiple queries.&lt;/li&gt;
&lt;li&gt;Monitor API usage with OpenAI’s rate limits in mind.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Monitoring Token Usage&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key&amp;quot;, model=&amp;quot;gpt-3.5-turbo-instruct&amp;quot;, max_tokens=100)
response = llm.invoke(&amp;quot;Summarize the history of AI in 50 words.&amp;quot;)
print(response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Setting &lt;code&gt;max_tokens&lt;/code&gt; prevents excessive token consumption.&lt;/p&gt;
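&lt;p&gt;For batch processing, the &lt;code&gt;generate&lt;/code&gt; method accepts a list of prompts in a single call (a sketch; the API key placeholder is an assumption):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key&amp;quot;, max_tokens=100)

# generate() returns one result per prompt, batching the work into a single call
results = llm.generate([
    &amp;quot;Summarize the history of AI in 50 words.&amp;quot;,
    &amp;quot;Summarize the history of the internet in 50 words.&amp;quot;
])

for generation in results.generations:
    print(generation[0].text)
&lt;/code&gt;&lt;/pre&gt;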
&lt;h3&gt;4. Enhance LLMs with External Knowledge (RAG)&lt;/h3&gt;
&lt;p&gt;Retrieval-Augmented Generation (RAG) improves LLM responses by fetching external data instead of relying solely on pre-trained knowledge.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use vector databases like Pinecone or FAISS for document search.&lt;/li&gt;
&lt;li&gt;Fetch real-time data from APIs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Querying an External Document&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Load a previously saved FAISS index (the deserialization flag is required in recent versions)
embeddings = OpenAIEmbeddings(api_key=&amp;quot;your_api_key&amp;quot;)
vectorstore = FAISS.load_local(&amp;quot;faiss_index&amp;quot;, embeddings, allow_dangerous_deserialization=True)

# Query the knowledge base
docs = vectorstore.similarity_search(&amp;quot;What is LangChain?&amp;quot;, k=2)
print(docs)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This retrieves relevant documents to supplement the LLM’s response.&lt;/p&gt;
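&lt;p&gt;The retrieved documents can also be fed to the LLM automatically with a retrieval chain. A minimal sketch using &lt;code&gt;RetrievalQA&lt;/code&gt;, assuming the &lt;code&gt;vectorstore&lt;/code&gt; from the example above and a placeholder API key:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

llm = OpenAI(api_key=&amp;quot;your_api_key&amp;quot;)

# The chain fetches the top matching documents and inserts them into the prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={&amp;quot;k&amp;quot;: 2})
)

answer = qa_chain.run(&amp;quot;What is LangChain?&amp;quot;)
print(answer)
&lt;/code&gt;&lt;/pre&gt;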
&lt;h3&gt;5. Scale Applications for Production&lt;/h3&gt;
&lt;p&gt;When moving from prototyping to production, consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Caching responses to avoid redundant API calls.&lt;/li&gt;
&lt;li&gt;Logging interactions for debugging and improvement.&lt;/li&gt;
&lt;li&gt;Implementing user authentication for secured access.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Implementing Response Caching&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from langchain.cache import InMemoryCache
from langchain.chains import LLMChain
from langchain.globals import set_llm_cache

set_llm_cache(InMemoryCache())  # Enable caching globally for all LLM calls

llm_chain = LLMChain(llm=llm, prompt=template)

response1 = llm_chain.run(&amp;quot;machine learning&amp;quot;)
response2 = llm_chain.run(&amp;quot;machine learning&amp;quot;)  # Served from the cache
print(response2)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Caching reduces API calls, improving performance and cost-efficiency.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;LangChain provides a powerful framework for building AI applications that leverage Large Language Models (LLMs). By combining &lt;strong&gt;prompt engineering, chains, memory, and agents&lt;/strong&gt;, LangChain simplifies the development process, making it easier to create &lt;strong&gt;chatbots, AI assistants, and retrieval-augmented generation (RAG) applications&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In this guide, we covered:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What LangChain is and why it’s useful.&lt;/li&gt;
&lt;li&gt;Core concepts like &lt;strong&gt;prompt templates, chains, memory, and agents&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;How to &lt;strong&gt;install and set up LangChain&lt;/strong&gt; along with &lt;code&gt;langchain_openai&lt;/code&gt; and &lt;code&gt;langchain_community&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Practical &lt;strong&gt;code examples&lt;/strong&gt; for using LangChain with OpenAI models.&lt;/li&gt;
&lt;li&gt;Best practices for &lt;strong&gt;optimizing prompts, managing memory, and reducing API costs&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;How to &lt;strong&gt;scale LangChain applications for production&lt;/strong&gt; using caching and external knowledge retrieval.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By applying these concepts, you can start building &lt;strong&gt;custom AI-powered solutions&lt;/strong&gt; with real-world impact.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;&lt;strong&gt;Where to Go from Here?&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;If you&apos;re ready to take the next step, consider:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Building a LangChain Project&lt;/strong&gt; – Try creating a chatbot, document summarizer, or an AI-driven search engine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exploring Vector Databases&lt;/strong&gt; – Learn how to integrate Pinecone, FAISS, or ChromaDB for RAG applications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Joining the Community&lt;/strong&gt; – Engage with other developers on &lt;a href=&quot;https://github.com/langchain-ai/langchain&quot;&gt;LangChain&apos;s GitHub&lt;/a&gt; or &lt;a href=&quot;https://discord.gg/langchain&quot;&gt;Discord&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;LangChain is continuously evolving, and staying updated with the latest features will help you build &lt;strong&gt;more advanced and efficient AI applications&lt;/strong&gt;. Start experimenting and bring your AI ideas to life!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intro_langchain&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Data Lakehouse - The Benefits and Enhancing Implementation</title><link>https://iceberglakehouse.com/posts/2025-01-the-data-lakehouse-benefits-and-enhancing/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-01-the-data-lakehouse-benefits-and-enhancing/</guid><description>
## Free Resources  
- **[Free Apache Iceberg Course](https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev...</description><pubDate>Fri, 31 Jan 2025 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse_benefts_solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;data lakehouse&lt;/strong&gt; has been a significant topic in data architecture over the past several years. However, like any high-value trend, it’s easy to get caught up in the hype and lose sight of the &lt;strong&gt;real reasons&lt;/strong&gt; for adopting this new paradigm.&lt;/p&gt;
&lt;p&gt;In this article, I aim to &lt;strong&gt;clarify the key benefits of a lakehouse&lt;/strong&gt;, highlight the &lt;strong&gt;challenges organizations face in implementing one&lt;/strong&gt;, and explore &lt;strong&gt;practical solutions&lt;/strong&gt; to overcome those challenges.&lt;/p&gt;
&lt;h2&gt;The Problems We Are Trying to Solve For&lt;/h2&gt;
&lt;p&gt;Traditionally, running analytics directly on &lt;strong&gt;operational databases (OLTP systems)&lt;/strong&gt; is neither performant nor efficient, as it creates &lt;strong&gt;resource contention&lt;/strong&gt; with transactional workloads that power enterprise operations. The standard solution has been to &lt;strong&gt;offload this data into a data warehouse&lt;/strong&gt;, which optimizes storage for analytics, manages data efficiently, and provides a processing layer for analytical queries.&lt;/p&gt;
&lt;p&gt;However, not all data is structured or fits neatly into a data warehouse. Additionally, storing &lt;strong&gt;all structured data in a data warehouse can be cost-prohibitive&lt;/strong&gt;. As a result, an intermediate layer—a &lt;strong&gt;data lake&lt;/strong&gt;—is often introduced, where copies of data are stored for &lt;strong&gt;ad hoc analysis&lt;/strong&gt; on &lt;strong&gt;distributed storage systems&lt;/strong&gt; like &lt;strong&gt;Amazon S3, ADLS, MinIO, NetApp StorageGRID, Vast Data, Pure Storage, Nutanix, and others&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In large enterprises, different business units often choose &lt;strong&gt;different data warehouses&lt;/strong&gt;, leading to &lt;strong&gt;multiple copies&lt;/strong&gt; of the same data, inconsistently modeled across departments. This fragmentation introduces several challenges:&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. Consistency&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;With multiple copies, &lt;strong&gt;business metrics&lt;/strong&gt; can have &lt;strong&gt;different definitions and values&lt;/strong&gt; depending on which department’s data model you reference, leading to &lt;strong&gt;discrepancies in decision-making&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;2. Time to Insight&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;As &lt;strong&gt;data volumes grow&lt;/strong&gt; and the demand for &lt;strong&gt;real-time or near real-time insights&lt;/strong&gt; increases, excessive &lt;strong&gt;data movement&lt;/strong&gt; becomes a bottleneck. Even if individual transactions are fast, the cumulative impact of copying and processing delays data accessibility.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;3. Centralization Bottlenecks&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;To &lt;strong&gt;improve consistency&lt;/strong&gt;, some organizations centralize modeling in an &lt;strong&gt;enterprise-wide data warehouse&lt;/strong&gt; with &lt;strong&gt;department-specific data marts&lt;/strong&gt;. However, this centralization can create &lt;strong&gt;bottlenecks&lt;/strong&gt;, &lt;strong&gt;slowing down access to insights&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;4. Cost&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Every step of data movement incurs &lt;strong&gt;costs&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Compute resources&lt;/strong&gt; for processing,&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage costs&lt;/strong&gt; for redundant copies, and&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BI tool expenses&lt;/strong&gt; from multiple teams generating similar &lt;strong&gt;data extracts&lt;/strong&gt; across different tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;5. Governance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Not all enterprise data resides in a &lt;strong&gt;data warehouse&lt;/strong&gt;. There will always be &lt;strong&gt;a long tail of data&lt;/strong&gt; in &lt;strong&gt;external systems&lt;/strong&gt;, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Partner-shared data&lt;/strong&gt;,&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data marketplaces&lt;/strong&gt;, or&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regulatory-restricted environments&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Managing access to a &lt;strong&gt;holistic data picture&lt;/strong&gt; while maintaining &lt;strong&gt;governance and security&lt;/strong&gt; across &lt;strong&gt;distributed sources&lt;/strong&gt; is a significant challenge.&lt;/p&gt;
&lt;p&gt;This is where the &lt;strong&gt;data lakehouse&lt;/strong&gt; emerges as a &lt;strong&gt;solution&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;The Data Lakehouse Solution&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Data warehouses&lt;/strong&gt; provide essential &lt;strong&gt;data management&lt;/strong&gt; capabilities and &lt;strong&gt;ACID guarantees&lt;/strong&gt;, ensuring &lt;strong&gt;consistency and reliability&lt;/strong&gt; in analytics. However, these features have traditionally been &lt;strong&gt;absent from data lakes&lt;/strong&gt;, as data lakes are not inherently data platforms but &lt;strong&gt;repositories of raw data stored on open storage&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If we &lt;strong&gt;bring data management and ACID transactions&lt;/strong&gt; to the &lt;strong&gt;data lake&lt;/strong&gt;, organizations can work with &lt;strong&gt;a single canonical copy&lt;/strong&gt; directly within the lake, eliminating the need to replicate data across &lt;strong&gt;multiple data warehouses&lt;/strong&gt;. This transformation turns the &lt;strong&gt;data lake into a data warehouse&lt;/strong&gt;—hence the term &lt;strong&gt;data lakehouse&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This is achieved by adopting:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Table Formats&lt;/strong&gt; like &lt;strong&gt;Apache Iceberg, Apache Hudi, Delta Lake, or Apache Paimon&lt;/strong&gt;, enabling &lt;strong&gt;Parquet files&lt;/strong&gt; to act as &lt;strong&gt;structured, ACID-compliant tables&lt;/strong&gt; optimized for analytics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Catalogs&lt;/strong&gt; like &lt;strong&gt;Apache Polaris, Nessie, Apache Gravitino, Lakekeeper, and Unity&lt;/strong&gt;, which provide &lt;strong&gt;metadata tracking&lt;/strong&gt; for seamless data discovery and access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Managed Catalog Services&lt;/strong&gt; (e.g., &lt;strong&gt;Dremio&lt;/strong&gt;), which &lt;strong&gt;automate data optimization and governance&lt;/strong&gt;, reducing unnecessary data movement.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;Key Benefits of a Lakehouse Approach&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;✅ &lt;strong&gt;Lower costs&lt;/strong&gt; by reducing &lt;strong&gt;data replication and processing overhead&lt;/strong&gt;.&lt;br&gt;
✅ &lt;strong&gt;Improved consistency&lt;/strong&gt; by maintaining &lt;strong&gt;a single source of truth&lt;/strong&gt;.&lt;br&gt;
✅ &lt;strong&gt;Faster time to insight&lt;/strong&gt; with &lt;strong&gt;direct access to analytics-ready data&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Challenges That Remain&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Despite its advantages, a lakehouse alone does not &lt;strong&gt;solve every challenge&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Migration Delays&lt;/strong&gt; – Moving existing data &lt;strong&gt;takes time&lt;/strong&gt;, delaying the &lt;strong&gt;full benefits&lt;/strong&gt; of a lakehouse.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed Data Sources&lt;/strong&gt; – Not all data resides in the lakehouse; &lt;strong&gt;external data&lt;/strong&gt; remains a challenge.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BI Tool Extracts&lt;/strong&gt; – Users &lt;strong&gt;may still create&lt;/strong&gt; redundant &lt;strong&gt;isolated extracts&lt;/strong&gt;, increasing costs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where the &lt;strong&gt;Dremio Lakehouse Platform&lt;/strong&gt; fills the gap.&lt;/p&gt;
&lt;h2&gt;The Dremio Solution&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com&quot;&gt;Dremio is a &lt;strong&gt;lakehouse platform&lt;/strong&gt;&lt;/a&gt; that integrates &lt;strong&gt;four key capabilities&lt;/strong&gt; into a &lt;strong&gt;holistic data integration solution&lt;/strong&gt;, addressing the remaining &lt;strong&gt;lakehouse challenges&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;1. High-Performance Federated Query Engine&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Best-in-class &lt;strong&gt;raw query performance&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Federates queries across &lt;strong&gt;lakehouse catalogs, data lakes, databases, and warehouses&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Provides a &lt;strong&gt;centralized experience&lt;/strong&gt; across &lt;strong&gt;disparate data sources&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;2. Semantic Layer&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Enables &lt;strong&gt;virtual data marts&lt;/strong&gt; without data duplication.&lt;/li&gt;
&lt;li&gt;Built-in &lt;strong&gt;wiki and search&lt;/strong&gt; for &lt;strong&gt;dataset documentation&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Standardizes &lt;strong&gt;business metrics and datasets&lt;/strong&gt; across all tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;3. Query Acceleration&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reflections&lt;/strong&gt; replace traditional &lt;strong&gt;materialized views and BI cubes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Raw Reflections&lt;/strong&gt; (precomputed query results).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregate Reflections&lt;/strong&gt; (optimized aggregations for fast analytics).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automatic query acceleration&lt;/strong&gt;, with no effort required from analysts.&lt;/li&gt;
&lt;/ul&gt;
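&lt;p&gt;As a rough illustration of why acceleration like this helps, here is a toy Python sketch of answering a grouped query from a precomputed aggregate instead of rescanning raw rows. This is only an analogy for the idea behind aggregate acceleration, not Dremio&apos;s actual Reflections implementation; all names and data are made up.&lt;/p&gt;

```python
# Toy model: a precomputed aggregate answers a grouped query without
# rescanning the base rows. Illustrative only; not Dremio's internals.
from collections import defaultdict

raw_rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]

# Maintained ahead of time, the way an acceleration layer keeps a
# precomputed aggregation in sync with the base table.
aggregate = defaultdict(int)
for row in raw_rows:
    aggregate[row["region"]] += row["amount"]

def total_for(region: str) -> int:
    # The "accelerated" path: a single lookup instead of a full scan.
    return aggregate[region]

print(total_for("east"))  # 17
```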
&lt;h3&gt;&lt;strong&gt;4. Integrated Lakehouse Catalog&lt;/strong&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Tracks and &lt;strong&gt;manages Apache Iceberg tables&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automates maintenance and cleanup&lt;/strong&gt; of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provides centralized, portable governance&lt;/strong&gt; across all queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;&lt;strong&gt;The Dremio Advantage&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;✅ &lt;strong&gt;Instant Lakehouse Benefits&lt;/strong&gt; – Get the advantages &lt;strong&gt;immediately&lt;/strong&gt;, even before full migration.&lt;br&gt;
✅ &lt;strong&gt;Improved Consistency&lt;/strong&gt; – Ensure &lt;strong&gt;a unified definition of business metrics&lt;/strong&gt;.&lt;br&gt;
✅ &lt;strong&gt;High-Performance Analytics&lt;/strong&gt; – Federated queries + &lt;strong&gt;Reflections&lt;/strong&gt; accelerate workloads.&lt;br&gt;
✅ &lt;strong&gt;Automated Management&lt;/strong&gt; – No &lt;strong&gt;manual cleanup&lt;/strong&gt; of lakehouse tables needed.&lt;br&gt;
✅ &lt;strong&gt;Centralized Governance&lt;/strong&gt; – Unified &lt;strong&gt;access control&lt;/strong&gt; across &lt;strong&gt;all tools and sources&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;data lakehouse&lt;/strong&gt; represents a transformative shift in data architecture, solving long-standing challenges around &lt;strong&gt;data consistency, cost, and accessibility&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;However, simply adopting a &lt;strong&gt;lakehouse format&lt;/strong&gt; isn’t enough. Organizations need &lt;strong&gt;a lakehouse solution that integrates data management, acceleration, and governance&lt;/strong&gt; to fully unlock the benefits.&lt;/p&gt;
&lt;p&gt;Dremio provides that &lt;strong&gt;missing piece&lt;/strong&gt; with:&lt;br&gt;
✅ &lt;strong&gt;Federated query capabilities&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;A built-in semantic layer&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;Automated query acceleration&lt;/strong&gt;&lt;br&gt;
✅ &lt;strong&gt;A fully managed lakehouse catalog&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;With &lt;strong&gt;Dremio&lt;/strong&gt;, organizations &lt;strong&gt;don’t just implement a lakehouse&lt;/strong&gt;—they &lt;strong&gt;enhance it&lt;/strong&gt;, unlocking its &lt;strong&gt;full potential&lt;/strong&gt; for faster insights, better decision-making, and long-term cost savings.&lt;/p&gt;
&lt;h2&gt;Free Resources&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse_benefts_solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Course&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehouse-benefits-solu&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>2025 Comprehensive Guide to Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2025-01-2025-comprehensive-apache-iceberg-guide/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-01-2025-comprehensive-apache-iceberg-guide/</guid><description>
- [Free Apache Iceberg Crash Course](https://university.dremio.com/?utm_source=ev_external_blog&amp;utm_medium=influencer&amp;utm_campaign=2025-iceberg-comp-...</description><pubDate>Mon, 20 Jan 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://university.dremio.com/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025-iceberg-comp-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025-iceberg-comp-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of “Apache Iceberg: The Definitive Guide”&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/2025-guide-to-architecting-an-iceberg-lakehouse-9b19ed42c9de&quot;&gt;2025 Apache Iceberg Architecture Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.alexmerced.blog/guide-to-finding-apache-iceberg-events-near-you-and-being-part-of-the-greater-iceberg-community-0c38ae785ddb&quot;&gt;How to Join the Iceberg Community&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&amp;amp;si=WTSnqjXZv6Glkc3y&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/ultimate-directory-of-apache-iceberg-resources-e3e02efac62e&quot;&gt;Ultimate Apache Iceberg Resource Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg had a monumental 2024, with significant announcements and advancements from major players like Dremio, Snowflake, Databricks, AWS, and other leading data platforms. The Iceberg ecosystem is evolving rapidly, making it essential for professionals to stay up-to-date with the latest innovations. To help navigate this ever-changing space, I’m introducing an annual guide dedicated to Apache Iceberg. This guide aims to provide a comprehensive overview of Iceberg, highlight key resources, and offer valuable insights for anyone looking to deepen their knowledge. Whether you’re just starting with Iceberg or are a seasoned user, this guide will serve as your go-to resource for 2025.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/migration-guide-for-apache-iceberg-lakehouses/&quot;&gt;Read this article for details on migrating to Apache Iceberg.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;What is a Table Format?&lt;/h2&gt;
&lt;p&gt;A table format, often referred to as an “open table format” or “lakehouse table format,” is a foundational component of the data lakehouse architecture. This architecture is gaining popularity for its ability to address the complexities of modern data management. Table formats transform how data stored in collections of analytics-optimized Parquet files is accessed and managed. Instead of treating these files as standalone units to be opened and read individually, a table format enables them to function like traditional database tables, complete with ACID guarantees.&lt;/p&gt;
&lt;p&gt;With a table format, users can interact with data through SQL to create, read, update, and delete records, bringing the functionality of a data warehouse directly to the data lake. This capability allows enterprises to treat their data lake as a unified platform, supporting both data warehousing and data lake use cases. It also enables teams across an organization to work with a single copy of data in their tool of choice — whether for analytics, machine learning, or operational reporting — eliminating redundant data movements, reducing costs, and improving consistency across the enterprise.&lt;/p&gt;
&lt;p&gt;Currently, there are four primary table formats driving innovation in this space:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg:&lt;/strong&gt; Originating from Netflix, this blog’s focus, Iceberg is known for its flexibility and robust support for big data operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt; Developed by Databricks, it emphasizes simplicity and seamless integration with their ecosystem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Hudi:&lt;/strong&gt; Created by Uber, Hudi focuses on real-time data ingestion and incremental processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Paimon:&lt;/strong&gt; Emerging from the Apache Flink Project, Paimon is designed to optimize streaming and batch processing use cases.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these table formats plays a role in the evolving data lakehouse landscape, enabling organizations to unlock the full potential of their data lakehouse.&lt;/p&gt;
&lt;h2&gt;How Table Formats Work&lt;/h2&gt;
&lt;p&gt;At the core of every table format is a metadata layer that transforms collections of files into a table-like structure. This metadata serves as a blueprint for understanding the data, providing essential details such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Files included in the table:&lt;/strong&gt; Identifying the physical Parquet or similar files that make up the dataset.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partitioning scheme:&lt;/strong&gt; Detailing how the data is partitioned to optimize query performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema:&lt;/strong&gt; Defining the structure of the table, including column names, data types, and constraints.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshot history:&lt;/strong&gt; Tracking changes over time, such as additions, deletions, and updates to the table, enabling features like time travel and rollback.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This metadata acts as an entry point, allowing tools to treat the underlying files as a cohesive table. Instead of scanning all files in a directory, query engines use the metadata to understand the structure and contents of the table. Additionally, the metadata often includes statistics about partitions and individual files. These statistics enable advanced query optimization techniques, such as pruning or skipping files that are irrelevant to a specific query, significantly improving performance.&lt;/p&gt;
&lt;p&gt;While all table formats rely on metadata to bridge the gap between raw files and table functionality, each format structures and optimizes its metadata differently. These differences can influence performance, compatibility, and the features each format provides.&lt;/p&gt;
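&lt;p&gt;The pruning idea described above can be sketched in a few lines of Python. This is a deliberately simplified toy model (the file names and statistics are invented), but it captures how per-file min/max column statistics let an engine skip files whose value range cannot match a query predicate.&lt;/p&gt;

```python
# Sketch of metadata-based file skipping: compare a query's range
# predicate against per-file column statistics and only scan files
# whose [min_ts, max_ts] range could contain matching rows.
# Toy model only; real table formats store these stats in metadata.
files = [
    {"path": "data-1.parquet", "min_ts": 100, "max_ts": 199},
    {"path": "data-2.parquet", "min_ts": 200, "max_ts": 299},
    {"path": "data-3.parquet", "min_ts": 300, "max_ts": 399},
]

def files_to_scan(lo: int, hi: int) -> list:
    """Keep only files whose stats range overlaps the query range."""
    return [f["path"] for f in files
            if f["max_ts"] >= lo and hi >= f["min_ts"]]

# A query for ts between 250 and 320 touches only two of three files.
print(files_to_scan(250, 320))  # ['data-2.parquet', 'data-3.parquet']
```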
&lt;h2&gt;How Apache Iceberg’s Metadata is Structured&lt;/h2&gt;
&lt;p&gt;Apache Iceberg’s metadata structure is what enables it to transform raw data files into highly performant and queryable tables. This structure consists of several interrelated components, each designed to provide specific details about the table and optimize query performance. Here’s an overview of Iceberg’s key metadata elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;metadata.json&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The metadata.json file is the primary entry point for understanding the table.&lt;/li&gt;
&lt;li&gt;This semi-structured JSON object contains information about the table’s schema, partitioning scheme, snapshot history, and other critical details.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Manifest List&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each snapshot in Iceberg has a corresponding Avro-based “manifest list.” This list contains rows representing each manifest (a group of files) that makes up the snapshot.&lt;/li&gt;
&lt;li&gt;Each row includes:
&lt;ul&gt;
&lt;li&gt;The file location of the manifest.&lt;/li&gt;
&lt;li&gt;Partition value information for the files in the manifest.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;This information allows query engines to prune unnecessary manifests and avoid scanning irrelevant partitions, improving query efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Manifests&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A manifest lists one or more Parquet files and includes statistics about each file, such as column summaries.&lt;/li&gt;
&lt;li&gt;These statistics allow query engines to determine whether a file contains data relevant to the query, enabling file skipping for improved performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delete Files&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Delete files track records that have been deleted as part of “merge-on-read” updates. During queries, the engine reconciles these files with the base data, ensuring that deleted records are ignored.&lt;/li&gt;
&lt;li&gt;There is ongoing discussion about transitioning from delete files to a “deletion vector” approach, inspired by Delta Lake, where deletions are tracked using Puffin files. As of this writing, this proposal has not yet been implemented.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Puffin Files&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Puffin files are a format for tracking binary blobs and other metadata, designed to optimize queries for engines that choose to leverage them.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition Stats Files&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;These files summarize statistics at the partition level, enabling even greater optimization for queries that rely on partitioning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;The Evolution of Iceberg’s Specification&lt;/h2&gt;
&lt;p&gt;Apache Iceberg’s specification is constantly evolving through community contributions and proposals. These innovations benefit the entire ecosystem, as improvements made by one platform are shared across others. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Partition Stats Files originated from work by Dremio to enhance query optimization.&lt;/li&gt;
&lt;li&gt;Puffin Files were introduced by the Trino community to improve how Iceberg tracks metadata.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This collaborative approach ensures that Apache Iceberg continues to evolve as a cutting-edge table format for modern data lakehouses.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-metadata-tables/&quot;&gt;Read this article on the Apache Iceberg Metadata tables.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;The Role of Catalogs in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;One of the key features of Apache Iceberg is its immutable file structure, which makes snapshot isolation possible. Every time the data or structure of a table changes, a new metadata.json file is generated. This immutability raises an important question: how does a tool know which metadata.json file is the latest one?&lt;/p&gt;
&lt;p&gt;This is where Lakehouse Catalogs come into play. A Lakehouse Catalog serves as an abstraction layer that tracks each table’s name and links it to the most recent metadata.json file. When a table’s data or structure is updated, the catalog is also updated to point to the new metadata.json file. This update is the final step in any transaction, ensuring that the change is completed successfully and meets the atomicity requirement of ACID compliance.&lt;/p&gt;
&lt;p&gt;Lakehouse Catalogs are distinct from Enterprise Data Catalogs or Metadata Catalogs, such as those provided by companies like Alation and Collibra. While Lakehouse Catalogs focus on managing the technical details of tables and transactions, enterprise data catalogs are designed for end-users. They act as tools to help users discover, understand, and request access to datasets across an organization, enhancing data governance and usability.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-evolution-of-apache-iceberg-catalogs/&quot;&gt;Read this article to learn more about Iceberg catalogs.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;The Apache Iceberg REST Catalog Spec&lt;/h2&gt;
&lt;p&gt;As more catalog implementations emerged, each with unique features and APIs, interoperability between tools and catalogs became a significant challenge. This lack of a unified standard created a bottleneck for seamless table management and cross-platform compatibility.&lt;/p&gt;
&lt;p&gt;To address this issue and drive innovation, the REST Catalog specification was developed. Rather than requiring all catalog providers to adopt a standardized server-side implementation, the specification introduced a universal REST API interface. This approach ensures that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tools and systems can rely on a consistent, client-side library to interact with catalogs.&lt;/li&gt;
&lt;li&gt;Catalog providers maintain the flexibility to implement their server-side systems in ways that suit their needs, as long as they adhere to the standard REST endpoints outlined in the specification.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the REST Catalog specification, interoperability and ease of integration have dramatically improved. This innovation allows developers and enterprises to adopt or build catalogs that align with their technical and business requirements while still being compatible with any tool that supports the REST API interface. This forward-thinking design has strengthened the role of catalogs in modern lakehouse architectures, ensuring that Iceberg tables remain accessible and manageable across diverse platforms.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/what-iceberg-rest-catalog-is-and-isnt-b4a6d056f493&quot;&gt;Read more about the Iceberg REST Spec in this article.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Soft Deletes vs Hard Deleting Data&lt;/h2&gt;
&lt;p&gt;When working with table formats like Apache Iceberg, it’s important to understand how data deletion is handled. Unlike traditional databases, where deleted data is immediately removed from the storage layer, Iceberg follows a different approach to maintain snapshot isolation and enable features like time travel.&lt;/p&gt;
&lt;p&gt;When you execute a delete query, the data is not physically deleted. Instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A new snapshot is created where the deleted data is no longer present.&lt;/li&gt;
&lt;li&gt;The original data files remain intact because the old snapshots are still valid and accessible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach allows users to query previous versions of the table using time travel, providing a powerful mechanism for auditing, debugging, and historical analysis.&lt;/p&gt;
&lt;p&gt;However, this also means that data marked for deletion continues to occupy storage until it is physically removed. To address this, snapshot expiration procedures are performed during table maintenance using tools like Spark or Dremio. These procedures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Invalidate old snapshots that are no longer needed.&lt;/li&gt;
&lt;li&gt;Remove the associated data files from storage, freeing up space.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Regular maintenance is a critical part of managing Iceberg tables to ensure storage efficiency and maintain optimal performance while leveraging the benefits of its snapshot-based architecture.&lt;/p&gt;
&lt;h2&gt;Optimizing Iceberg Data&lt;/h2&gt;
&lt;h3&gt;Minimizing Storage&lt;/h3&gt;
&lt;p&gt;The first step in reducing storage costs is selecting the right compression algorithm for your data. Compression not only reduces the amount of space required to store data but can also improve performance by accelerating data transfer across networks. These compression settings can typically be adjusted at both the table and query engine levels to suit your specific use case.&lt;/p&gt;
&lt;h3&gt;Improving Performance&lt;/h3&gt;
&lt;p&gt;Optimizing performance largely depends on how data is distributed across files. This can be achieved through regular maintenance procedures using tools like Spark or Dremio. These optimizations result in two key outcomes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compaction&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reduces the number of small files and consolidates delete files into fewer, larger files.&lt;/li&gt;
&lt;li&gt;Minimizes the number of I/O operations required during query execution, leading to faster reads.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Clustering/Sorting&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reorganizes data to co-locate similar records within the same files based on commonly queried fields.&lt;/li&gt;
&lt;li&gt;Allows query engines to skip more files during a query, as the data being searched for is concentrated in a smaller subset of files.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging these strategies, Iceberg users can maintain a balance between efficient storage and fast query performance, ensuring their data lakehouse operates at peak efficiency. Regular maintenance is essential for reaping the full benefits of these optimizations.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/guide-to-maintaining-an-apache-iceberg-lakehouse/&quot;&gt;Read this article for more detail on optimizing Apache Iceberg tables.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Hands-on Tutorials&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Intro to Apache Iceberg, Nessie and Dremio on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-json-csv-and-parquet-to-dashboards-with-apache-iceberg-and-dremio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;JSON/CSV/Parquet to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-mongodb-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;From MongoDB to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-sqlserver-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;From SQLServer to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-postgres-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;From Postgres to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/experience-the-dremio-lakehouse-hands-on-with-dremio-nessie-iceberg-data-as-code-and-dbt/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-elasticsearch-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Elasticsearch to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-mysql-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;MySQL to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-apache-druid-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Druid to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/bi-dashboards-with-apache-iceberg-using-aws-glue-and-apache-superset/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=2025comp-iceberg-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;BI Dashboards with Apache Iceberg Using AWS Glue and Apache Superset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/end-to-end-basic-data-engineering-tutorial-spark-dremio-superset-c076a56eaa75&quot;&gt;End-to-End Basic Data Engineering Tutorial (Spark, Apache Iceberg, Dremio, Superset)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>When to use Apache Xtable or Delta Lake Uniform for Data Lakehouse Interoperability</title><link>https://iceberglakehouse.com/posts/2025-01-xtable-or-uniform/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2025-01-xtable-or-uniform/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Tue, 07 Jan 2025 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=xtable-uniform&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The value of the &lt;a href=&quot;https://www.datalakehousehub.com&quot;&gt;lakehouse model&lt;/a&gt;, along with the concept of &amp;quot;shifting left&amp;quot; by moving more data modeling and processing from the data warehouse to the data lake, has seen significant buy-in and adoption over the past few years. A lakehouse integrates data warehouse functionality into a data lake using open table formats, offering the best of both worlds for analytics and storage.&lt;/p&gt;
&lt;p&gt;Enabling lakehouse architecture with open table formats like Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon has introduced the need to manage interoperability between these formats, especially at the boundaries of data systems. While many lakehouse implementations operate seamlessly with a single table format, scenarios arise where multiple formats are involved. To address these challenges, several solutions have emerged.&lt;/p&gt;
&lt;p&gt;In this blog, we will explore these solutions and discuss when it makes sense to use them.&lt;/p&gt;
&lt;h2&gt;The Solutions&lt;/h2&gt;
&lt;p&gt;There are primarily two types of interoperability solutions for working across different table formats:&lt;/p&gt;
&lt;h3&gt;1. Mirroring Metadata&lt;/h3&gt;
&lt;p&gt;These solutions focus on maintaining metadata for the same data files in multiple formats, enabling seamless interaction across systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apache XTable:&lt;/strong&gt;&lt;br&gt;
An open-source project initially developed at Onehouse and now managed by the community, Apache XTable enables bi-directional metadata conversion between different table formats. It includes incremental metadata update features, ensuring efficiency and consistency. For Iceberg, XTable generates the metadata, which can then be registered with your preferred catalog.&lt;/p&gt;
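&lt;p&gt;As a sketch of what this looks like in practice, XTable is driven by a small dataset config file and invoked with its bundled utilities jar. The bucket, paths, and table name below are illustrative placeholders:&lt;/p&gt;

```yaml
# Hypothetical XTable dataset config (my_config.yaml); formats and
# paths are placeholders for your own tables.
sourceFormat: HUDI
targetFormats:
  - ICEBERG
datasets:
  - tableBasePath: s3://my-bucket/warehouse/orders
    tableName: orders
```

&lt;p&gt;The conversion is then run with something like &lt;code&gt;java -jar xtable-utilities-bundled.jar --datasetConfig my_config.yaml&lt;/code&gt;, after which the generated Iceberg metadata can be registered with your preferred catalog.&lt;/p&gt;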
&lt;p&gt;&lt;strong&gt;Delta Lake Uniform:&lt;/strong&gt;&lt;br&gt;
A feature of the Delta Lake format, Delta Lake Uniform allows you to natively write to Delta Lake tables while maintaining a secondary metadata set in Iceberg or Hudi. For Iceberg, it can sync these tables to a Hive Metastore or Unity Catalog. When used with Unity Catalog, these tables can also be exposed for reading through an Iceberg REST Catalog interface, enabling greater flexibility and integration.&lt;/p&gt;
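&lt;p&gt;Enabling Uniform is a table-property change on the Delta Lake side. A minimal sketch (the table name is illustrative, and exact property names may vary by Delta Lake version):&lt;/p&gt;

```sql
-- Hypothetical table; Uniform is enabled via Delta table properties
CREATE TABLE main.sales.orders (id BIGINT, amount DOUBLE)
TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```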
&lt;h3&gt;2. Data Unification Platforms&lt;/h3&gt;
&lt;p&gt;Unified Lakehouse Platforms like &lt;strong&gt;Dremio&lt;/strong&gt; or open-source query engines such as &lt;strong&gt;Trino&lt;/strong&gt; provide another solution by allowing queries across multiple formats without requiring metadata conversion. This approach enables various table formats to coexist while being queried seamlessly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Dremio’s Advantage with Apache Arrow and Reflections:&lt;/strong&gt;&lt;br&gt;
Dremio leverages Apache Arrow for in-memory columnar processing, delivering greater performance than engines like Trino. Additionally, Dremio’s &lt;strong&gt;Reflections&lt;/strong&gt; feature provides pre-aggregated, incremental materializations that significantly accelerate query response times, especially when paired with Apache Iceberg tables. With its built-in semantic layer, Dremio ensures uniform data models that can be used consistently across teams and tools. This enables seamless collaboration, allowing data engineers, analysts, and BI tools to consume data efficiently without duplicating effort on model creation or maintenance.&lt;/p&gt;
&lt;h2&gt;The Use Cases and Which Solution to Use&lt;/h2&gt;
&lt;h3&gt;1. Joining Delta Lake Tables with On-Prem Data&lt;/h3&gt;
&lt;p&gt;If you&apos;re a Databricks user leveraging the Databricks ecosystem and its features but also have on-premises data you&apos;d like to incorporate into certain workflows, a hybrid tool like &lt;strong&gt;Dremio&lt;/strong&gt; can help. Dremio enables you to read Delta Lake tables directly from cloud storage and federate queries with your on-prem data. However, this approach bypasses the governance settings in Unity Catalog and doesn’t take full advantage of Dremio&apos;s powerful acceleration features, such as &lt;strong&gt;Live Reflections&lt;/strong&gt; and &lt;strong&gt;Incremental Reflections&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;A better option is to connect Dremio to Unity Catalog tables and read the Uniform Iceberg version of the metadata. This allows you to maintain Unity Catalog governance while also leveraging Dremio’s advanced acceleration capabilities for optimized query performance.&lt;/p&gt;
&lt;h3&gt;2. Streaming with Hudi and Reading as Iceberg/Delta&lt;/h3&gt;
&lt;p&gt;Apache Hudi is widely used for low-latency, high-frequency upserts in streaming use cases. However, when it comes to consuming this data, broader read support exists for Iceberg and Delta Lake. This is an ideal scenario for &lt;strong&gt;Apache XTable&lt;/strong&gt;, which can handle a continuous, one-way incremental metadata conversion. As data lands in Hudi, XTable can write new metadata in the preferred format, such as Iceberg or Delta, ensuring seamless consumption.&lt;/p&gt;
&lt;h3&gt;3. Using Snowflake and Databricks Side by Side&lt;/h3&gt;
&lt;p&gt;In 2024, Snowflake announced Polaris, which has since become a community-run incubating Apache project. Snowflake offers a managed Polaris service called &lt;strong&gt;Open Catalog&lt;/strong&gt;. Apache Polaris can connect &amp;quot;external catalogs,&amp;quot; which allows Snowflake to read tables from other Iceberg REST Catalog-compliant systems, such as &lt;strong&gt;Nessie&lt;/strong&gt;, &lt;strong&gt;Gravitino&lt;/strong&gt;, &lt;strong&gt;Lakekeeper&lt;/strong&gt;, &lt;strong&gt;AWS Glue&lt;/strong&gt;, and &lt;strong&gt;Unity Catalog&lt;/strong&gt;, directly from Polaris.&lt;/p&gt;
&lt;p&gt;By connecting Unity Catalog as an external catalog, you can utilize &lt;strong&gt;Uniform-enabled tables&lt;/strong&gt; from Delta Lake alongside other datasets within Snowflake, enabling seamless interoperability between Snowflake and Databricks environments.&lt;/p&gt;
&lt;h3&gt;4. Migrating Between Formats&lt;/h3&gt;
&lt;p&gt;If you&apos;re looking to migrate between table formats without rewriting all your data, &lt;strong&gt;Apache XTable&lt;/strong&gt; stands out as the optimal solution. XTable enables smooth transitions, allowing you to adopt a new format with minimal disruption to your existing workflows.&lt;/p&gt;
&lt;h2&gt;Limitations to Keep in Mind&lt;/h2&gt;
&lt;p&gt;When using a mirrored metadata approach to interoperability, there are certain trade-offs to be aware of. One key limitation is the loss of write-side optimizations specific to the secondary format, such as &lt;strong&gt;hidden partitioning&lt;/strong&gt; in Iceberg or &lt;strong&gt;deletion vectors&lt;/strong&gt; in Delta Lake. Below is a list of specific limitations when using Uniform or XTable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Uniform-enabled Delta Lake tables&lt;/strong&gt; do not currently support &lt;strong&gt;Liquid Clustering&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deletion Vectors&lt;/strong&gt; cannot be utilized with Uniform-enabled Delta Lake tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;XTable&lt;/strong&gt; supports only &lt;strong&gt;Copy-on-Write&lt;/strong&gt; or &lt;strong&gt;Read-Optimized Views&lt;/strong&gt; of tables.&lt;/li&gt;
&lt;li&gt;XTable has &lt;strong&gt;limited support&lt;/strong&gt; for Delta Lake&apos;s &lt;strong&gt;Generated Columns&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;As organizations increasingly adopt lakehouse architectures, interoperability across multiple table formats has become a critical need. Solutions like &lt;strong&gt;Apache XTable&lt;/strong&gt; and &lt;strong&gt;Delta Lake Uniform&lt;/strong&gt; offer powerful ways to manage metadata and facilitate collaboration between different systems. Whether you&apos;re joining Delta Lake tables with on-premises data, leveraging Hudi for streaming, integrating Snowflake with Databricks, or migrating between formats, these tools provide flexibility and efficiency.&lt;/p&gt;
&lt;p&gt;However, it’s important to evaluate the limitations of each approach to ensure it aligns with your use case. While mirrored metadata solutions simplify interoperability, they come with trade-offs, particularly around write-side optimizations in the secondary format. By understanding these constraints and leveraging platforms like &lt;strong&gt;Dremio&lt;/strong&gt; for advanced query acceleration and data unification, you can make informed decisions and maximize the potential of your lakehouse ecosystem.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>2025 Guide to Architecting an Iceberg Lakehouse</title><link>https://iceberglakehouse.com/posts/2024-12-2025-guide-architecting-an-iceberg-lakehouse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-12-2025-guide-architecting-an-iceberg-lakehouse/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Mon, 09 Dec 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-2025-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-2025-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-2025-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-2025-guide&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Another year has passed, and 2024 has been an eventful one for the Apache Iceberg table format. Numerous announcements throughout the year have solidified Apache Iceberg&apos;s position as the industry standard for modern data lakehouse architectures.&lt;/p&gt;
&lt;p&gt;Here are some of the highlights from 2024:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; announced the private preview of the &lt;strong&gt;Hybrid Iceberg Catalog&lt;/strong&gt;, extending governance and table maintenance capabilities for both on-premises and cloud environments, building on the cloud catalog&apos;s general availability from previous years.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowflake&lt;/strong&gt; announced the &lt;strong&gt;Polaris Catalog&lt;/strong&gt;, then partnered with Dremio, AWS, Google, and Microsoft to donate it to the Apache Software Foundation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upsolver&lt;/strong&gt; introduced native Iceberg support, including table maintenance for streamed data landing in Iceberg tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Confluent&lt;/strong&gt; unveiled several features aimed at enhancing Iceberg integrations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Databricks&lt;/strong&gt; acquired &lt;strong&gt;Tabular&lt;/strong&gt;, a startup founded by Apache Iceberg creators Ryan Blue, Daniel Weeks, and Jason Reid.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS&lt;/strong&gt; announced specialized S3 table bucket types for native Apache Iceberg support.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BigQuery&lt;/strong&gt; added native Iceberg table support.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Microsoft Fabric&lt;/strong&gt; introduced &amp;quot;Iceberg Links,&amp;quot; enabling seamless access to Iceberg tables within its environment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These advancements, along with many other companies and open-source technologies expanding their support for Iceberg, have made 2024 a remarkable year for the Apache Iceberg ecosystem.&lt;/p&gt;
&lt;p&gt;Looking ahead, there is much to be excited about for Iceberg in 2025, as detailed in &lt;a href=&quot;https://medium.com/data-engineering-with-dremio/10-future-apache-iceberg-developments-to-look-forward-to-in-2025-7292a2a2101d&quot;&gt;this blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With these developments in mind, it&apos;s the perfect time to reflect on how to architect an Apache Iceberg lakehouse. This guide aims to help you design a lakehouse that takes full advantage of Iceberg&apos;s capabilities and the latest industry innovations.&lt;/p&gt;
&lt;h2&gt;Why an Apache Iceberg Lakehouse?&lt;/h2&gt;
&lt;p&gt;Before we dive into the &lt;em&gt;how&lt;/em&gt;, let’s take a moment to reflect on the &lt;em&gt;why&lt;/em&gt;. A lakehouse leverages open table formats like &lt;strong&gt;Iceberg&lt;/strong&gt;, &lt;strong&gt;Delta Lake&lt;/strong&gt;, &lt;strong&gt;Hudi&lt;/strong&gt;, and &lt;strong&gt;Paimon&lt;/strong&gt; to create data warehouse-like tables directly on your data lake. The key advantage of these tables is that they provide the transactional guarantees of a traditional data warehouse without requiring data duplication across platforms or teams.&lt;/p&gt;
&lt;p&gt;This value proposition is a major reason to consider Apache Iceberg in particular. In a world where different teams rely on different tools, Iceberg stands out with the largest ecosystem of tools for reading, writing, and—most importantly—managing Iceberg tables.&lt;/p&gt;
&lt;p&gt;Additionally, recent advancements in portable governance through catalog technologies amplify the benefits of adopting Iceberg. Features like &lt;strong&gt;hidden partitioning&lt;/strong&gt; and &lt;strong&gt;partition evolution&lt;/strong&gt; further enhance Iceberg’s appeal by maximizing flexibility and simplifying partition management. These qualities ensure that you can optimize your data lakehouse architecture for both performance and cost.&lt;/p&gt;
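&lt;p&gt;To make these two features concrete, here is a Spark SQL sketch (catalog and table names are illustrative): hidden partitioning is declared with a transform rather than a physical column, and partition evolution is a metadata-only change:&lt;/p&gt;

```sql
-- Partition by day(event_ts) without exposing a derived column to queries
CREATE TABLE lake.db.events (id BIGINT, event_ts TIMESTAMP)
USING iceberg
PARTITIONED BY (days(event_ts));

-- Later, evolve the partition spec; existing data files are not rewritten
ALTER TABLE lake.db.events ADD PARTITION FIELD bucket(16, id);
```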
&lt;h2&gt;Pre-Architecture Audit&lt;/h2&gt;
&lt;p&gt;Before we begin architecting your Apache Iceberg Lakehouse, it’s essential to perform a self-audit to clearly define your requirements. Document answers to the following questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Where is my data currently?&lt;/strong&gt;&lt;br&gt;
Understanding where your data resides—whether on-premises, in the cloud, or across multiple locations—helps you plan for migration, integration, and governance challenges.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Which of my data is the most accessed by different teams?&lt;/strong&gt;&lt;br&gt;
Identifying the most frequently accessed datasets ensures you prioritize optimizing performance for these critical assets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Which of my data is the highest cost generator?&lt;/strong&gt;&lt;br&gt;
Knowing which datasets drive the highest costs allows you to focus on cost-saving strategies, such as tiered storage or optimizing query performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Which data platforms will I still need if I standardize on Iceberg?&lt;/strong&gt;&lt;br&gt;
This helps you assess which existing systems can coexist with Iceberg and which ones may need to be retired or reconfigured.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are the SLAs I need to meet?&lt;/strong&gt;&lt;br&gt;
Service-level agreements (SLAs) dictate the performance, availability, and recovery time objectives your architecture must support.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What tools are accessing my data, and which of those are non-negotiables?&lt;/strong&gt;&lt;br&gt;
Understanding the tools your teams rely on—especially non-negotiable ones—ensures that the ecosystem around your Iceberg lakehouse remains compatible and functional.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are my regulatory barriers?&lt;/strong&gt;&lt;br&gt;
Compliance with industry regulations or organizational policies must be factored into your architecture to avoid potential risks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By answering these questions, you can determine which platforms align with your needs and identify the components required to generate, track, consume, and maintain your Apache Iceberg data effectively.&lt;/p&gt;
&lt;h2&gt;The Components of an Apache Iceberg Lakehouse&lt;/h2&gt;
&lt;p&gt;When moving to an Apache Iceberg lakehouse, certain fundamentals are a given—most notably that your data will be stored as &lt;strong&gt;Parquet files&lt;/strong&gt; with &lt;strong&gt;Iceberg metadata&lt;/strong&gt;. However, building a functional lakehouse requires several additional components to be carefully planned and implemented.&lt;/p&gt;
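&lt;p&gt;For intuition on what &amp;quot;Iceberg metadata&amp;quot; means, a table&apos;s state lives in a JSON metadata document whose top-level fields are defined by the Iceberg table spec. A minimal Python sketch (the values below are illustrative placeholders, not a complete metadata file):&lt;/p&gt;

```python
import json

# Trimmed-down table-metadata document; field names follow the Iceberg
# table spec, values are placeholders.
metadata = json.loads("""
{
  "format-version": 2,
  "table-uuid": "9c12d441-03fe-4693-9a96-a0705ddf69c1",
  "location": "s3://bucket/warehouse/db/events",
  "current-snapshot-id": 3051729675574597004,
  "snapshots": [
    {"snapshot-id": 3051729675574597004, "timestamp-ms": 1515100955770}
  ]
}
""")

# Engines resolve the table's current state by matching
# current-snapshot-id against the snapshot log.
current = next(
    s for s in metadata["snapshots"]
    if s["snapshot-id"] == metadata["current-snapshot-id"]
)
print(metadata["format-version"], current["timestamp-ms"])  # 2 1515100955770
```

&lt;p&gt;Every commit writes a new metadata file and snapshot entry, which is what makes time travel and catalog-level governance possible.&lt;/p&gt;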
&lt;h3&gt;Key Components of an Apache Iceberg Lakehouse&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;br&gt;
Where will your data be stored? The choice of storage system (e.g., cloud object storage like AWS S3 or on-premises systems) impacts cost, scalability, and performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Catalog&lt;/strong&gt;&lt;br&gt;
How will your tables be tracked and governed? A catalog, such as &lt;strong&gt;Nessie&lt;/strong&gt;, &lt;strong&gt;Hive&lt;/strong&gt;, or &lt;strong&gt;AWS Glue&lt;/strong&gt;, is critical for managing metadata, enabling versioning, and supporting governance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ingestion&lt;/strong&gt;&lt;br&gt;
What tools will you use to write data to your Iceberg tables? Ingestion tools (e.g., &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Flink&lt;/strong&gt;, &lt;strong&gt;Kafka Connect&lt;/strong&gt;) ensure data is efficiently loaded into Iceberg tables in the required format.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integration&lt;/strong&gt;&lt;br&gt;
How will you work with Iceberg tables alongside other data? Integration tools (e.g., &lt;strong&gt;Dremio&lt;/strong&gt;, &lt;strong&gt;Trino&lt;/strong&gt;, or &lt;strong&gt;Presto&lt;/strong&gt;) allow you to query and combine Iceberg tables with other datasets and build a semantic layer that defines common business metrics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consumption&lt;/strong&gt;&lt;br&gt;
What tools will you use to extract value from the data? Whether for training machine learning models, generating BI dashboards, or conducting ad hoc analytics, consumption tools (e.g., &lt;strong&gt;Tableau&lt;/strong&gt;, &lt;strong&gt;Power BI&lt;/strong&gt;, &lt;strong&gt;dbt&lt;/strong&gt;) ensure data is accessible for end-users and teams.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;In this guide, we’ll explore each of these components in detail and provide guidance on how to evaluate and select the best options for your specific use case.&lt;/p&gt;
&lt;h2&gt;Storage: Building the Foundation of Your Iceberg Lakehouse&lt;/h2&gt;
&lt;p&gt;Choosing the right storage solution is critical to the success of your Apache Iceberg lakehouse. Your decision will impact performance, scalability, cost, and compliance. Below, we’ll explore the considerations for selecting cloud, on-premises, or hybrid storage, compare cloud vendors, and evaluate alternative solutions.&lt;/p&gt;
&lt;h3&gt;Reasons to Choose Cloud, On-Premises, or Hybrid Storage&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cloud Storage&lt;/strong&gt;:&lt;br&gt;
Cloud storage offers scalability, cost efficiency, and managed services. It’s ideal for businesses prioritizing flexibility, global accessibility, and reduced operational overhead. Examples include &lt;strong&gt;AWS S3&lt;/strong&gt;, &lt;strong&gt;Google Cloud Storage&lt;/strong&gt;, and &lt;strong&gt;Azure Data Lake Storage (ADLS)&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On-Premises Storage&lt;/strong&gt;:&lt;br&gt;
On-premises solutions provide greater control over data and are often preferred for compliance, security, or latency-sensitive workloads. These solutions require significant investment in hardware and maintenance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hybrid Storage&lt;/strong&gt;:&lt;br&gt;
Hybrid storage combines the benefits of both worlds. You can use on-premises storage for sensitive or high-frequency data while leveraging the cloud for archival, burst workloads, or global access.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Considerations When Choosing a Cloud Vendor&lt;/h3&gt;
&lt;p&gt;When selecting a cloud provider, consider the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integration with Your Tech Stack&lt;/strong&gt;:&lt;br&gt;
Ensure the vendor works seamlessly with your compute and analytics tools (e.g., Apache Spark, Dremio).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost Structure&lt;/strong&gt;:&lt;br&gt;
Evaluate storage costs, retrieval fees, and data transfer costs. Some providers, like AWS, offer tiered storage options to optimize costs for infrequent data access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Global Availability and Latency&lt;/strong&gt;:&lt;br&gt;
If your organization operates globally, consider a provider with a robust network of regions to minimize latency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem Services&lt;/strong&gt;:&lt;br&gt;
Consider additional services like data lakes, ML tools, or managed databases provided by the vendor.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Considerations for Alternative Storage Solutions&lt;/h3&gt;
&lt;p&gt;In addition to cloud and traditional on-prem options, there are specialized storage systems to consider:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;NetApp StorageGrid&lt;/strong&gt;: Optimized for object storage with S3 compatibility and strong data lifecycle management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VAST Data&lt;/strong&gt;: Designed for high-performance workloads, leveraging technologies like NVMe over Fabrics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO&lt;/strong&gt;: An open-source, high-performance object storage system compatible with S3 APIs, ideal for hybrid environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pure Storage&lt;/strong&gt;: Offers scalable, all-flash solutions for high-throughput workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dell EMC&lt;/strong&gt;: Provides a range of storage solutions for diverse enterprise needs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nutanix&lt;/strong&gt;: Combines hyper-converged infrastructure with scalable object storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Questions to Ask Yourself When Deciding on Storage&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are my performance requirements?&lt;/strong&gt;&lt;br&gt;
Determine the latency, throughput, and IOPS needs of your workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is my budget?&lt;/strong&gt;&lt;br&gt;
Consider initial costs, ongoing costs, and scalability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are my compliance and security needs?&lt;/strong&gt;&lt;br&gt;
Identify regulatory requirements and whether you need fine-grained access controls or encryption.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How frequently will I access my data?&lt;/strong&gt;&lt;br&gt;
Choose between high-performance tiers and cost-effective archival solutions based on access patterns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Do I need scalability and flexibility?&lt;/strong&gt;&lt;br&gt;
Assess whether your workloads will grow significantly or require frequent adjustments.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are my geographic and redundancy needs?&lt;/strong&gt;&lt;br&gt;
Decide if data needs to be replicated across regions or stored locally for compliance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Selecting the right storage for your Iceberg lakehouse is a foundational step. By thoroughly evaluating your needs and the available options, you can ensure a storage solution that aligns with your performance, cost, and governance requirements.&lt;/p&gt;
&lt;h2&gt;Catalog: Managing Your Iceberg Tables&lt;/h2&gt;
&lt;p&gt;A lakehouse catalog is essential for tracking your Apache Iceberg tables and ensuring consistent access to the latest metadata across tools and teams. The catalog serves as a centralized registry, enabling seamless governance and collaboration.&lt;/p&gt;
&lt;h3&gt;Types of Iceberg Lakehouse Catalogs&lt;/h3&gt;
&lt;p&gt;Iceberg lakehouse catalogs come in two main flavors:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Self-Managed Catalogs&lt;/strong&gt;&lt;br&gt;
With a self-managed catalog, you deploy and maintain your own catalog system. Examples include &lt;strong&gt;Nessie&lt;/strong&gt;, &lt;strong&gt;Hive&lt;/strong&gt;, &lt;strong&gt;Polaris&lt;/strong&gt;, &lt;strong&gt;Lakekeeper&lt;/strong&gt;, and &lt;strong&gt;Gravitino&lt;/strong&gt;. While this approach requires operational effort to maintain the deployment, it provides portability of your tables and governance capabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Managed Catalogs&lt;/strong&gt;&lt;br&gt;
Managed catalogs are provided as a service, offering the same benefits of portability and governance while eliminating the overhead of maintaining the deployment. Examples include &lt;strong&gt;Dremio Catalog&lt;/strong&gt; and &lt;strong&gt;Snowflake&apos;s Open Catalog&lt;/strong&gt;, which are managed versions of Polaris.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Importance of the Iceberg REST Catalog Specification&lt;/h3&gt;
&lt;p&gt;A key consideration when selecting a catalog is whether it supports the &lt;strong&gt;Iceberg REST Catalog Spec&lt;/strong&gt;. This specification ensures compatibility with the broader Iceberg ecosystem, providing assurance that your lakehouse can integrate seamlessly with other Iceberg tools.&lt;/p&gt;
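&lt;p&gt;To illustrate what that compatibility buys you, the REST spec defines a fixed set of HTTP routes that every compliant catalog serves. A small Python sketch of how a client assembles them (the base URI and prefix are illustrative; no network calls are made):&lt;/p&gt;

```python
# Route shapes from the Iceberg REST Catalog spec; any compliant catalog
# (Polaris, Nessie, Lakekeeper, ...) serves these under its own base URI.
BASE = "http://localhost:8181"  # illustrative endpoint

def route(path: str, prefix: str = "") -> str:
    # /v1/config is called first and may return a prefix that clients
    # must insert into subsequent routes.
    parts = [BASE, "v1"]
    if prefix:
        parts.append(prefix)
    parts.append(path)
    return "/".join(parts)

print(route("config"))                             # capability discovery
print(route("namespaces", "warehouse"))            # list namespaces
print(route("namespaces/db/tables", "warehouse"))  # list tables in db
```

&lt;p&gt;Because every engine speaks the same routes, swapping one REST-compliant catalog for another does not require changing your query tools.&lt;/p&gt;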
&lt;h4&gt;Catalogs Supporting the REST Spec:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Polaris&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gravitino&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unity Catalog&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lakekeeper&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Catalogs Without REST Spec Support (Yet):&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hive&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JDBC&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Glue&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BigQuery&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Choosing the Right Catalog&lt;/h3&gt;
&lt;p&gt;Here are some considerations to guide your choice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you have on-prem data&lt;/strong&gt;:&lt;br&gt;
Dremio Catalog is the only managed catalog offering that allows on-prem tables to coexist with cloud tables.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you are already a Snowflake user&lt;/strong&gt;:&lt;br&gt;
Snowflake&apos;s Open Catalog offers an easy path to adopting Iceberg, allowing you to leverage Iceberg while staying within the Snowflake ecosystem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you use Databricks with Delta Lake&lt;/strong&gt;:&lt;br&gt;
Unity Catalog’s &lt;strong&gt;Uniform&lt;/strong&gt; feature allows you to maintain an Iceberg copy of your Delta Lake table metadata, enabling compatibility with the Iceberg ecosystem.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;If you are heavily invested in the AWS ecosystem&lt;/strong&gt;:&lt;br&gt;
AWS Glue provides excellent interoperability within AWS. However, its lack of REST Catalog support may limit its usability outside the AWS ecosystem.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Selecting the right catalog is critical for ensuring your Iceberg lakehouse operates efficiently and integrates well with your existing tools. By understanding the differences between self-managed and managed catalogs, as well as the importance of REST Catalog support, you can make an informed decision that meets your needs for portability, governance, and compatibility.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-thinking-about-apache-iceberg-catalogs-like-nessie-and-apache-polaris-incubating-matters/&quot;&gt;Why Thinking about Apache Iceberg Catalogs Matters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/the-importance-of-dremios-hybrid-lakehouse-catalog-b9ee9937ab4e?source=---------3&quot;&gt;Importance of Dremio&apos;s Hybrid Lakehouse Catalog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Ingesting Data into Iceberg: Managing the Flow of Data&lt;/h2&gt;
&lt;p&gt;Ingesting data into Apache Iceberg tables is a critical step in building a functional lakehouse. The tools and strategies you choose will depend on your infrastructure, data workflows, and resource constraints. Let’s explore the key options and considerations for data ingestion.&lt;/p&gt;
&lt;h3&gt;Managing Your Own Ingestion Clusters&lt;/h3&gt;
&lt;p&gt;For those who prefer complete control, managing your own ingestion clusters offers flexibility and customization. This approach allows you to handle both &lt;strong&gt;batch&lt;/strong&gt; and &lt;strong&gt;streaming&lt;/strong&gt; data using tools like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt;: Ideal for large-scale batch processing and ETL workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt; or &lt;strong&gt;Apache Flink&lt;/strong&gt;: Excellent choices for real-time streaming data ingestion.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While these tools provide robust capabilities, they require significant effort to deploy, monitor, and maintain.&lt;/p&gt;
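&lt;p&gt;As a minimal sketch of what batch ingestion looks like with Spark&apos;s Iceberg integration (catalog, table, and view names are illustrative), an upsert of staged records can be expressed as a single &lt;code&gt;MERGE&lt;/code&gt;:&lt;/p&gt;

```sql
-- Upsert a staged batch into an Iceberg table
MERGE INTO lake.db.events t
USING staging_updates s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```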
&lt;h3&gt;Leveraging Managed Services for Ingestion&lt;/h3&gt;
&lt;p&gt;If operational overhead is a concern, managed services can streamline the ingestion process. These services handle much of the complexity, offering ease of use and scalability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Batch Ingestion Tools&lt;/strong&gt;:&lt;br&gt;
Examples include &lt;strong&gt;Fivetran&lt;/strong&gt;, &lt;strong&gt;Airbyte&lt;/strong&gt;, &lt;strong&gt;AWS Glue&lt;/strong&gt;, and &lt;strong&gt;Etleap&lt;/strong&gt;. These tools are well-suited for scheduled ETL tasks and periodic data loads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streaming Ingestion Tools&lt;/strong&gt;:&lt;br&gt;
Examples include &lt;strong&gt;Upsolver&lt;/strong&gt;, &lt;strong&gt;DeltaStream&lt;/strong&gt;, &lt;strong&gt;Estuary&lt;/strong&gt;, &lt;strong&gt;Confluent&lt;/strong&gt;, and &lt;strong&gt;Decodable&lt;/strong&gt;, which are optimized for real-time data processing and ingestion.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Questions to Ask When Selecting Ingestion Tools&lt;/h3&gt;
&lt;p&gt;To narrow down your options and define your hard requirements, consider the following questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is the nature of your data workflow?&lt;/strong&gt;&lt;br&gt;
Determine if your use case primarily involves batch processing, streaming data, or a combination of both.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is your tolerance for operational complexity?&lt;/strong&gt;&lt;br&gt;
Decide whether you want to manage your own clusters or prefer managed services to reduce overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What are your performance and scalability requirements?&lt;/strong&gt;&lt;br&gt;
Assess whether your ingestion tool can handle the volume, velocity, and variety of your data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How critical is real-time processing?&lt;/strong&gt;&lt;br&gt;
If near-instantaneous data updates are crucial, prioritize streaming tools over batch processing solutions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is your existing tech stack?&lt;/strong&gt;&lt;br&gt;
Consider tools that integrate well with your current infrastructure, such as cloud services, catalogs, or BI tools.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is your budget?&lt;/strong&gt;&lt;br&gt;
Balance cost considerations between self-managed clusters (higher operational costs) and managed services (subscription-based pricing).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
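One lightweight way to turn the questions above into a decision is a weighted checklist. The sketch below is purely illustrative: the candidate tools, feature flags, and weights are hypothetical placeholders you would replace with your own requirements.

```python
# Hypothetical sketch: score candidate ingestion tools against the
# requirements from the questions above. Names and weights are
# illustrative only, not a real product comparison.
REQUIREMENTS = {
    "streaming": 3,       # how critical is real-time processing?
    "managed": 2,         # tolerance for operational complexity
    "iceberg_native": 3,  # writes directly to Iceberg tables
    "low_cost": 1,        # budget sensitivity
}

CANDIDATES = {
    "Apache Spark (self-managed)": {"streaming": False, "managed": False,
                                    "iceberg_native": True, "low_cost": True},
    "Apache Flink (self-managed)": {"streaming": True, "managed": False,
                                    "iceberg_native": True, "low_cost": True},
    "Managed streaming service":   {"streaming": True, "managed": True,
                                    "iceberg_native": True, "low_cost": False},
}

def score(features: dict) -> int:
    """Sum the weights of every requirement the tool satisfies."""
    return sum(w for req, w in REQUIREMENTS.items() if features.get(req))

ranked = sorted(CANDIDATES, key=lambda name: score(CANDIDATES[name]),
                reverse=True)
```

Adjusting the weights (for example, raising `low_cost` if budget dominates) will reorder the ranking, which is exactly the trade-off discussion the questions are meant to surface.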
&lt;p&gt;Choosing the right ingestion strategy is essential for ensuring your Iceberg lakehouse runs smoothly. By weighing the trade-offs between managing your own ingestion clusters and leveraging managed services, and by asking the right questions, you can design an ingestion pipeline that aligns with your performance, cost, and operational goals.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/cdc-with-apache-iceberg/&quot;&gt;Apache Iceberg CDC Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/8-tools-for-ingesting-data-into-apache-iceberg/&quot;&gt;8 Tools for Apache Iceberg Ingestion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Data Integration: Bridging the Gap for a Unified Lakehouse Experience&lt;/h2&gt;
&lt;p&gt;Not all your data will migrate to Apache Iceberg immediately—or ever. Moving existing workloads to Iceberg requires thoughtful planning and a phased approach. However, you can still deliver the &amp;quot;Iceberg Lakehouse experience&amp;quot; to your end-users upfront, even if not all your data resides in Iceberg. This is where data integration, data virtualization, or a unified lakehouse platform like &lt;strong&gt;Dremio&lt;/strong&gt; becomes invaluable.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-dremio-and-apache-iceberg/&quot;&gt;How Dremio Enhances the Iceberg Journey&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/dremio-best-sql-engine-for-apache-iceberg/&quot;&gt;3 Reasons Dremio is Best Query Engine for Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/10-use-cases-for-dremio-in-your-data-architecture-64a98d2be8bc?source=---------0&quot;&gt;10 Use Cases for Dremio in your Data Architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why Dremio for Data Integration?&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unified Access Across Data Sources&lt;/strong&gt;&lt;br&gt;
Dremio allows you to connect and query all your data sources in one place. Even if your datasets haven’t yet migrated to Iceberg, you can combine them with Iceberg tables seamlessly. Dremio’s fast query engine ensures performant analytics, regardless of where your data resides.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Built-In Semantic Layer for Consistency&lt;/strong&gt;&lt;br&gt;
Dremio includes a built-in semantic layer to define commonly used datasets across teams. This layer ensures consistent and accurate data usage for your entire organization. Since the semantic layer is based on SQL views, transitioning data from its original source to an Iceberg table is seamless—simply update the SQL definition of the views. Your end-users won’t even notice the change, yet they’ll immediately benefit from the migration.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Performance Boost with Iceberg-Based Reflections&lt;/strong&gt;&lt;br&gt;
Dremio’s &lt;strong&gt;Reflections&lt;/strong&gt; feature accelerates queries on your data. When your data is natively in Iceberg, reflections are refreshed incrementally and updated automatically when the underlying dataset changes. This results in faster query performance and reduced maintenance effort. Learn more about reflections in &lt;a href=&quot;https://www.dremio.com/blog/iceberg-lakehouses-and-dremio-reflections/&quot;&gt;this blog post&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
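The view-swap migration described in point 2 can be sketched with a toy registry: consumers resolve views by name, so repointing a view at an Iceberg table is invisible to them. The view names and SQL strings below are hypothetical, and a real semantic layer lives in Dremio rather than a Python dictionary.

```python
# Illustrative sketch of the view-swap migration: consumers query a named
# view; migrating the underlying table to Iceberg only changes the view's
# SQL definition. Names and SQL strings are hypothetical.
semantic_layer = {
    "sales_summary": "SELECT region, SUM(amount) AS total "
                     "FROM postgres.sales GROUP BY region"
}

def query(view_name: str) -> str:
    """Consumers resolve views by name; they never see the source table."""
    return semantic_layer[view_name]

# Migration: repoint the view at the new Iceberg table.
# Consumers' queries are untouched.
semantic_layer["sales_summary"] = ("SELECT region, SUM(amount) AS total "
                                   "FROM iceberg.sales GROUP BY region")
```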
&lt;h3&gt;Delivering the Lakehouse Experience&lt;/h3&gt;
&lt;p&gt;As more of your data lands in Iceberg, Dremio enables you to seamlessly integrate it into a governed semantic layer. This layer supports a wide range of data consumers, including BI tools, notebooks, and reporting platforms, ensuring all teams can access and use the data they need effectively.&lt;/p&gt;
&lt;p&gt;By leveraging Dremio, you can bridge the gap between legacy data systems and your Iceberg lakehouse, providing a consistent and performant data experience while migrating to Iceberg at a pace that works for your organization.&lt;/p&gt;
&lt;h2&gt;Consumers: Empowering Teams with Accessible Data&lt;/h2&gt;
&lt;p&gt;Once your data is stored, integrated, and organized in your Iceberg lakehouse, the final step is ensuring it can be consumed effectively by your teams. Data consumers rely on various tools for analytics, reporting, visualization, and machine learning. A robust lakehouse architecture ensures that all these tools can access the data they need, even if they don’t natively support Apache Iceberg.&lt;/p&gt;
&lt;h3&gt;Types of Data Consumers and Their Tools&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Python Notebooks&lt;/strong&gt;&lt;br&gt;
Python notebooks, such as &lt;strong&gt;Jupyter&lt;/strong&gt;, &lt;strong&gt;Google Colab&lt;/strong&gt;, or &lt;strong&gt;VS Code Notebooks&lt;/strong&gt;, are widely used by data scientists and analysts for exploratory data analysis, data visualization, and machine learning. These notebooks leverage libraries like &lt;strong&gt;Pandas&lt;/strong&gt;, &lt;strong&gt;PyArrow&lt;/strong&gt;, and &lt;strong&gt;Dask&lt;/strong&gt; to process data from Iceberg tables, often via a platform like Dremio for seamless access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;BI Tools&lt;/strong&gt;&lt;br&gt;
Business intelligence tools like &lt;strong&gt;Tableau&lt;/strong&gt;, &lt;strong&gt;Power BI&lt;/strong&gt;, and &lt;strong&gt;Looker&lt;/strong&gt; are used to create interactive dashboards and reports. While these tools may not natively support Iceberg, Dremio acts as a bridge, providing direct access to Iceberg tables and unifying them with other datasets through its semantic layer.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reporting Tools&lt;/strong&gt;&lt;br&gt;
Tools such as &lt;strong&gt;Crystal Reports&lt;/strong&gt;, &lt;strong&gt;Microsoft Excel&lt;/strong&gt;, and &lt;strong&gt;Google Sheets&lt;/strong&gt; are commonly used for generating structured reports. Dremio&apos;s integration capabilities make it easy for reporting tools to query Iceberg tables alongside other data sources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Machine Learning Platforms&lt;/strong&gt;&lt;br&gt;
Platforms like &lt;strong&gt;Databricks&lt;/strong&gt;, &lt;strong&gt;SageMaker&lt;/strong&gt;, or &lt;strong&gt;Azure ML&lt;/strong&gt; require efficient access to large datasets for training models. With Dremio, these platforms can query Iceberg tables directly or through unified views, simplifying data preparation workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ad Hoc Querying Tools&lt;/strong&gt;&lt;br&gt;
Tools like &lt;strong&gt;DBeaver&lt;/strong&gt;, &lt;strong&gt;SQL Workbench&lt;/strong&gt;, or even command-line utilities are popular among engineers and analysts for quick SQL-based data exploration. These tools can connect to Dremio to query Iceberg tables without additional configuration.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Dremio as the Integration Layer&lt;/h3&gt;
&lt;p&gt;Most platforms, even if they don’t have native Iceberg capabilities, can leverage Dremio to access Iceberg tables alongside other datasets. Here’s how Dremio enhances the consumer experience:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unified Data Access&lt;/strong&gt;:&lt;br&gt;
Dremio’s ability to virtualize data from multiple sources means that end-users don’t need to know where the data resides. Whether it’s Iceberg tables or legacy systems, all datasets can be queried together.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Semantic Layer&lt;/strong&gt;:&lt;br&gt;
Dremio’s semantic layer defines business metrics and datasets, ensuring consistent definitions across all tools and teams. Users querying data via BI tools or Python notebooks can rely on the same, agreed-upon metrics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Performance Optimization&lt;/strong&gt;:&lt;br&gt;
Dremio’s &lt;strong&gt;Reflections&lt;/strong&gt; accelerate queries, providing near-instant response times for dashboards, reports, and interactive analyses, even with large Iceberg datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By enabling data consumers with tools they already know and use, your Iceberg lakehouse can become a powerful, accessible platform for delivering insights and driving decisions. Leveraging Dremio ensures that even tools without native Iceberg support can fully participate in your data ecosystem, helping you maximize the value of your Iceberg lakehouse.&lt;/p&gt;
&lt;h2&gt;Conclusion: Your Journey to a Seamless Iceberg Lakehouse&lt;/h2&gt;
&lt;p&gt;Architecting an Iceberg Lakehouse is not just about adopting a new technology; it’s about transforming how your organization stores, governs, integrates, and consumes data. This guide has walked you through the essential components—from storage and catalogs to ingestion, integration, and consumption—highlighting the importance of thoughtful planning and the tools available to support your journey.&lt;/p&gt;
&lt;p&gt;Apache Iceberg’s open table format, with its unique features like hidden partitioning, partition evolution, and broad ecosystem support, provides a solid foundation for a modern data lakehouse. By leveraging tools like &lt;strong&gt;Dremio&lt;/strong&gt; for integration and query acceleration, you can deliver the &amp;quot;Iceberg Lakehouse experience&amp;quot; to your teams immediately, even as you transition existing workloads over time.&lt;/p&gt;
&lt;p&gt;As 2025 unfolds, the Apache Iceberg ecosystem will continue to grow, bringing new innovations and opportunities to refine your architecture further. By taking a structured approach and selecting the right tools for your needs, you can build a flexible, performant, and cost-efficient lakehouse that empowers your organization to make data-driven decisions at scale.&lt;/p&gt;
&lt;p&gt;Let this guide be the starting point for your Iceberg Lakehouse journey—designed for today and ready for the future.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>10 Future Apache Iceberg Developments to Look forward to in 2025</title><link>https://iceberglakehouse.com/posts/2024-11-10-Iceberg-developments/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-10-Iceberg-developments/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Mon, 25 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg remains at the forefront of innovation, redefining how we think about data lakehouse architectures. In 2025, the Iceberg ecosystem is poised for significant advancements that will empower organizations to handle data more efficiently, securely, and at scale. From enhanced interoperability with modern data tools to new features that simplify data management, the year ahead promises to be transformative. In this blog, we’ll explore 10 exciting developments in the Apache Iceberg ecosystem that you should keep an eye on, offering a glimpse into the future of open data lakehouse technology.&lt;/p&gt;
&lt;h2&gt;1. Scan Planning Endpoint in the Iceberg REST Catalog Specification&lt;/h2&gt;
&lt;p&gt;One of the most anticipated updates in the Iceberg ecosystem for 2025 is the addition of a &amp;quot;Scan Planning&amp;quot; endpoint to the Iceberg REST Catalog specification. This enhancement will allow query engines to delegate scan planning—the process of reading metadata to determine which files are needed for a query—to the catalog itself. This new capability opens the door to several exciting possibilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized Scan Planning with Caching&lt;/strong&gt;: By handling scan planning at the catalog level, frequently submitted queries can benefit from cached scan plans. This optimization reduces redundant metadata reads and accelerates query execution, irrespective of the engine used to submit the query.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enhanced Interoperability Between Table Formats&lt;/strong&gt;: With the catalog managing scan planning, the responsibility of supporting table formats shifts from the engine to the catalog. This makes it possible for Iceberg REST-compliant catalogs to facilitate querying tables in multiple formats. For example, a catalog could generate file lists for queries across various table formats, paving the way for broader interoperability.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Looking ahead, the introduction of this endpoint is not only a step toward improving query performance but also a glimpse into a future where catalogs become the central hub for table format compatibility. To fully realize this vision, a similar endpoint for handling metadata writes may be introduced in the future, further extending the catalog&apos;s capabilities.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/apache/iceberg/pull/11369&quot;&gt;Scan Planning Pull Request&lt;/a&gt;&lt;/p&gt;
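The caching benefit can be made concrete with a small sketch: if the cache key combines the table's current snapshot with the query's filter, any engine submitting the same scan against the same snapshot reuses the plan, and a new snapshot naturally produces a new key. The planning function and file names below are hypothetical, not the REST specification's actual payloads.

```python
# Sketch of catalog-side scan-plan caching. The cache key is
# (table, snapshot, filter), so identical scans from different engines
# share one plan, and new snapshots miss the cache automatically.
plan_cache: dict = {}
planning_calls = 0

def plan_scan(table: str, snapshot_id: int, filter_expr: str) -> list:
    global planning_calls
    key = (table, snapshot_id, filter_expr)
    if key not in plan_cache:
        planning_calls += 1  # simulate an expensive metadata read
        plan_cache[key] = [f"{table}/data/file-{snapshot_id}-{i}.parquet"
                           for i in range(2)]
    return plan_cache[key]

# Two engines submit the same query: metadata is read only once.
plan_scan("sales", 41, "region = 'EU'")
plan_scan("sales", 41, "region = 'EU'")
# A new snapshot invalidates the cached plan naturally (new key).
plan_scan("sales", 42, "region = 'EU'")
```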
&lt;h2&gt;2. Interoperable Views in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;Interoperable views are another major development to watch in the Apache Iceberg ecosystem for 2025. While Iceberg already supports a view specification, the current approach has limitations: it stores the SQL used to define the view, but since SQL syntax varies across engines, resolving these views is not always feasible in a multi-engine environment.&lt;/p&gt;
&lt;p&gt;To address this challenge, two promising solutions are being explored:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SQL Transpilation with Frameworks like SQLGlot&lt;/strong&gt;: By leveraging SQL transpilation tools such as SQLGlot, the SQL defining a view can be translated between different dialects. This approach builds on the existing view specification, which includes a &amp;quot;dialect&amp;quot; property to identify the SQL syntax used to define the view. This enables engines to resolve views by translating the SQL into a dialect they support.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Intermediate Representation for Views&lt;/strong&gt;: Another approach involves using an intermediate format to represent views, independent of SQL syntax. Two notable projects being discussed in this context are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Apache Calcite&lt;/strong&gt;: An open-source project that provides a framework for parsing, validating, and optimizing relational algebra queries. Calcite could serve as a bridge, converting SQL into a standardized logical plan that any engine can execute.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Substrait&lt;/strong&gt;: A cross-language specification for defining and exchanging query plans. Substrait focuses on representing queries in a portable, engine-agnostic format, making it a strong candidate for enabling true interoperability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These advancements aim to make views in Iceberg truly interoperable, allowing seamless sharing and resolution of views across different engines and workflows. Whether through SQL transpilation or an intermediate format, these improvements will significantly enhance Iceberg&apos;s flexibility in heterogeneous data environments.&lt;/p&gt;
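To make the transpilation approach concrete, here is a deliberately tiny sketch: the view record carries the spec's "dialect" property, and a resolver rewrites dialect-specific function names it does not support. A real implementation would use a full SQL parser such as SQLGlot; the single-function mapping below exists only to show the shape of the idea.

```python
# Toy illustration of dialect transpilation for view resolution.
# Real systems would use a parser-based tool like SQLGlot; this mapping
# of one function name is illustrative only.
import re

FUNCTION_MAP = {
    ("spark", "trino"): {"NVL": "COALESCE"},   # hypothetical mapping
}

def transpile(sql: str, read: str, write: str) -> str:
    """Rewrite known dialect-specific function names, leaving the rest."""
    for src, dst in FUNCTION_MAP.get((read, write), {}).items():
        sql = re.sub(rf"\b{src}\b", dst, sql)
    return sql

# The view spec's "dialect" property tells the engine what to translate from.
view = {"dialect": "spark", "sql": "SELECT NVL(region, 'unknown') FROM sales"}
resolved = transpile(view["sql"], view["dialect"], "trino")
```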
&lt;h2&gt;3. Materialized Views in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;A materialized view stores a query definition as a logical table, with precomputed data that serves query results. By shifting the computational cost to precomputation, materialized views significantly improve query performance while maintaining flexibility. The Iceberg community is working towards a common metadata format for materialized views, enabling their creation, reading, and updating across different engines.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Key Features of Iceberg Materialized Views&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Metadata Structure&lt;/strong&gt;: A materialized view is realized as a combination of an Iceberg view (the &amp;quot;common view&amp;quot;) storing the query definition and a pointer to the precomputed data, and an Iceberg table (the &amp;quot;storage table&amp;quot;) holding the precomputed data. The storage table is marked with states like &amp;quot;fresh,&amp;quot; &amp;quot;stale,&amp;quot; or &amp;quot;invalid&amp;quot; based on its alignment with source table snapshots.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage Table State Management&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;fresh&lt;/strong&gt; state indicates the precomputed data is up-to-date.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;stale&lt;/strong&gt; state requires the query engine to decide between full or incremental refresh.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;invalid&lt;/strong&gt; state mandates a full refresh.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Refresh Mechanisms&lt;/strong&gt;: Materialized views can be refreshed through various methods, including event-driven triggers, query-time checks, scheduled refreshes, or manual operations. These methods ensure the precomputed data remains relevant to the underlying data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Optimization&lt;/strong&gt;: Queries can use precomputed data directly if it meets freshness criteria (e.g., the &lt;code&gt;materialization.data.max-staleness&lt;/code&gt; property). Otherwise, the query engine determines the next steps, such as refreshing the data or falling back to the original view definition.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Interoperability and Governance&lt;/strong&gt;: The shared metadata format supports lineage tracking and consistent states, making materialized views easy to manage and audit across engines.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Impact on the Iceberg Ecosystem&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Materialized views in Iceberg offer a way to optimize query performance while ensuring that optimizations are portable across systems. By providing a standard for metadata and refresh mechanisms, Iceberg hopes to enable organizations to harness the benefits of materialized views without being locked into specific query engines. This development will make Iceberg an even more compelling choice for building scalable, engine-agnostic data lakehouses.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/apache/iceberg/pull/11041&quot;&gt;Materialized View Pull Request&lt;/a&gt;&lt;/p&gt;
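The storage-table state logic above can be sketched as a small function: the state is derived from whether the precomputed data's recorded source snapshots still match the sources, plus a staleness budget in the spirit of the `materialization.data.max-staleness` property. The function signature and state derivation here are an illustrative simplification, not the proposal's exact semantics.

```python
# Sketch of deriving the storage table's state (fresh / stale / invalid)
# from recorded vs. current source-table snapshots plus a staleness
# budget. Illustrative only; the real proposal defines this in metadata.
from datetime import datetime, timedelta

def storage_state(recorded_snapshots: dict, current_snapshots: dict,
                  refreshed_at: datetime, max_staleness: timedelta) -> str:
    if recorded_snapshots.keys() != current_snapshots.keys():
        return "invalid"   # source set changed: full refresh required
    if recorded_snapshots != current_snapshots:
        return "stale"     # engine picks full or incremental refresh
    if datetime.now() - refreshed_at > max_staleness:
        return "stale"     # data matches but exceeds the staleness budget
    return "fresh"         # serve queries from the precomputed data
```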
&lt;h2&gt;4. Variant Data Format in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;The upcoming introduction of the &lt;strong&gt;variant data format&lt;/strong&gt; in Apache Iceberg marks a significant advancement in handling semi-structured data. While Iceberg already supports a JSON data format, the variant data type offers a more efficient and versatile approach to managing JSON-like data, aligning with the Spark variant format.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;How Variant Differs from JSON&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The variant data format is designed to provide a structured representation of semi-structured data, improving performance and usability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Typed Representation&lt;/strong&gt;: Unlike traditional JSON, which treats data as text, the variant format incorporates schema-aware types. This allows for faster processing and easier integration with analytical workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient Storage&lt;/strong&gt;: By leveraging columnar storage principles, variant data optimizes storage space and access patterns for semi-structured data, reducing the overhead associated with parsing and serializing JSON.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query Flexibility&lt;/strong&gt;: Variant enables advanced querying capabilities, such as filtering and aggregations, on semi-structured data without requiring extensive transformations or data flattening.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Benefits of the Variant Format&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Improved Performance&lt;/strong&gt;: By avoiding the need to repeatedly parse JSON strings, the variant format enables faster data access and manipulation, making it ideal for high-performance analytical queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better Interoperability&lt;/strong&gt;: With consensus on using the Spark variant format, this addition ensures compatibility across engines that support the same standard.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simplified Workflows&lt;/strong&gt;: Variant makes it easier to work with semi-structured data within Iceberg tables, allowing for more straightforward schema evolution and query optimizations.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/apache/iceberg/pull/10831&quot;&gt;Variant Data Format Pull Request&lt;/a&gt;&lt;/p&gt;
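The performance point above boils down to parsing once instead of on every read. The real variant format is a binary, columnar-friendly encoding; the sketch below only contrasts the two access patterns conceptually, using plain JSON as a stand-in.

```python
# Conceptual contrast: JSON-as-text re-parses the document on every
# field read, while a variant-style decode happens once and later reads
# are cheap lookups. This is a sketch, not the actual variant encoding.
import json

raw = '{"user": {"id": 42, "tags": ["a", "b"]}}'

def json_text_read(path: list):
    node = json.loads(raw)      # repeated parse on every access
    for key in path:
        node = node[key]
    return node

decoded = json.loads(raw)       # one-time decode ("variant-style")

def variant_read(path: list):
    node = decoded              # typed structure already in memory
    for key in path:
        node = node[key]
    return node
```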
&lt;h2&gt;5. Native Geospatial Data Type Support in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;The integration of geospatial data types into Apache Iceberg is poised to open up powerful capabilities for organizations managing location-based data. While geospatial data has long been supported by big data tools like GeoParquet, Apache Sedona, and GeoMesa, Iceberg&apos;s position as a central table format makes the addition of native geospatial support a natural evolution. Leveraging prior efforts such as Geolake and Havasu, this proposal aims to bring geospatial functionality into Iceberg without the need for project forks.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Proposed Features&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The geospatial extension for Iceberg will introduce:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Data Types&lt;/strong&gt;: Support for types like &lt;code&gt;POINT&lt;/code&gt;, &lt;code&gt;LINESTRING&lt;/code&gt;, and &lt;code&gt;POLYGON&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Expressions&lt;/strong&gt;: Functions such as &lt;code&gt;ST_COVERS&lt;/code&gt;, &lt;code&gt;ST_COVERED_BY&lt;/code&gt;, and &lt;code&gt;ST_INTERSECTS&lt;/code&gt; for spatial querying.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Partition Transforms&lt;/strong&gt;: Partitioning using geospatial transforms like &lt;code&gt;XZ2&lt;/code&gt; to optimize query filtering.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Sorting&lt;/strong&gt;: Sorting data with space-filling curves, such as the Hilbert curve, to enhance data locality and query efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spark Integration&lt;/strong&gt;: Built-in support for working with geospatial data in Spark.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Key Use Cases&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Table Creation with Geospatial Types&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE geom_table (geom GEOMETRY);
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Inserting Geospatial Data&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO geom_table VALUES (&apos;POINT(1 2)&apos;), (&apos;LINESTRING(1 2, 3 4)&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Querying with Geospatial Predicates&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM geom_table WHERE ST_COVERS(geom, ST_POINT(0.5, 0.5));
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Geospatial Partitioning&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE geom_table ADD PARTITION FIELD (xz2(geom));
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Optimized File Sorting for Geospatial Queries&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CALL rewrite_data_files(table =&amp;gt; &apos;geom_table&apos;, sort_order =&amp;gt; &apos;hilbert(geom)&apos;);
&lt;/code&gt;&lt;/pre&gt;
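The intuition behind space-filling-curve sorting (step 5 above) is that interleaving coordinate bits keeps spatially nearby points close together in sort order, so range scans touch fewer files. The sketch below uses a Z-order (Morton) index for simplicity; the proposal also mentions Hilbert curves, which improve on the same idea.

```python
# Sketch of a Z-order (Morton) index: interleave the bits of x and y so
# that spatially close points sort near each other. Illustrative only;
# Iceberg's proposal uses curves like Hilbert for file sorting.
def morton(x: int, y: int, bits: int = 8) -> int:
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return code

points = [(200, 200), (1, 0), (0, 1), (0, 0), (1, 1)]
# Sorting by the curve index clusters the four neighboring points
# together and pushes the far-away point to the end.
ordered = sorted(points, key=lambda p: morton(*p))
```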
&lt;h3&gt;Benefits&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Efficient Geospatial Analysis&lt;/strong&gt;: By natively supporting geospatial data types and operations, Iceberg will enable faster and more scalable location-based queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Query Optimization&lt;/strong&gt;: Partition transforms and spatial sorting will enhance filtering and reduce data scan overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Broad Ecosystem Integration&lt;/strong&gt;: With Spark integration and compatibility with geospatial standards like GeoParquet, Iceberg becomes a powerful tool for geospatial data management.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/apache/iceberg/issues/10260&quot;&gt;GeoSpatial Proposal&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;6. Apache Polaris Federated Catalogs&lt;/h2&gt;
&lt;p&gt;Apache Polaris is expanding its capabilities with the concept of &lt;strong&gt;federated catalogs&lt;/strong&gt;, allowing seamless connectivity to external catalogs such as Nessie, Gravitino, and Unity. This feature makes the tables in these external catalogs visible and queryable from a Polaris connection, streamlining Iceberg data federation within a single interface.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Current State&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;At present, Polaris supports &lt;strong&gt;read-only external catalogs&lt;/strong&gt;, enabling users to query and analyze data from connected catalogs without duplicating data or moving it between systems. This functionality simplifies data integration and allows users to leverage the strengths of multiple catalogs from a centralized Polaris environment.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Future Vision: Read/Write Federation&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;There is active discussion and interest within the community to extend this capability to &lt;strong&gt;read/write catalog federation&lt;/strong&gt;. With this enhancement, users will be able to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read&lt;/strong&gt; data from external catalogs as they currently do.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Write&lt;/strong&gt; data directly back to external catalogs, making updates, inserts, and schema modifications possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Key Benefits of Federated Catalogs&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Unified Data Access&lt;/strong&gt;: Query data across multiple catalogs without the need for extensive ETL processes or duplication.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Interoperability&lt;/strong&gt;: Leverage the unique features of external catalogs like Nessie and Unity directly within Polaris.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlined Workflows&lt;/strong&gt;: Enable read/write operations to external catalogs, reducing friction in workflows that span multiple systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Governance&lt;/strong&gt;: Centralize metadata and access controls while interacting with data stored in different catalogs.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;&lt;strong&gt;The Road Ahead&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The move toward read/write federation will make it easier for organizations to manage diverse data ecosystems. By bridging the gap between disparate catalogs, Polaris continues to simplify data management and empower users to unlock the full potential of their data.&lt;/p&gt;
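A facade pattern captures the current read-only federation model: reads are routed to whichever external catalog owns the table, while writes are rejected. The class and method names below are an illustrative sketch, not the Polaris API.

```python
# Sketch of read-only catalog federation: a facade routes reads to
# registered external catalogs and rejects writes, mirroring Polaris's
# current read-only external catalog support. Interface is hypothetical.
class ExternalCatalog:
    def __init__(self, name: str, tables: dict):
        self.name, self.tables = name, dict(tables)

    def load_table(self, table: str):
        return self.tables[table]

class FederatedCatalog:
    def __init__(self):
        self.externals = {}

    def register(self, catalog: ExternalCatalog):
        self.externals[catalog.name] = catalog

    def load_table(self, path: str):
        # Paths look like "catalog.table"; route to the owning catalog.
        catalog, table = path.split(".", 1)
        return self.externals[catalog].load_table(table)

    def write_table(self, path: str, rows: list):
        # Today: external catalogs are read-only; read/write federation
        # is the future vision discussed above.
        raise PermissionError("external catalogs are read-only")

polaris = FederatedCatalog()
polaris.register(ExternalCatalog("nessie", {"sales": ["row1"]}))
```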
&lt;h2&gt;7. Table Maintenance Service in Apache Polaris&lt;/h2&gt;
&lt;p&gt;A feature being discussed in the Apache Polaris community is a &lt;strong&gt;table maintenance service&lt;/strong&gt;, designed to streamline table optimization and maintenance workflows. This service would function as a notification system, broadcasting maintenance requests to subscribed tools, enabling automated and efficient table management.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;How It Could Work&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The table maintenance service would let users configure maintenance triggers based on specific conditions. For example, a user could set a table to be optimized every 10 snapshots. When that condition is met, the service would broadcast a notification to subscribed tools such as Dremio, Upsolver, or any other service that optimizes Iceberg tables.&lt;/p&gt;
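&lt;p&gt;As a sketch of what configuring such a trigger might look like, consider the DDL below. To be clear, no such syntax exists today; every keyword here is invented for illustration, since the service is still only a community discussion.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- HYPOTHETICAL syntax: the table maintenance service is still under
-- discussion, so no real DDL exists yet. This sketch only illustrates
-- the idea of a snapshot-count trigger broadcasting to subscribers.
ALTER TABLE sales.orders SET MAINTENANCE TRIGGER (
  CONDITION SNAPSHOT_COUNT &amp;gt;= 10,  -- fire once 10 new snapshots accumulate
  ACTION &apos;compact&apos;                  -- ask subscribed tools to compact the table
);
&lt;/code&gt;&lt;/pre&gt;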
&lt;h4&gt;&lt;strong&gt;Key Use Cases&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Automated Table Optimization&lt;/strong&gt;: Configure tables to trigger maintenance tasks, such as compaction or sorting, at predefined intervals or based on conditions like snapshot count.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-Tool Integration&lt;/strong&gt;: Seamlessly integrate with multiple tools in the ecosystem, enabling flexible and automated workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cadence Management&lt;/strong&gt;: Ensure maintenance tasks are performed on a schedule or event-driven basis, aligned with the table’s operational needs.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;&lt;strong&gt;Benefits&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced Operational Overhead&lt;/strong&gt;: Automate repetitive maintenance tasks, minimizing the need for manual intervention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Performance&lt;/strong&gt;: Regular maintenance ensures tables remain optimized for query performance and storage efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ecosystem Flexibility&lt;/strong&gt;: By supporting a wide range of subscribing tools, the service adapts to diverse data pipelines and optimization strategies.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;8. Catalog Versioning in Apache Polaris&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Catalog versioning&lt;/strong&gt;, a transformative feature currently available in the &lt;a href=&quot;https://www.projectnessie.org&quot;&gt;Nessie catalog&lt;/a&gt;, is under discussion for inclusion in the Apache Polaris ecosystem. Adding catalog versioning to Polaris would unlock a range of powerful capabilities, positioning Polaris as a unifying force for the most innovative ideas in the Iceberg catalog space.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;The Power of Catalog Versioning&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Catalog versioning provides a robust foundation for advanced data management scenarios by enabling:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Multi-Table Transactions&lt;/strong&gt;: Ensure atomic operations across multiple tables for consistent updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Table Rollbacks&lt;/strong&gt;: Revert changes across multiple tables to a consistent state, enhancing error recovery.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero-Copy Environments&lt;/strong&gt;: Create lightweight, zero-copy development or testing environments without duplicating data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Table Isolation&lt;/strong&gt;: Create a branch to isolate work on data without affecting the main branch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tagging and Versioning&lt;/strong&gt;: Mark specific states of the catalog for easy access, auditing, or rollback.&lt;/li&gt;
&lt;/ul&gt;
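&lt;p&gt;To make these capabilities concrete, this is roughly what a zero-copy, multi-table workflow looks like with Nessie&apos;s Spark SQL extensions today. If Polaris adopts catalog versioning, its syntax may well differ; the branch, catalog, and table names below are illustrative.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create an isolated, zero-copy branch of the entire catalog.
CREATE BRANCH etl_jan FROM main IN nessie;
USE REFERENCE etl_jan IN nessie;

-- Modify multiple tables on the branch without affecting main.
INSERT INTO nessie.sales.orders SELECT * FROM nessie.staging.orders;
DELETE FROM nessie.sales.customers WHERE status = &apos;inactive&apos;;

-- Atomically publish all of the above as one multi-table commit.
MERGE BRANCH etl_jan INTO main IN nessie;
&lt;/code&gt;&lt;/pre&gt;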
&lt;h4&gt;&lt;strong&gt;Proposed Integration with Polaris&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Discussions around bringing catalog versioning to Polaris also involve designing a new model that aligns with Polaris&apos; architecture. This integration could enable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified Catalog Management&lt;/strong&gt;: Allow users to manage table states and snapshots across all their data directly in Polaris.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Interoperability&lt;/strong&gt;: Combine Polaris&apos; governance features with Nessie&apos;s multi-table versioning, creating a comprehensive solution for data management.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Potential Impact&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Advanced Data Workflows&lt;/strong&gt;: Catalog versioning would enable Polaris users to orchestrate complex workflows with confidence and precision.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Collaboration&lt;/strong&gt;: Teams could work in parallel using isolated views of the catalog, fostering innovation without risk to production data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ecosystem Leadership&lt;/strong&gt;: By adopting catalog versioning, Polaris would become the definitive platform for managing Iceberg catalogs, consolidating the best ideas from the community.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If implemented, catalog versioning in Polaris would elevate its capabilities, making it an indispensable tool for organizations looking to modernize their data lakehouse operations.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Try Catalog Versioning on your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;9. Updates to Iceberg&apos;s Delete File Specification&lt;/h2&gt;
&lt;p&gt;Apache Iceberg’s innovative delete file specification has been central to enabling efficient upserts by managing record deletions with minimal performance overhead. Currently, Iceberg supports two types of delete files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Position Deletes&lt;/strong&gt;: Track the position of a deleted record in a data file.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Equality Deletes&lt;/strong&gt;: Track the values being deleted across multiple rows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While these mechanisms are effective, each comes with trade-offs. Position deletes can lead to high I/O costs when reconciling deletions during queries, while equality deletes, though fast to write, impose significant costs during reads and optimizations. Discussions in the Iceberg community propose enhancements to both approaches.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Proposed Changes to Position Deletes&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The key proposal is to transition position deletes from their current file-based storage to &lt;strong&gt;deletion vectors&lt;/strong&gt; within Puffin files. Puffin, a specification for structured metadata storage, allows for compact and efficient storage of additional data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Benefits of Storing Deletion Vectors in Puffin Files&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced I/O Costs&lt;/strong&gt;: Instead of opening multiple delete files, engines can read a single blob within a Puffin file, significantly improving read performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlined Metadata Access&lt;/strong&gt;: Puffin files consolidate metadata and auxiliary information, simplifying the reconciliation process.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Reimagining Equality Deletes for Streaming&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Another area of discussion is rethinking equality deletes to better suit streaming scenarios. The current design prioritizes fast writes but incurs steep costs for reading and optimizing. Possible enhancements include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Streaming-Optimized Delete Mechanisms&lt;/strong&gt;: Developing a model where deletes are reconciled incrementally in real-time, reducing read-time overhead.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid Approaches&lt;/strong&gt;: Combining aspects of position and equality deletes to balance the cost of writes, reads, and optimizations.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Impact of These Changes&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Improved Query Performance&lt;/strong&gt;: Faster reconciliation during queries, especially for workloads with high delete volumes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better Streaming Support&lt;/strong&gt;: Lower overhead for real-time processing scenarios, making Iceberg more viable for continuous data ingestion and updates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Scalability&lt;/strong&gt;: Reduced I/O during reconciliation improves scalability for large-scale datasets.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;10. General Availability of the Dremio Hybrid Catalog&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;Dremio Hybrid Catalog&lt;/strong&gt;, currently in private preview, is set to become generally available sometime in 2025. Built on the foundation of the Polaris catalog, this managed Iceberg catalog is tightly integrated into Dremio, offering a streamlined and feature-rich experience for managing data across cloud and on-prem environments.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Key Features of the Hybrid Catalog&lt;/strong&gt;&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Integrated Table Maintenance&lt;/strong&gt;: Automate table maintenance tasks such as compaction, cleanup, and optimization, ensuring that tables remain performant with minimal user intervention.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-Location Cataloging&lt;/strong&gt;: Seamlessly manage and catalog tables across diverse storage environments, including multiple cloud providers and on-premises storage solutions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Polaris-Based Capabilities&lt;/strong&gt;: Leverage the powerful features of the Polaris catalog, including RBAC, external catalogs, and potential catalog versioning (if implemented by Polaris).&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;&lt;strong&gt;Benefits of the Dremio Hybrid Catalog&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplified Data Management&lt;/strong&gt;: Provides a unified interface for managing Iceberg tables across different environments, reducing complexity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Performance&lt;/strong&gt;: Automated maintenance and cleanup ensure tables are always optimized for fast and efficient queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility and Scalability&lt;/strong&gt;: Supports hybrid architectures, allowing organizations to manage data wherever it resides without sacrificing control or performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Impact on the Iceberg Ecosystem&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;The general availability of the Dremio Hybrid Catalog will mark a significant milestone for organizations adopting Iceberg. By integrating Polaris&apos; advanced capabilities into a managed catalog, Dremio is poised to deliver a seamless and efficient solution for managing data lakehouse environments. This innovation underscores Dremio&apos;s commitment to making Iceberg a cornerstone of modern data management strategies.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;As we look ahead to 2025, the Apache Iceberg ecosystem is set to deliver groundbreaking advancements that will transform how organizations manage and analyze their data. From enhanced query optimization with scan planning endpoints and materialized views to broader support for geospatial and semi-structured data, Iceberg continues to push the boundaries of data lakehouse capabilities. Exciting developments like the Dremio Hybrid Catalog and updates to delete file specifications promise to make Iceberg even more efficient, scalable, and interoperable.&lt;/p&gt;
&lt;p&gt;These innovations highlight the vibrant community driving Apache Iceberg and the collective effort to address the evolving needs of modern data platforms. Whether you&apos;re leveraging Iceberg for its robust cataloging features, seamless multi-cloud support, or cutting-edge query capabilities, 2025 is shaping up to be a year of remarkable growth and opportunity. Stay tuned as Apache Iceberg continues to lead the way in open data lakehouse technology, empowering organizations to unlock the full potential of their data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberg-developments&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Deep Dive into Dremio&apos;s File-based Auto Ingestion into Apache Iceberg Tables</title><link>https://iceberglakehouse.com/posts/2024-11-deep-dive-auto-ingest-dremio-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-deep-dive-auto-ingest-dremio-iceberg/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Fri, 15 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Manually orchestrating data pipelines to handle ever-increasing volumes of data can be both time-consuming and error-prone. Enter &lt;strong&gt;Dremio Auto-Ingest&lt;/strong&gt;, a game-changing feature that simplifies the process of loading data into &lt;strong&gt;Apache Iceberg&lt;/strong&gt; tables.&lt;/p&gt;
&lt;p&gt;With Auto-Ingest, you can create event-driven pipelines that automatically respond to changes in your object storage systems, such as new files being uploaded to Amazon S3. This approach eliminates the need for constant manual intervention, enabling real-time or near-real-time updates to your Iceberg tables. Whether you’re ingesting structured CSV data, semi-structured JSON files, or compact Parquet formats, Dremio Auto-Ingest ensures a seamless, reliable pipeline.&lt;/p&gt;
&lt;p&gt;But why choose Auto-Ingest over traditional methods? The answer lies in its ability to handle ingestion challenges like deduplication, error handling, and custom formatting, all while integrating smoothly with modern cloud infrastructure.&lt;/p&gt;
&lt;h2&gt;Understanding Auto-Ingest for Apache Iceberg&lt;/h2&gt;
&lt;p&gt;To fully appreciate the power of Dremio Auto-Ingest, it’s important to understand the core components and how they work together. At its heart, Auto-Ingest is designed to create a seamless pipeline that transfers files from object storage into &lt;strong&gt;Apache Iceberg tables&lt;/strong&gt; with minimal manual intervention. Let’s break it down.&lt;/p&gt;
&lt;h3&gt;What is a Pipe Object?&lt;/h3&gt;
&lt;p&gt;The &lt;strong&gt;pipe object&lt;/strong&gt; is the central feature enabling Auto-Ingest. Think of it as a pre-configured connection between your cloud storage and an Iceberg table. The pipe listens for events, such as the arrival of a new file, and automatically triggers the ingestion process. This eliminates the need for periodic manual data loads or complex batch scripts.&lt;/p&gt;
&lt;p&gt;Here’s what makes a pipe object powerful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Notification Provider&lt;/strong&gt;: Specifies the mechanism for event detection, such as AWS SQS for Amazon S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Queue Reference&lt;/strong&gt;: Points to the event queue where file changes are registered.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deduplication&lt;/strong&gt;: Ensures no duplicate files are ingested, even if files are re-uploaded or processed multiple times.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible Configuration&lt;/strong&gt;: Allows you to define file formats, custom settings, and error-handling rules.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How Does Auto-Ingest Work?&lt;/h3&gt;
&lt;p&gt;Auto-Ingest leverages an &lt;strong&gt;event-driven model&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A file is added or updated in the storage location (e.g., an S3 bucket).&lt;/li&gt;
&lt;li&gt;A notification is sent to the queue specified in the pipe configuration.&lt;/li&gt;
&lt;li&gt;The pipe detects the notification and triggers the ingestion process using the &lt;code&gt;COPY INTO&lt;/code&gt; command to move data into the Iceberg table.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This approach is both reactive and efficient, ensuring that your data remains fresh without the overhead of constant polling or manual triggers.&lt;/p&gt;
&lt;h3&gt;Benefits of Using Auto-Ingest&lt;/h3&gt;
&lt;p&gt;Why choose Auto-Ingest for your Iceberg tables? Here are some key benefits:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Updates&lt;/strong&gt;: Ensure your Iceberg tables always reflect the latest data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simplified Pipeline Management&lt;/strong&gt;: Replace complex, custom ingestion scripts with a single declarative configuration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Quality Assurance&lt;/strong&gt;: Built-in deduplication and error-handling mechanisms help maintain clean, accurate datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Auto-Ingest works seamlessly with cloud-native object storage, enabling pipelines that scale with your data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By combining the power of Apache Iceberg with Dremio’s Auto-Ingest, you can build modern, efficient pipelines that support both analytical and operational workloads with ease.&lt;/p&gt;
&lt;h2&gt;Step-by-Step Guide: Setting Up Auto-Ingest&lt;/h2&gt;
&lt;p&gt;By following these steps, you can automate data ingestion from cloud storage and ensure seamless integration with your data lakehouse.&lt;/p&gt;
&lt;h3&gt;1. Prerequisites&lt;/h3&gt;
&lt;p&gt;Before creating an Auto-Ingest pipeline, ensure the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cloud Storage Setup&lt;/strong&gt;: Configure your storage location (e.g., Amazon S3) as a source in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Notification Service&lt;/strong&gt;: Set up an event notification provider, such as AWS SQS, to monitor changes in the storage location.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg Table&lt;/strong&gt;: Ensure the target table exists and is properly configured in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Supported File Formats&lt;/strong&gt;: Verify that your files are in one of the supported formats: CSV, JSON, or Parquet.&lt;/li&gt;
&lt;/ul&gt;
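&lt;p&gt;For the examples that follow, the target Iceberg table is assumed to already exist. A minimal version (the table and column names here are placeholders, so adjust them to match your files) can be created with standard Dremio SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Illustrative target table; adjust the columns to match your data.
CREATE TABLE IF NOT EXISTS sales_data (
  order_id   BIGINT,
  order_date DATE,
  amount     DOUBLE
);
&lt;/code&gt;&lt;/pre&gt;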
&lt;h3&gt;2. Creating a Pipe Object&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;CREATE PIPE&lt;/code&gt; command is the foundation of the Auto-Ingest setup. It connects your storage location to an Iceberg table, specifying ingestion parameters.&lt;/p&gt;
&lt;h4&gt;Syntax&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE [ IF NOT EXISTS ] &amp;lt;pipe_name&amp;gt;
  [ DEDUPE_LOOKBACK_PERIOD &amp;lt;number_of_days&amp;gt; ]
  NOTIFICATION_PROVIDER &amp;lt;notification_provider&amp;gt;
  NOTIFICATION_QUEUE_REFERENCE &amp;lt;notification_queue_ref&amp;gt;
  AS COPY INTO &amp;lt;table_name&amp;gt;
    [ AT BRANCH &amp;lt;branch_name&amp;gt; ]
    FROM &apos;@&amp;lt;storage_location_name&amp;gt;&apos;
    FILE_FORMAT &apos;&amp;lt;format&amp;gt;&apos;
    [(&amp;lt;format_options&amp;gt;)]
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Key Parameters&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;DEDUPE_LOOKBACK_PERIOD:&lt;/code&gt;&lt;/strong&gt; Defines the time window (in days) for deduplication. Default is 14 days.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;NOTIFICATION_PROVIDER:&lt;/code&gt;&lt;/strong&gt; Specifies the event notification system, such as AWS_SQS for Amazon S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;NOTIFICATION_QUEUE_REFERENCE:&lt;/code&gt;&lt;/strong&gt; Points to the notification queue (e.g., the ARN of an SQS queue).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;COPY INTO:&lt;/code&gt;&lt;/strong&gt; Specifies the target Iceberg table and optional branch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;@&amp;lt;storage_location_name&amp;gt;:&lt;/code&gt;&lt;/strong&gt; Refers to the source storage location configured in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Format Options:&lt;/strong&gt; Custom configurations for CSV, JSON, or Parquet files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Examples&lt;/h4&gt;
&lt;p&gt;Basic Pipe for CSV Files&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE my_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:my-queue&apos;
  AS COPY INTO sales_data
    FROM &apos;@s3_source/data_folder&apos;
    FILE_FORMAT &apos;csv&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pipe with Deduplication&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE deduped_pipe
  DEDUPE_LOOKBACK_PERIOD 7
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:dedupe-queue&apos;
  AS COPY INTO analytics_table
    FROM &apos;@s3_source/analytics&apos;
    FILE_FORMAT &apos;parquet&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;3. Customizing File Formats&lt;/h3&gt;
&lt;p&gt;Dremio allows you to tailor the ingestion process based on your file type and data requirements. Here’s how to configure each format:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CSV Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Delimiters (&lt;code&gt;FIELD_DELIMITER&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Null handling (&lt;code&gt;EMPTY_AS_NULL&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Header extraction (&lt;code&gt;EXTRACT_HEADER&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error handling (&lt;code&gt;ON_ERROR&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;JSON Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Date and time formatting (&lt;code&gt;DATE_FORMAT&lt;/code&gt;, &lt;code&gt;TIME_FORMAT&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Null replacements (&lt;code&gt;NULL_IF&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Parquet Options:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplified setup with error handling (&lt;code&gt;ON_ERROR&lt;/code&gt;)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example for CSV with custom settings:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE custom_csv_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:csv-queue&apos;
  AS COPY INTO transactions_table
    FROM &apos;@s3_source/csv_data&apos;
    FILE_FORMAT &apos;csv&apos;
    (FIELD_DELIMITER &apos;|&apos;, EXTRACT_HEADER &apos;true&apos;, ON_ERROR &apos;skip_file&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Error Handling&lt;/h3&gt;
&lt;p&gt;Errors during ingestion are inevitable, but Dremio’s Auto-Ingest provides robust handling options:&lt;/p&gt;
&lt;h4&gt;ON_ERROR Options:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;abort:&lt;/strong&gt; Stops the process at the first error (default for JSON and Parquet).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;continue:&lt;/strong&gt; Skips faulty rows but processes valid ones (CSV only).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;skip_file:&lt;/strong&gt; Skips the entire file if any error occurs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE error_handling_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:error-queue&apos;
  AS COPY INTO error_log_table
    FROM &apos;@s3_source/faulty_data&apos;
    FILE_FORMAT &apos;json&apos;
    (ON_ERROR &apos;skip_file&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With your pipe configured, Dremio automatically monitors your storage for changes and ingests new files into the target Iceberg table. This setup provides a scalable, reliable pipeline for all your data ingestion needs.&lt;/p&gt;
&lt;h2&gt;Real-World Use Cases for Dremio Auto-Ingest&lt;/h2&gt;
&lt;p&gt;Dremio’s Auto-Ingest for Apache Iceberg tables offers significant advantages across a variety of data engineering scenarios. Whether you’re building real-time pipelines or automating batch data processing, Auto-Ingest provides the flexibility and automation necessary to simplify workflows. Here are some real-world use cases to illustrate its impact.&lt;/p&gt;
&lt;hr&gt;
&lt;h3&gt;1. &lt;strong&gt;Streaming Data Pipelines&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: A smart city project collects real-time sensor data (e.g., temperature, traffic flow, air quality) from IoT devices. This data is stored as JSON files in an S3 bucket, and analytics teams require instant updates in their data warehouse for real-time dashboards.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use Dremio Auto-Ingest with a pipe object that listens to the S3 bucket.&lt;/li&gt;
&lt;li&gt;Configure the pipe to process JSON files and load them into an Iceberg table.&lt;/li&gt;
&lt;li&gt;Leverage &lt;code&gt;ON_ERROR&lt;/code&gt; settings to gracefully handle malformed sensor data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example Configuration&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE streaming_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-west-2:123456789012:sensor-queue&apos;
  AS COPY INTO smart_city.sensor_data
    FROM &apos;@iot_source/live_data&apos;
    FILE_FORMAT &apos;json&apos;
    (ON_ERROR &apos;skip_file&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Real-time dashboards reflect the latest sensor data without manual intervention.&lt;/li&gt;
&lt;li&gt;Faulty data is isolated for later analysis, ensuring system stability.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Batch Data Processing&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A retail company ingests daily sales logs in CSV format from its regional branches into a central data lake. These logs must be processed nightly and appended to a historical sales Iceberg table.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Configure an Auto-Ingest pipe to monitor the S3 bucket where sales logs are uploaded.&lt;/li&gt;
&lt;li&gt;Set a deduplication lookback period to avoid reprocessing files if logs are accidentally re-uploaded.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE daily_batch_pipe
  DEDUPE_LOOKBACK_PERIOD 7
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:sales-queue&apos;
  AS COPY INTO retail.sales_history
    FROM &apos;@s3_source/sales_logs&apos;
    FILE_FORMAT &apos;csv&apos;
    (EXTRACT_HEADER &apos;true&apos;, EMPTY_AS_NULL &apos;true&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Daily sales logs are automatically appended to the historical table.&lt;/li&gt;
&lt;li&gt;The deduplication window ensures no duplicate records are ingested.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Data Lakehouse Modernization&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A financial services firm is transitioning from a traditional data warehouse to a modern lakehouse architecture. The team wants to automate ingestion from various sources (e.g., transactional Parquet files and JSON logs) into Iceberg tables for unified analytics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use multiple Auto-Ingest pipes to handle ingestion for different file types and schemas.&lt;/li&gt;
&lt;li&gt;Configure branch-specific ingestion for staging and production environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Parquet Transactions:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE transactions_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-2:123456789012:transactions-queue&apos;
  AS COPY INTO finance.transactions
    FROM &apos;@finance_source/transactions&apos;
    FILE_FORMAT &apos;parquet&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;JSON Application Logs:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE logs_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-2:123456789012:logs-queue&apos;
  AS COPY INTO finance.app_logs
    FROM &apos;@logs_source/application&apos;
    FILE_FORMAT &apos;json&apos;
    (DATE_FORMAT &apos;YYYY-MM-DD&apos;, TIME_FORMAT &apos;HH24:MI:SS&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Unified, structured Iceberg tables ready for analytical queries.&lt;/li&gt;
&lt;li&gt;Improved agility with automated pipelines for different data sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Event-Driven Reporting&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A marketing team tracks user engagement metrics (e.g., clicks, time on site, purchases) stored as CSV files in real-time. Reports must be updated immediately after new data arrives.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Use an Auto-Ingest pipe with an AWS_SQS notification provider to ensure new engagement files are ingested as soon as they are uploaded.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example Configuration:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE engagement_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-west-1:123456789012:engagement-queue&apos;
  AS COPY INTO marketing.user_engagement
    FROM &apos;@engagement_source/metrics&apos;
    FILE_FORMAT &apos;csv&apos;
    (FIELD_DELIMITER &apos;,&apos;, EXTRACT_HEADER &apos;true&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Outcome:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Marketing reports are updated in near-real-time, enabling faster decision-making.&lt;/li&gt;
&lt;li&gt;Automated ingestion removes the need for manual ETL processes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These use cases showcase how Dremio Auto-Ingest can be a versatile and powerful tool for a wide range of data engineering challenges. Whether your focus is on real-time data processing, batch workflows, or transitioning to a lakehouse architecture, Auto-Ingest simplifies and enhances your pipeline capabilities.&lt;/p&gt;
&lt;h2&gt;Best Practices and Considerations for Dremio Auto-Ingest&lt;/h2&gt;
&lt;p&gt;To get the most out of Dremio Auto-Ingest for Apache Iceberg tables, it&apos;s essential to follow best practices and understand key considerations. These guidelines will help ensure your ingestion pipelines are reliable, efficient, and optimized for performance.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Optimize Deduplication Settings&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;What It Does&lt;/strong&gt;: The &lt;code&gt;DEDUPE_LOOKBACK_PERIOD&lt;/code&gt; parameter ensures that duplicate files (e.g., files with the same name uploaded multiple times) are not ingested repeatedly.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set an appropriate lookback period based on your ingestion frequency:
&lt;ul&gt;
&lt;li&gt;For high-frequency updates (e.g., hourly ingestion), a shorter period (1–3 days) is sufficient.&lt;/li&gt;
&lt;li&gt;For batch workflows with periodic reuploads, a longer window (7–14 days) may be needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Avoid setting the period to &lt;code&gt;0&lt;/code&gt; unless you are certain duplicates are not an issue, as it disables deduplication.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE deduped_pipe
  DEDUPE_LOOKBACK_PERIOD 7
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:dedupe-queue&apos;
  AS COPY INTO my_table
    FROM &apos;@s3_source/folder&apos;
    FILE_FORMAT &apos;json&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Organize Storage for Better Performance&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Properly structured storage locations improve ingestion speed and reduce processing overhead.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use folder-based organization in your storage buckets (e.g., &lt;code&gt;/year/month/day/&lt;/code&gt;) for easier file management and regex-based ingestion.&lt;/li&gt;
&lt;li&gt;Keep related files in the same folder to avoid ingesting unrelated data by mistake.&lt;/li&gt;
&lt;li&gt;Avoid deeply nested directory structures, as they can slow down file scanning.&lt;/li&gt;
&lt;/ul&gt;
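&lt;p&gt;For instance, a date-partitioned layout (bucket and file names here are purely illustrative) keeps each day&apos;s files isolated and easy to target with a specific folder path or regex:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;s3://sales-data/transactions/2024/11/01/part-0001.parquet
s3://sales-data/transactions/2024/11/01/part-0002.parquet
s3://sales-data/transactions/2024/11/02/part-0001.parquet
&lt;/code&gt;&lt;/pre&gt;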
&lt;h3&gt;3. Choose the Right File Format&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Impact of File Format:&lt;/strong&gt; Different file formats affect storage size, query performance, and ingestion speed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use Parquet for columnar storage and analytics-heavy workloads due to its efficient storage and compression.&lt;/li&gt;
&lt;li&gt;Opt for CSV or JSON for semi-structured data but ensure proper formatting (e.g., consistent delimiters, headers, and escaping).&lt;/li&gt;
&lt;li&gt;Test ingestion performance with small sample files before committing to large-scale pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Leverage Error Handling Options&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Errors during ingestion can interrupt pipelines or lead to data inconsistencies.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;ON_ERROR &apos;skip_file&apos;&lt;/code&gt; to bypass files with errors and prevent pipeline interruptions.&lt;/li&gt;
&lt;li&gt;Regularly monitor the &lt;code&gt;sys.copy_errors_history&lt;/code&gt; table for ingestion errors and address recurring issues.&lt;/li&gt;
&lt;li&gt;For non-critical pipelines, consider &lt;code&gt;ON_ERROR &apos;continue&apos;&lt;/code&gt; (CSV only) to process valid rows even if some are faulty.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE error_handling_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-east-1:123456789012:error-queue&apos;
  AS COPY INTO my_table
    FROM &apos;@s3_source/folder&apos;
    FILE_FORMAT &apos;csv&apos;
    (ON_ERROR &apos;continue&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. Monitor and Troubleshoot Pipelines&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Monitoring Tools:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;System Tables:&lt;/strong&gt; Query &lt;code&gt;sys.copy_errors_history&lt;/code&gt; to review errors during ingestion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Job Logs:&lt;/strong&gt; Check job logs in Dremio for detailed error messages and ingestion stats.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Common Troubleshooting Tips:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Notification Issues:&lt;/strong&gt; Ensure the SQS queue ARN matches the one specified in the &lt;code&gt;NOTIFICATION_QUEUE_REFERENCE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Format Mismatches:&lt;/strong&gt; Double-check that the specified file format aligns with the actual file type (e.g., don’t label a Parquet file as CSV).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deduplication Failures:&lt;/strong&gt; Verify that the deduplication period is set correctly and files aren’t inadvertently re-ingested due to naming conflicts.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Optimize Regex and File Selection&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Using overly broad regex patterns or processing unnecessary files can impact pipeline performance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Write regex patterns that are as specific as possible to match only the files you need.&lt;/li&gt;
&lt;li&gt;Avoid processing large directories unless required. Use the &lt;code&gt;FILES&lt;/code&gt; clause or specific folder paths to limit scope.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE PIPE regex_pipe
  NOTIFICATION_PROVIDER AWS_SQS
  NOTIFICATION_QUEUE_REFERENCE &apos;arn:aws:sqs:us-west-2:123456789012:regex-queue&apos;
  AS COPY INTO my_table
    FROM &apos;@s3_source/folder&apos;
    REGEX &apos;^2024/11/.*\.csv&apos;
    FILE_FORMAT &apos;csv&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;7. Plan for Schema Evolution&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Iceberg tables support schema evolution, but it’s crucial to manage changes thoughtfully to avoid ingestion failures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Test schema changes in a staging environment before applying them to production pipelines.&lt;/li&gt;
&lt;li&gt;Use Iceberg’s branching capabilities to isolate schema updates during development.&lt;/li&gt;
&lt;li&gt;Validate data types and formats in source files to avoid mismatches with the target table schema.&lt;/li&gt;
&lt;/ul&gt;
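&lt;p&gt;As a sketch of the branching workflow described above, assuming a Nessie-backed catalog named &lt;code&gt;nessie&lt;/code&gt; and a hypothetical table (exact branch syntax can vary by Dremio version):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create and switch to an isolated branch for the schema change
CREATE BRANCH schema_update IN nessie;
USE BRANCH schema_update IN nessie;

-- Apply and validate the change without touching production readers
ALTER TABLE nessie.finance.transactions ADD COLUMNS (merchant_id VARCHAR);

-- Once validated, merge the change back into main
MERGE BRANCH schema_update INTO main IN nessie;
&lt;/code&gt;&lt;/pre&gt;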
&lt;h3&gt;8. Integrate with Data Lakehouse Workflows&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Why It Matters:&lt;/strong&gt; Auto-Ingest simplifies transitioning to a lakehouse architecture, but aligning with broader workflows ensures smooth integration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practices:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Combine Auto-Ingest with Dremio’s SQL-based querying to enable seamless analytics on ingested data.&lt;/li&gt;
&lt;li&gt;Use Iceberg’s time-travel feature to track historical changes and validate pipeline performance over time.&lt;/li&gt;
&lt;/ul&gt;
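&lt;p&gt;For example, a time-travel query against a hypothetical table might look like this (Dremio&apos;s &lt;code&gt;AT&lt;/code&gt; syntax; the table name and timestamp are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Compare current row counts against a historical point in time
SELECT COUNT(*) FROM marketing.user_engagement
AT TIMESTAMP &apos;2024-11-01 00:00:00.000&apos;;
&lt;/code&gt;&lt;/pre&gt;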
&lt;p&gt;By following these best practices and considerations, you can ensure your Dremio Auto-Ingest pipelines are robust, efficient, and well-suited to your data engineering needs. These guidelines will help you avoid common pitfalls and fully leverage the power of automated ingestion for Apache Iceberg tables.&lt;/p&gt;
&lt;h2&gt;Troubleshooting and Debugging Auto-Ingest Pipelines&lt;/h2&gt;
&lt;p&gt;Even with a robust Auto-Ingest setup, you may encounter issues during the ingestion process. Dremio’s system tables, such as &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt;, provide detailed insights into ingestion errors, making it easier to diagnose and resolve problems. This section outlines common issues and how to effectively use the system table to debug your pipelines.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Common Issues and Resolutions&lt;/strong&gt;&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;Notification Configuration Problems&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The pipe does not respond to new files being uploaded.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Verify the &lt;code&gt;NOTIFICATION_PROVIDER&lt;/code&gt; is configured correctly (e.g., &lt;code&gt;AWS_SQS&lt;/code&gt; for S3).&lt;/li&gt;
&lt;li&gt;Ensure the &lt;code&gt;NOTIFICATION_QUEUE_REFERENCE&lt;/code&gt; points to the correct ARN of your event notification queue.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;File Format Mismatch&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The pipeline fails with file parsing errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Double-check that the &lt;code&gt;FILE_FORMAT&lt;/code&gt; in your pipe configuration matches the actual format of the uploaded files.&lt;/li&gt;
&lt;li&gt;Validate format-specific options (e.g., delimiter, null handling) for correctness.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Partial or Skipped File Loads&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: Some files are partially loaded or skipped entirely.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Resolution&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Use the &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table to identify problematic files and the reasons for rejection.&lt;/li&gt;
&lt;li&gt;Adjust error-handling options (&lt;code&gt;ON_ERROR&lt;/code&gt;) in your pipe to match your tolerance for bad records.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Using the &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; Table&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table logs detailed information about &lt;code&gt;COPY INTO&lt;/code&gt; jobs where records were rejected due to parsing or schema issues. This includes jobs configured with &lt;code&gt;ON_ERROR &apos;continue&apos;&lt;/code&gt; or &lt;code&gt;ON_ERROR &apos;skip_file&apos;&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;&lt;strong&gt;Key Columns in the Table&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;executed_at&lt;/code&gt;&lt;/strong&gt;: The timestamp when the job was executed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;job_id&lt;/code&gt;&lt;/strong&gt;: The unique identifier of the &lt;code&gt;COPY INTO&lt;/code&gt; job.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;table_name&lt;/code&gt;&lt;/strong&gt;: The target Iceberg table for the job.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;user_name&lt;/code&gt;&lt;/strong&gt;: The username of the individual who ran the job.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;file_path&lt;/code&gt;&lt;/strong&gt;: The path of the file with rejected records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;file_state&lt;/code&gt;&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;PARTIALLY_LOADED&lt;/code&gt;: Some records were loaded, but others were rejected.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SKIPPED&lt;/code&gt;: No records were loaded due to file-level errors.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;records_loaded_count&lt;/code&gt;&lt;/strong&gt;: The number of successfully ingested records.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;records_rejected_count&lt;/code&gt;&lt;/strong&gt;: The number of records rejected due to errors.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;Example Query: Identifying Problematic Files&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;To view details about rejected files for a specific table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT executed_at, job_id, file_path, file_state, records_rejected_count
FROM SYS.COPY_ERRORS_HISTORY
WHERE table_name = &apos;my_table&apos;
ORDER BY executed_at DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;This query highlights:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When ingestion errors occurred.&lt;/li&gt;
&lt;li&gt;Which files were affected.&lt;/li&gt;
&lt;li&gt;Whether files were partially loaded or skipped.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Drilling into Error Details&lt;/h3&gt;
&lt;p&gt;Once you identify a problematic job using its &lt;code&gt;job_id&lt;/code&gt;, you can use the &lt;code&gt;copy_errors()&lt;/code&gt; function to extract detailed error information.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example: Retrieving Error Details&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT *
FROM copy_errors(&apos;1aacb195-ca94-ec4c-2b01-ecddac81a900&apos;, &apos;my_table&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query provides granular information about errors encountered during the ingestion process for the specified job.&lt;/p&gt;
&lt;h3&gt;4. Best Practices for Debugging&lt;/h3&gt;
&lt;h4&gt;Proactive Monitoring&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Regularly query the &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table to track ingestion health.&lt;/li&gt;
&lt;li&gt;Set alerts for high &lt;code&gt;records_rejected_count&lt;/code&gt; values to identify recurring issues.&lt;/li&gt;
&lt;/ul&gt;
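&lt;p&gt;A simple health-check query along these lines (the time window and any alert threshold are up to you) can feed a scheduled alerting job:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Tables with rejected records in the last 24 hours
SELECT table_name,
       SUM(records_rejected_count) AS total_rejected
FROM SYS.COPY_ERRORS_HISTORY
WHERE executed_at &amp;gt; TIMESTAMPADD(DAY, -1, CURRENT_TIMESTAMP)
GROUP BY table_name
ORDER BY total_rejected DESC;
&lt;/code&gt;&lt;/pre&gt;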
&lt;h4&gt;Validate Source Data&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Audit source files for schema inconsistencies or formatting errors.&lt;/li&gt;
&lt;li&gt;Ensure files match the expected format (e.g., proper delimiters for CSV).&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Tuning Error Handling&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;ON_ERROR &apos;skip_file&apos;&lt;/code&gt; for critical pipelines where partial loads are unacceptable.&lt;/li&gt;
&lt;li&gt;Opt for &lt;code&gt;ON_ERROR &apos;continue&apos;&lt;/code&gt; in cases where maximum data recovery is desired, especially for CSV files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Housekeeping the System Table&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table can grow significantly over time. Manage its size using these configuration keys:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;dremio.system_iceberg_tables.record_lifespan_in_millis&lt;/code&gt;: How long records are retained, in milliseconds (the default corresponds to 7 days).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dremio.system_iceberg_tables.housekeeping_thread_frequency_in_millis&lt;/code&gt;: How frequently old records are removed, in milliseconds (the default corresponds to once per day).&lt;/li&gt;
&lt;/ul&gt;
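&lt;p&gt;These support keys can typically be adjusted with &lt;code&gt;ALTER SYSTEM&lt;/code&gt; (availability and exact syntax may vary by Dremio version and edition); for example, to keep roughly three days of history:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- 3 days expressed in milliseconds: 3 * 24 * 60 * 60 * 1000
ALTER SYSTEM SET &amp;quot;dremio.system_iceberg_tables.record_lifespan_in_millis&amp;quot; = 259200000;
&lt;/code&gt;&lt;/pre&gt;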
&lt;h3&gt;5. Common Query Patterns for Debugging&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Find Recently Skipped Files&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT file_path, file_state, records_rejected_count
FROM SYS.COPY_ERRORS_HISTORY
WHERE file_state = &apos;SKIPPED&apos;
ORDER BY executed_at DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Analyze Partially Loaded Files&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT file_path, records_loaded_count, records_rejected_count
FROM SYS.COPY_ERRORS_HISTORY
WHERE file_state = &apos;PARTIALLY_LOADED&apos;
ORDER BY executed_at DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By leveraging the &lt;code&gt;SYS.COPY_ERRORS_HISTORY&lt;/code&gt; table and related debugging tools, you can effectively monitor and resolve issues in your Auto-Ingest pipelines. These capabilities ensure your pipelines are resilient and capable of handling a wide variety of data ingestion scenarios with minimal disruption.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Dremio Auto-Ingest for Apache Iceberg tables brings a new level of automation and simplicity to data ingestion workflows. By leveraging event-driven pipelines, you can reduce manual intervention, ensure data freshness, and streamline the integration of your object storage systems with Iceberg tables.&lt;/p&gt;
&lt;p&gt;From real-time updates to batch processing, Auto-Ingest handles diverse use cases with ease, offering powerful features like deduplication, error handling, and format-specific customization. By following best practices, monitoring your pipelines, and troubleshooting effectively, you can create reliable and efficient data ingestion workflows that scale with your business needs.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re modernizing your data lakehouse architecture or building advanced analytics pipelines, Dremio Auto-Ingest is a must-have tool to unlock the full potential of your data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=autoingestdremio&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Intro to SQL using Apache Iceberg and Dremio</title><link>https://iceberglakehouse.com/posts/2024-11-intro-to-sql-with-dremio-and-apache-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-intro-to-sql-with-dremio-and-apache-iceberg/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Fri, 08 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;SQL (Structured Query Language) has long been the standard for interacting with data, providing a powerful and accessible language for data querying and manipulation. However, traditional data warehouses and databases often fall short when dealing with the scale and flexibility demanded by modern data workloads.&lt;/p&gt;
&lt;p&gt;This is where Apache Iceberg and Dremio come in. Apache Iceberg is an open table format designed for large-scale data lakes, enabling reliable data management with features like ACID transactions, schema evolution, and time-travel. Iceberg brings structure and governance to data lakes, making them more capable of handling enterprise data needs. Dremio, on the other hand, is a data lakehouse platform that brings SQL querying capabilities to data lakes, providing a unified interface to query and analyze data across various sources.&lt;/p&gt;
&lt;p&gt;By the end of this tutorial, you&apos;ll understand the basics of SQL in Dremio and how to perform essential data operations with Apache Iceberg tables.&lt;/p&gt;
&lt;h2&gt;What is SQL, Apache Iceberg, and Dremio, and Why They Matter&lt;/h2&gt;
&lt;h3&gt;What is SQL?&lt;/h3&gt;
&lt;p&gt;SQL, or Structured Query Language, is a language specifically designed for managing and querying data in relational databases. Its versatility and power make it ideal for a wide range of data operations, including data extraction, aggregation, and transformation. SQL&apos;s widespread use in data analysis and reporting has made it a cornerstone in the world of data management.&lt;/p&gt;
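&lt;p&gt;As a taste of the kind of statement this tutorial builds toward (table and column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Total sales per region, largest first
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC;
&lt;/code&gt;&lt;/pre&gt;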
&lt;h3&gt;What is Apache Iceberg?&lt;/h3&gt;
&lt;p&gt;Apache Iceberg is an open-source table format that brings structure and governance to data lakes. Designed with scalability in mind, Iceberg offers features such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACID Transactions&lt;/strong&gt;: Ensuring data consistency across large datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time-Travel&lt;/strong&gt;: Querying historical versions of data, which is essential for audits and analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Modifying table schemas without disrupting ongoing operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Iceberg’s approach to data management provides a reliable foundation for large-scale analytics and data processing, making it a valuable component in any data lakehouse architecture.&lt;/p&gt;
&lt;h3&gt;What is Dremio?&lt;/h3&gt;
&lt;p&gt;Dremio is a data lakehouse platform that unifies data access, enabling users to perform SQL queries across data lakes, warehouses, and other data sources through a single, user-friendly interface. Dremio simplifies data analytics by providing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified Semantic Layer&lt;/strong&gt;: Organizes and documents datasets for easier discovery and analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Support for Apache Iceberg&lt;/strong&gt;: Seamless integration with Iceberg tables, allowing users to query and manipulate large datasets with SQL.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Versioning and Governance&lt;/strong&gt;: Through integrations with Nessie, Dremio supports versioned, Git-like data management, making it ideal for maintaining data accuracy and history.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why They Matter Together&lt;/h3&gt;
&lt;p&gt;When combined, SQL, Apache Iceberg, and Dremio offer a powerful solution for data management and analysis. SQL provides the querying foundation, Apache Iceberg delivers the scalability and governance, and Dremio brings everything together in a streamlined, accessible environment. For businesses looking to harness the full potential of their data lakes, this stack delivers efficient querying, advanced data governance, and high performance.&lt;/p&gt;
&lt;p&gt;Let&apos;s set up an environment to work with these tools and walk through practical examples of using SQL with Apache Iceberg tables in Dremio.&lt;/p&gt;
&lt;h2&gt;Setting Up an Environment with Dremio, Nessie, and MinIO with Docker Compose&lt;/h2&gt;
&lt;p&gt;To start working with Apache Iceberg and Dremio, we&apos;ll set up a local environment using Docker Compose, a tool that allows us to configure and manage multiple containers with a single file. In this setup, we&apos;ll use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt; as the query engine for our data lakehouse.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt; as the catalog for versioned data management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO&lt;/strong&gt; as S3-compatible storage to hold our data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This environment will give us a powerful foundation to perform SQL operations on Apache Iceberg tables with Dremio.&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Docker&lt;/strong&gt;: Ensure Docker is installed on your machine. You can download it from &lt;a href=&quot;https://www.docker.com/&quot;&gt;Docker&apos;s official website&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Docker Compose&lt;/strong&gt;: Typically included with Docker Desktop on Windows and macOS; on Linux, it may require separate installation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 1: Create a Docker Compose File&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Open a text editor of your choice (such as VS Code, Notepad, or Sublime Text).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Create a new file named &lt;code&gt;docker-compose.yml&lt;/code&gt; in a new, empty folder. This file will define the services and configurations needed for our environment.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Copy and paste the following configuration into &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &amp;quot;3&amp;quot;

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      - iceberg
    ports:
      - 19120:19120
      
  # MinIO Storage Server
  ## Creates two buckets named lakehouse and lake
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    networks:
      - iceberg
    ports:
      - 9001:9001
      - 9000:9000
    command: [&amp;quot;server&amp;quot;, &amp;quot;/data&amp;quot;, &amp;quot;--console-address&amp;quot;, &amp;quot;:9001&amp;quot;]
    entrypoint: &amp;gt;
      /bin/sh -c &amp;quot;
      minio server /data --console-address &apos;:9001&apos; &amp;amp;
      sleep 5 &amp;amp;&amp;amp;
      mc alias set myminio http://localhost:9000 admin password &amp;amp;&amp;amp;
      mc mb myminio/lakehouse &amp;amp;&amp;amp;
      mc mb myminio/lake &amp;amp;&amp;amp;
      tail -f /dev/null
      &amp;quot;
      
  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      - iceberg

networks:
  iceberg:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Explanation of the Services&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt;: Acts as the catalog for Iceberg tables, providing version control for data through branching and merging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO&lt;/strong&gt;: Stores data in buckets, simulating an S3-compatible environment. We configure two buckets, &lt;code&gt;lakehouse&lt;/code&gt; and &lt;code&gt;lake&lt;/code&gt;, to separate structured Iceberg data from raw data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt;: The engine for querying data stored in Iceberg tables on MinIO. Dremio will allow us to use SQL for managing and analyzing our data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Start the Environment&lt;/h3&gt;
&lt;p&gt;With the &lt;code&gt;docker-compose.yml&lt;/code&gt; file ready, follow these steps to launch the environment:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Open a terminal (Command Prompt, PowerShell, or terminal app) and navigate to the folder where you saved &lt;code&gt;docker-compose.yml&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run the following command to start all services in detached mode:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Wait a few moments for the services to initialize. You can check if the services are running by using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command should list &lt;code&gt;nessie&lt;/code&gt;, &lt;code&gt;minio&lt;/code&gt;, and &lt;code&gt;dremio&lt;/code&gt; as running containers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Verify Each Service&lt;/h3&gt;
&lt;p&gt;After starting the containers, verify that each service is accessible:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt;: Open a browser and go to &lt;code&gt;http://localhost:9047&lt;/code&gt;. You should see the Dremio login screen.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO&lt;/strong&gt;: In a new browser tab, go to &lt;code&gt;http://localhost:9001&lt;/code&gt;. Log in with the username &lt;code&gt;admin&lt;/code&gt; and password &lt;code&gt;password&lt;/code&gt; to access the MinIO console.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt;: Nessie doesn’t have a direct UI in this setup, but you can interact with it through Dremio, as we’ll cover in later sections.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 4: Optional - Shutting Down the Environment&lt;/h3&gt;
&lt;p&gt;To stop the environment when you&apos;re done, run the following command in the same folder as your &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose down -v
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command stops and removes all containers and associated volumes, allowing you to start fresh next time.&lt;/p&gt;
&lt;p&gt;With our environment up and running, we’re ready to start using Dremio to create and manage Apache Iceberg tables. In the next section, we’ll explore how to connect Nessie to Dremio and begin querying our data.&lt;/p&gt;
&lt;h2&gt;Accessing Dremio and Connecting Nessie&lt;/h2&gt;
&lt;p&gt;Now that our environment is up and running, let’s connect to Dremio, which will act as our query engine, and configure Nessie as a source catalog. This setup will allow us to take advantage of Apache Iceberg’s versioned data management and perform SQL operations in a streamlined, unified environment.&lt;/p&gt;
&lt;h3&gt;Step 1: Accessing Dremio&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open Dremio in Your Browser&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Go to &lt;code&gt;http://localhost:9047&lt;/code&gt; in your browser. You should see the Dremio login screen.&lt;/li&gt;
&lt;li&gt;If this is your first time setting up Dremio, you may need to create an admin user. Follow the on-screen instructions to set up your login credentials.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Familiarize Yourself with Dremio’s Interface&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After logging in, explore Dremio’s main interface. Key areas include:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SQL Runner&lt;/strong&gt;: Where you can run SQL queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Datasets&lt;/strong&gt;: A section for browsing and managing tables, views, and sources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Jobs&lt;/strong&gt;: A log of executed queries and their performance metrics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The SQL Runner will be our primary workspace for running queries and interacting with Apache Iceberg tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Connecting Nessie as a Catalog in Dremio&lt;/h3&gt;
&lt;p&gt;Nessie acts as the catalog for our Iceberg tables, enabling us to manage data with version control features such as branching and merging. Let’s add Nessie as a source in Dremio.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a New Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In Dremio, click on the &lt;strong&gt;Add Source&lt;/strong&gt; button in the lower left corner of the interface.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure the Nessie Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Select &lt;strong&gt;Nessie&lt;/strong&gt; from the list of source types.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enter Nessie Connection Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;General Settings&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Enter a name for the source, such as &lt;code&gt;lakehouse&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Endpoint URL&lt;/strong&gt;: Enter the endpoint for the Nessie API:&lt;pre&gt;&lt;code&gt;http://nessie:19120/api/v2
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: Choose &lt;strong&gt;None&lt;/strong&gt; (since Nessie is running locally and does not require additional credentials in this setup).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage Settings&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Enter &lt;code&gt;admin&lt;/code&gt; (the MinIO username).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Enter &lt;code&gt;password&lt;/code&gt; (the MinIO password).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Enter &lt;code&gt;lakehouse&lt;/code&gt; (this is the bucket where our Iceberg tables will be stored).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.path.style.access&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.endpoint&lt;/strong&gt;: Set to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dremio.s3.compat&lt;/strong&gt;: Set to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option, as we’re running Nessie locally over HTTP.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After filling out all the fields, click &lt;strong&gt;Save&lt;/strong&gt;. Dremio will now connect to the Nessie catalog, and you’ll see &lt;code&gt;lakehouse&lt;/code&gt; (or the name you assigned) listed in the &lt;strong&gt;Datasets&lt;/strong&gt; section of Dremio’s interface.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Adding MinIO as an S3 Source in Dremio&lt;/h3&gt;
&lt;p&gt;In addition to Nessie, we can add MinIO as a general S3-compatible source in Dremio. This source allows us to access raw data files stored in the MinIO &lt;code&gt;lake&lt;/code&gt; bucket, enabling direct SQL queries on various file types (e.g., JSON, CSV, Parquet) without the need to define tables.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a New Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;Add Source&lt;/strong&gt; button in Dremio again, then select &lt;strong&gt;S3&lt;/strong&gt; as the source type.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure the MinIO Connection&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;General Settings&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Enter a name like &lt;code&gt;lake&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials&lt;/strong&gt;: Choose &lt;strong&gt;AWS access key&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Enter &lt;code&gt;admin&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Enter &lt;code&gt;password&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option, as we’re running locally.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advanced Options&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enable Compatibility Mode&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt; to ensure compatibility with MinIO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Set to &lt;code&gt;/lake&lt;/code&gt; (the bucket name for general storage).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.path.style.access&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.endpoint&lt;/strong&gt;: Set to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After configuring these settings, click &lt;strong&gt;Save&lt;/strong&gt;. Dremio will connect to MinIO, and the &lt;code&gt;lake&lt;/code&gt; source will appear in the &lt;strong&gt;Datasets&lt;/strong&gt; section.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Verifying the Connections&lt;/h3&gt;
&lt;p&gt;With both sources connected, you should see &lt;code&gt;lakehouse&lt;/code&gt; and &lt;code&gt;lake&lt;/code&gt; listed under &lt;strong&gt;Datasets&lt;/strong&gt; in Dremio. These sources provide access to structured, versioned data in the &lt;code&gt;lakehouse&lt;/code&gt; bucket and general-purpose data in the &lt;code&gt;lake&lt;/code&gt; bucket.&lt;/p&gt;
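&lt;p&gt;As a quick check of the &lt;code&gt;lake&lt;/code&gt; source, you can query a raw file directly by quoting its path. This is a sketch that assumes you&apos;ve uploaded a hypothetical &lt;code&gt;orders.csv&lt;/code&gt; file to the &lt;code&gt;lake&lt;/code&gt; bucket through the MinIO console:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Query a raw CSV file in the lake source without defining a table first
SELECT * FROM lake.&quot;orders.csv&quot; LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;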
&lt;p&gt;Let&apos;s explore how to use SQL within Dremio to create tables, insert data, and perform various data operations on our Iceberg tables.&lt;/p&gt;
&lt;h2&gt;How to Create Tables with SQL&lt;/h2&gt;
&lt;p&gt;Now that our environment is configured and connected, let&apos;s dive into creating tables using SQL in Dremio. Apache Iceberg tables in Dremio allow us to take advantage of Iceberg’s powerful features, such as schema evolution and advanced partitioning.&lt;/p&gt;
&lt;h3&gt;Creating Tables with &lt;code&gt;CREATE TABLE&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;CREATE TABLE&lt;/code&gt; command in Dremio allows us to define a new Iceberg table with specific columns, data types, and optional partitioning. Below, we’ll cover the syntax and provide examples for creating tables.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;CREATE TABLE&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE [IF NOT EXISTS] &amp;lt;table_name&amp;gt; (
  &amp;lt;column_name1&amp;gt; &amp;lt;data_type&amp;gt;,
  &amp;lt;column_name2&amp;gt; &amp;lt;data_type&amp;gt;,
  ...
)
[ PARTITION BY (&amp;lt;partition_transform&amp;gt;) ];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;IF NOT EXISTS&lt;/code&gt;&lt;/strong&gt;: Optionally add this clause to create the table only if it does not already exist.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;table_name&lt;/strong&gt;: The name of the table to be created. In our setup, use &lt;code&gt;lakehouse.&amp;lt;table_name&amp;gt;&lt;/code&gt; to specify the location in the Nessie catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;column_name / data_type&lt;/strong&gt;: Define each column with a name and a data type (e.g., &lt;code&gt;VARCHAR&lt;/code&gt;, &lt;code&gt;INT&lt;/code&gt;, &lt;code&gt;TIMESTAMP&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;PARTITION BY&lt;/code&gt;&lt;/strong&gt;: Specify a partitioning strategy, which is especially useful for Iceberg tables. Iceberg supports several partition transforms, such as year, month, day, bucket, and truncate.&lt;/li&gt;
&lt;/ul&gt;
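&lt;p&gt;The examples that follow use the &lt;code&gt;truncate&lt;/code&gt; and &lt;code&gt;month&lt;/code&gt; transforms. As a sketch of the &lt;code&gt;bucket&lt;/code&gt; transform (using a hypothetical &lt;code&gt;events&lt;/code&gt; table), hashing a high-cardinality column into a fixed number of partitions looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hash user_id into 8 buckets so rows spread evenly across partitions
CREATE TABLE lakehouse.events (
  event_id INT,
  user_id INT,
  event_time TIMESTAMP
) PARTITION BY (bucket(8, user_id));
&lt;/code&gt;&lt;/pre&gt;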
&lt;h4&gt;Example 1: Creating a Basic Table&lt;/h4&gt;
&lt;p&gt;Let’s create a simple table to store customer data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE lakehouse.customers (
  id INT,
  first_name VARCHAR,
  last_name VARCHAR,
  age INT
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;p&gt;We define a &lt;code&gt;customers&lt;/code&gt; table within the &lt;code&gt;lakehouse&lt;/code&gt; source, where each row represents a customer with an &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;first_name&lt;/code&gt;, &lt;code&gt;last_name&lt;/code&gt;, and &lt;code&gt;age&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Example 2: Creating a Partitioned Table&lt;/h4&gt;
&lt;p&gt;To optimize queries, we can partition the customers table by the first letter of the last_name column using the truncate transform.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE lakehouse.customers_partitioned (
  id INT,
  first_name VARCHAR,
  last_name VARCHAR,
  age INT
) PARTITION BY (truncate(1, last_name));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we use the &lt;code&gt;PARTITION BY&lt;/code&gt; clause with &lt;code&gt;truncate(1, last_name)&lt;/code&gt;, which will partition the data by the first character of the &lt;code&gt;last_name&lt;/code&gt; column. Partitioning helps to improve query performance by allowing Dremio to read only the relevant data based on query filters.&lt;/p&gt;
&lt;h4&gt;Example 3: Creating a Date-Partitioned Table&lt;/h4&gt;
&lt;p&gt;If we have a table to store order data, we may want to partition it by the date the order was placed.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE lakehouse.orders (
  order_id INT,
  customer_id INT,
  order_date DATE,
  total_amount DOUBLE
) PARTITION BY (month(order_date));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, &lt;code&gt;month(order_date)&lt;/code&gt; partitions the table by the month of the &lt;code&gt;order_date&lt;/code&gt; field, making it easier to run queries filtered by month, as Iceberg will only read the relevant partitions.&lt;/p&gt;
&lt;h3&gt;Viewing Tables in Dremio&lt;/h3&gt;
&lt;p&gt;Once the tables are created, you can view them in Dremio’s Datasets section:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Navigate to the lakehouse source in the Dremio interface.&lt;/li&gt;
&lt;li&gt;You should see the &lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;customers_partitioned&lt;/code&gt;, and &lt;code&gt;orders&lt;/code&gt; tables listed.&lt;/li&gt;
&lt;li&gt;Clicking on a table name opens the table, and the metadata bar on the left shows its schema, documentation, and other details.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now let&apos;s look at how to insert data into these tables using SQL.&lt;/p&gt;
&lt;h2&gt;How to Insert into Tables with SQL&lt;/h2&gt;
&lt;p&gt;With our tables created, the next step is to populate them with data. Dremio’s &lt;code&gt;INSERT INTO&lt;/code&gt; command allows us to add data to Apache Iceberg tables, whether inserting individual rows or multiple records at once.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;INSERT INTO&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO &amp;lt;table_name&amp;gt; [(&amp;lt;column1&amp;gt;, &amp;lt;column2&amp;gt;, ...)]
VALUES (value1, value2, ...), (value1, value2, ...), ...;
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;table_name:&lt;/strong&gt; The name of the table to insert data into, such as &lt;code&gt;lakehouse.customers&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;column1, column2, ...:&lt;/strong&gt; Optional column names if you’re inserting values into specific columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;VALUES&lt;/code&gt;:&lt;/strong&gt; A list of values to insert. You can insert one or more rows by adding sets of values separated by commas.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Inserting a Single Row&lt;/h4&gt;
&lt;p&gt;Let’s add a single row to the customers table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO lakehouse.customers (id, first_name, last_name, age)
VALUES (1, &apos;John&apos;, &apos;Doe&apos;, 28);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;p&gt;We specify values for each column in the customers table: &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;first_name&lt;/code&gt;, &lt;code&gt;last_name&lt;/code&gt;, and &lt;code&gt;age&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This inserts a single record for a customer named John Doe, age 28.&lt;/p&gt;
&lt;h4&gt;Example 2: Inserting Multiple Rows&lt;/h4&gt;
&lt;p&gt;To add multiple rows to a table in one command, list each row in the VALUES clause.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO lakehouse.customers (id, first_name, last_name, age)
VALUES
  (2, &apos;Jane&apos;, &apos;Smith&apos;, 34),
  (3, &apos;Alice&apos;, &apos;Johnson&apos;, 22),
  (4, &apos;Bob&apos;, &apos;Williams&apos;, 45),
  (5, &apos;Charlie&apos;, &apos;Brown&apos;, 30);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We insert multiple records into the customers table in a single command.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Each set of values corresponds to a different customer, making it easy to populate the table quickly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 3: Inserting Data into a Partitioned Table&lt;/h4&gt;
&lt;p&gt;For partitioned tables, Dremio and Iceberg automatically manage the partitioning based on the table’s partitioning rules. Let’s add some data to the customers_partitioned table, which is partitioned by the first letter of last_name.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO lakehouse.customers_partitioned (id, first_name, last_name, age)
VALUES
  (6, &apos;Emma&apos;, &apos;Anderson&apos;, 29),
  (7, &apos;Frank&apos;, &apos;Baker&apos;, 35),
  (8, &apos;Grace&apos;, &apos;Clark&apos;, 41);
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;This inserts three records into the customers_partitioned table, and Dremio will handle partitioning based on the first letter of each last_name (e.g., &amp;quot;A&amp;quot; for Anderson, &amp;quot;B&amp;quot; for Baker, and &amp;quot;C&amp;quot; for Clark).&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 4: Inserting Data with a Select Query&lt;/h4&gt;
&lt;p&gt;You can also insert data into a table by selecting data from another table. This is particularly useful if you need to copy data or load data from a staging table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;
INSERT INTO lakehouse.customers_partitioned (id, first_name, last_name, age)
SELECT id, first_name, last_name, age
FROM lakehouse.customers
WHERE age &amp;gt; 30;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;p&gt;We insert rows into &lt;code&gt;customers_partitioned&lt;/code&gt; by selecting records from the &lt;code&gt;customers&lt;/code&gt; table.
Only customers older than 30 are inserted into &lt;code&gt;customers_partitioned&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Verifying Inserted Data&lt;/h3&gt;
&lt;p&gt;To confirm that data was successfully inserted, you can use a SELECT query to retrieve and view the data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will display all rows in the customers table, allowing you to verify that your insertions were successful.&lt;/p&gt;
&lt;p&gt;With INSERT INTO, you can populate your Iceberg tables with data, either by inserting individual rows, multiple records at once, or copying data from other tables. Next, let&apos;s explore how to query this data with SQL.&lt;/p&gt;
&lt;h2&gt;How to Query Tables with SQL&lt;/h2&gt;
&lt;p&gt;With data inserted into our tables, we can now use SQL to query and analyze it. Dremio supports various SQL features, including filtering, grouping, ordering, and even Iceberg’s unique time-travel capabilities.&lt;/p&gt;
&lt;h3&gt;Basic &lt;code&gt;SELECT&lt;/code&gt; Query Syntax&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;SELECT&lt;/code&gt; command allows you to retrieve data from a table. Here’s the basic syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT [ALL | DISTINCT] &amp;lt;columns&amp;gt;
FROM &amp;lt;table_name&amp;gt;
[WHERE &amp;lt;condition&amp;gt;]
[GROUP BY &amp;lt;expression&amp;gt;]
[ORDER BY &amp;lt;column&amp;gt; [DESC]]
[LIMIT &amp;lt;count&amp;gt;];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ALL&lt;/code&gt; | &lt;code&gt;DISTINCT&lt;/code&gt;:&lt;/strong&gt; ALL returns all values, while DISTINCT eliminates duplicates. If omitted, ALL is used by default.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;columns:&lt;/strong&gt; Specify the columns you want to retrieve (e.g., id, first_name) or use * to retrieve all columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;:&lt;/strong&gt; Filters records based on a condition.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt;:&lt;/strong&gt; Groups records with similar values, allowing aggregate functions like &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, and &lt;code&gt;AVG&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ORDER BY&lt;/code&gt;:&lt;/strong&gt; Sorts results by one or more columns; add &lt;code&gt;DESC&lt;/code&gt; for descending order.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;LIMIT&lt;/code&gt;:&lt;/strong&gt; Restricts the number of rows returned.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Selecting All Columns&lt;/h4&gt;
&lt;p&gt;To view all data in the customers table, use SELECT *:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query retrieves every row and column in the customers table.&lt;/p&gt;
&lt;h4&gt;Example 2: Filtering Results with WHERE&lt;/h4&gt;
&lt;p&gt;Use the &lt;code&gt;WHERE&lt;/code&gt; clause to filter records based on a condition. For instance, let’s retrieve all customers over the age of 30:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers
WHERE age &amp;gt; 30;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query returns only the rows where age is greater than 30.&lt;/p&gt;
&lt;h4&gt;Example 3: Grouping Results with GROUP BY&lt;/h4&gt;
&lt;p&gt;The GROUP BY clause groups records based on a specified column, allowing you to calculate aggregates. For example, let’s count the number of customers by age:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT age, COUNT(*) AS customer_count
FROM lakehouse.customers
GROUP BY age;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We group customers by age and count the number of customers in each age group.&lt;/li&gt;
&lt;li&gt;The result shows unique ages and the number of customers for each age.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 4: Ordering Results with ORDER BY&lt;/h4&gt;
&lt;p&gt;You can sort query results by one or more columns. To get a list of customers ordered by age in descending order:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers
ORDER BY age DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will display customers from the oldest to the youngest.&lt;/p&gt;
&lt;h4&gt;Example 5: Limiting the Number of Rows with LIMIT&lt;/h4&gt;
&lt;p&gt;Use LIMIT to restrict the number of rows returned. This is useful for viewing a sample of your data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query will return only the first five rows in the customers table.&lt;/p&gt;
&lt;h4&gt;Example 6: Using Iceberg’s Time-Travel with Snapshots&lt;/h4&gt;
&lt;p&gt;One of Iceberg’s powerful features is time-travel, which allows you to query historical versions of a table. You can specify a particular snapshot ID or timestamp to view data as it was at that moment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query by Snapshot ID:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers AT SNAPSHOT &apos;1234567890123456789&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace &apos;1234567890123456789&apos; with the actual snapshot ID.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query by Timestamp:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers AT TIMESTAMP &apos;2024-01-01 00:00:00.000&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Replace &apos;2024-01-01 00:00:00.000&apos; with the desired timestamp. This lets you view the table as it existed at that specific time.&lt;/p&gt;
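&lt;p&gt;To find snapshot IDs and their timestamps in the first place, Dremio exposes table metadata through table functions. As a sketch (assuming the &lt;code&gt;table_snapshot&lt;/code&gt; function available in recent Dremio versions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- List snapshots for the customers table, including snapshot IDs and commit times
SELECT * FROM TABLE(table_snapshot(&apos;lakehouse.customers&apos;));
&lt;/code&gt;&lt;/pre&gt;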
&lt;h4&gt;Example 7: Aggregating with Window Functions&lt;/h4&gt;
&lt;p&gt;Window functions allow you to perform calculations across rows related to the current row within a specified window. For example, to rank customers by age, we can use &lt;code&gt;RANK()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT id, first_name, last_name, age,
  RANK() OVER (ORDER BY age DESC) AS age_rank
FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query assigns a rank based on age, with the oldest customers ranked first.&lt;/p&gt;
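&lt;p&gt;Window functions can also rank within groups by adding a &lt;code&gt;PARTITION BY&lt;/code&gt; clause inside &lt;code&gt;OVER&lt;/code&gt;. As a sketch using the &lt;code&gt;orders&lt;/code&gt; table created earlier, this ranks each customer&apos;s orders by amount:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Rank each customer&apos;s orders from largest to smallest total_amount
SELECT order_id, customer_id, total_amount,
  RANK() OVER (PARTITION BY customer_id ORDER BY total_amount DESC) AS amount_rank
FROM lakehouse.orders;
&lt;/code&gt;&lt;/pre&gt;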
&lt;h3&gt;Verifying Query Results&lt;/h3&gt;
&lt;p&gt;To ensure your queries are correct, you can run them in Dremio’s SQL Runner and examine the results in the output pane. Dremio provides performance insights and query details, making it easy to optimize and validate your SQL queries.&lt;/p&gt;
&lt;p&gt;With SELECT statements, you can retrieve, filter, group, and order data in Dremio, as well as take advantage of Iceberg’s time-travel capabilities. Next, we’ll look at how to update records in your tables using SQL.&lt;/p&gt;
&lt;h2&gt;How to Update Records with SQL&lt;/h2&gt;
&lt;p&gt;In Dremio, you can use SQL to update existing records in Apache Iceberg tables, making it easy to modify data without rewriting entire datasets. The &lt;code&gt;UPDATE&lt;/code&gt; command lets you change specific columns for rows that meet certain conditions.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;UPDATE&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE &amp;lt;table_name&amp;gt;
SET &amp;lt;column1&amp;gt; = &amp;lt;value1&amp;gt;, &amp;lt;column2&amp;gt; = &amp;lt;value2&amp;gt;, ...
[WHERE &amp;lt;condition&amp;gt;];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;table_name:&lt;/strong&gt; The name of the table you want to update, such as &lt;code&gt;lakehouse.customers&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;SET&lt;/code&gt;:&lt;/strong&gt; Specifies the columns and new values to assign.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;:&lt;/strong&gt; An optional clause to filter the rows that should be updated. Without &lt;code&gt;WHERE&lt;/code&gt;, all rows in the table will be updated.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Updating a Single Column&lt;/h4&gt;
&lt;p&gt;Suppose we want to update the age of a specific customer. We can use the WHERE clause to target the correct row:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE lakehouse.customers
SET age = 29
WHERE id = 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We update the &lt;code&gt;age&lt;/code&gt; of the customer with &lt;code&gt;id = 1&lt;/code&gt; to 29.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Only rows that match the condition &lt;code&gt;id = 1&lt;/code&gt; are affected.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 2: Updating Multiple Columns&lt;/h4&gt;
&lt;p&gt;You can update multiple columns in a single UPDATE command. Let’s change both the first_name and last_name of a customer:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE lakehouse.customers
SET first_name = &apos;Jonathan&apos;, last_name = &apos;Doe-Smith&apos;
WHERE id = 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We update both &lt;code&gt;first_name&lt;/code&gt; and &lt;code&gt;last_name&lt;/code&gt; for the customer with &lt;code&gt;id&lt;/code&gt; = 1.&lt;/li&gt;
&lt;li&gt;This operation only affects rows that meet the &lt;code&gt;WHERE&lt;/code&gt; condition.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 3: Conditional Updates with WHERE&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause allows you to apply updates based on specific conditions. For instance, let’s increase the age of all customers under 25 by 1 year:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE lakehouse.customers
SET age = age + 1
WHERE age &amp;lt; 25;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We increase the age by 1 for all customers where age is less than 25.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach is useful for performing bulk updates based on a condition.&lt;/p&gt;
&lt;h4&gt;Example 4: Updating Records in a Specific Branch&lt;/h4&gt;
&lt;p&gt;If you’re using Nessie to manage versions, you can update records within a specific branch. This allows you to make updates in an isolated environment, which you can later merge into the main branch.&lt;/p&gt;
&lt;p&gt;First, create a new branch:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE BRANCH development IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then update records in that branch:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;UPDATE lakehouse.customers
AT BRANCH &apos;development&apos;
SET age = 30
WHERE id = 3;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We update the age of the customer with id = 3 to 30 on the development branch.&lt;/li&gt;
&lt;li&gt;This change will only affect the specified branch until it is merged back into main.&lt;/li&gt;
&lt;li&gt;Branch-scoped updates like this are only available for Nessie sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When you&apos;re satisfied with the changes, merge the branch back into &lt;code&gt;main&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE BRANCH development INTO main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Verifying Updates&lt;/h3&gt;
&lt;p&gt;To confirm your updates, you can query the table to view the modified records:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers WHERE id = 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query will display the updated row, allowing you to verify that the changes were applied successfully.&lt;/p&gt;
&lt;h3&gt;Important Notes on Updates&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Transactional Safety:&lt;/strong&gt; With Apache Iceberg, updates are transactional, so they ensure data consistency and reliability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Using Branches:&lt;/strong&gt; When working with branches in Nessie, remember to specify the branch in your &lt;code&gt;UPDATE&lt;/code&gt; command if you want to limit changes to a specific branch.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using the &lt;code&gt;UPDATE&lt;/code&gt; command, you can easily modify data in your Apache Iceberg tables in Dremio. Whether updating single rows or multiple records based on conditions, Dremio’s SQL capabilities make data management flexible and efficient. In the next section, we’ll explore how to alter a table’s structure using SQL.&lt;/p&gt;
&lt;h2&gt;How to Alter a Table with SQL&lt;/h2&gt;
&lt;p&gt;As your data needs evolve, you may need to modify the structure of an Apache Iceberg table. Dremio’s &lt;code&gt;ALTER TABLE&lt;/code&gt; command provides flexibility to add, drop, or modify columns in existing tables, allowing your schema to evolve without significant disruptions.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;ALTER TABLE&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE &amp;lt;table_name&amp;gt;
[ ADD COLUMNS ( &amp;lt;column_name&amp;gt; &amp;lt;data_type&amp;gt; [, ...] ) ]
[ DROP COLUMN &amp;lt;column_name&amp;gt; ]
[ ALTER COLUMN &amp;lt;column_name&amp;gt; SET MASKING POLICY &amp;lt;policy_name&amp;gt; ]
[ MODIFY COLUMN &amp;lt;column_name&amp;gt; &amp;lt;new_data_type&amp;gt; ];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;table_name:&lt;/strong&gt; The name of the table you want to alter, such as &lt;code&gt;lakehouse.customers&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ADD COLUMNS&lt;/code&gt;:&lt;/strong&gt; Adds new columns to the table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;DROP COLUMN&lt;/code&gt;:&lt;/strong&gt; Removes a specified column from the table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;ALTER COLUMN&lt;/code&gt;:&lt;/strong&gt; Allows you to set a masking policy for data security.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;MODIFY COLUMN&lt;/code&gt;:&lt;/strong&gt; Changes the data type of an existing column.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Adding a New Column&lt;/h4&gt;
&lt;p&gt;To add a new column to an existing table, use the &lt;code&gt;ADD COLUMNS&lt;/code&gt; clause. Let’s add an email column to the customers table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
ADD COLUMNS (email VARCHAR);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We add a new column email with the data type &lt;code&gt;VARCHAR&lt;/code&gt; to store customer email addresses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All existing rows will have &lt;code&gt;NULL&lt;/code&gt; as the default value in the new email column until data is populated.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
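&lt;p&gt;To backfill the new column, you can run an &lt;code&gt;UPDATE&lt;/code&gt; afterward. A sketch that derives placeholder addresses from the existing name columns (the &lt;code&gt;example.com&lt;/code&gt; domain is purely illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Populate the new email column from existing name columns
UPDATE lakehouse.customers
SET email = LOWER(first_name) || &apos;.&apos; || LOWER(last_name) || &apos;@example.com&apos;
WHERE email IS NULL;
&lt;/code&gt;&lt;/pre&gt;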
&lt;h4&gt;Example 2: Dropping a Column&lt;/h4&gt;
&lt;p&gt;If a column is no longer needed, you can remove it using &lt;code&gt;DROP COLUMN&lt;/code&gt;. Let’s remove the age column from the customers table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
DROP COLUMN age;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The age column is removed from the customers table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Once a column is dropped, the action cannot be undone, so use this command carefully.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 3: Modifying a Column’s Data Type&lt;/h4&gt;
&lt;p&gt;To change the data type of an existing column, use &lt;code&gt;MODIFY COLUMN&lt;/code&gt;. For example, let’s change the id column from &lt;code&gt;INT&lt;/code&gt; to &lt;code&gt;BIGINT&lt;/code&gt; to allow larger values.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
MODIFY COLUMN id BIGINT;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We modify the id column to have a data type of &lt;code&gt;BIGINT&lt;/code&gt;, which can store larger values than &lt;code&gt;INT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Changing data types is restricted to compatible types (e.g., &lt;code&gt;INT&lt;/code&gt; to &lt;code&gt;BIGINT&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 4: Setting a Masking Policy on a Column&lt;/h4&gt;
&lt;p&gt;Data masking can enhance data security by obscuring sensitive information. In Dremio, you can apply a masking policy to a column, making sensitive data less accessible to unauthorized users.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
ALTER COLUMN email
SET MASKING POLICY mask_email (email);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We set a masking policy called &lt;code&gt;mask_email&lt;/code&gt; on the email column. (These policies are UDFs that you must create beforehand.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The masking policy defines how the data in this column is obscured when queried by users who do not have permission to view the raw data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
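&lt;p&gt;Because the masking policy is a UDF, it has to exist before you can attach it. As a rough sketch (the function body and masking logic here are illustrative assumptions, not part of the original example), such a UDF might look like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Hypothetical masking UDF: hides the local part of an email address
CREATE FUNCTION mask_email (email VARCHAR)
RETURNS VARCHAR
RETURN SELECT CONCAT(&apos;*****&apos;, SUBSTR(email, POSITION(&apos;@&apos; IN email)));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With a function like this in place, the &lt;code&gt;SET MASKING POLICY&lt;/code&gt; statement can reference it by name.&lt;/p&gt;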
&lt;h4&gt;Example 5: Adding a Partition Field&lt;/h4&gt;
&lt;p&gt;For Iceberg tables, you can adjust partitioning without rewriting the table. Let’s add a partition field to the customers table to partition data by the first letter of &lt;code&gt;last_name&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE lakehouse.customers
ADD PARTITION FIELD truncate(1, last_name);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We partition the customers table by the first letter of &lt;code&gt;last_name&lt;/code&gt;, making queries more efficient when filtering by &lt;code&gt;last_name&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Iceberg’s partition evolution feature enables you to add or change partition fields without rewriting the existing data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
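&lt;p&gt;Partition evolution works in both directions. If a transform no longer matches your query patterns, it can be removed the same way it was added; assuming the same transform syntax as the example above, a sketch would be:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Remove the partition field added earlier; existing data files are not rewritten
ALTER TABLE lakehouse.customers
DROP PARTITION FIELD truncate(1, last_name);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only data written after the change follows the new partition spec, which is what makes this operation cheap.&lt;/p&gt;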
&lt;h3&gt;Verifying Alterations&lt;/h3&gt;
&lt;p&gt;After altering a table, you can verify the changes by checking the schema in Dremio’s Datasets section or by running a &lt;code&gt;SELECT&lt;/code&gt; query to observe the modified structure:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
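&lt;p&gt;A &lt;code&gt;SELECT *&lt;/code&gt; shows the data, but if you only want the column list, inspecting the schema directly is quicker. Assuming your Dremio version supports it, a &lt;code&gt;DESCRIBE&lt;/code&gt; statement returns each column and its type without scanning any rows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- List columns and data types for the altered table
DESCRIBE TABLE lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;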
&lt;h4&gt;Important Notes on Table Alterations&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution:&lt;/strong&gt; Apache Iceberg supports schema evolution, allowing you to make changes to table structure with minimal disruption.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Evolution:&lt;/strong&gt; Changes to partitioning do not require data rewriting, making it easy to adapt your partition strategy over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Masking:&lt;/strong&gt; Applying masking policies ensures sensitive information is protected while maintaining accessibility for authorized users.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using the &lt;code&gt;ALTER TABLE&lt;/code&gt; command in Dremio, you can evolve the structure of your Apache Iceberg tables by adding, modifying, or removing columns, as well as updating partitioning strategies. In the next section, we’ll look at how to delete records from tables using SQL.&lt;/p&gt;
&lt;h2&gt;How to Delete Records with SQL&lt;/h2&gt;
&lt;p&gt;Deleting specific records from an Apache Iceberg table in Dremio can be done using the &lt;code&gt;DELETE&lt;/code&gt; command. This allows you to remove rows based on conditions, keeping your data relevant and up-to-date without needing to rewrite the entire dataset.&lt;/p&gt;
&lt;h3&gt;Basic Syntax for &lt;code&gt;DELETE&lt;/code&gt;&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;DELETE FROM &amp;lt;table_name&amp;gt;
[WHERE &amp;lt;condition&amp;gt;];
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;table_name:&lt;/strong&gt; The name of the table from which you want to delete records, such as &lt;code&gt;lakehouse.customers&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;:&lt;/strong&gt; An optional clause that filters rows based on a condition. Without &lt;code&gt;WHERE&lt;/code&gt;, all rows in the table will be deleted.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 1: Deleting Specific Records&lt;/h4&gt;
&lt;p&gt;Suppose we want to delete records of customers under the age of 18. We can use the WHERE clause to filter these rows and remove them from the customers table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;DELETE FROM lakehouse.customers
WHERE age &amp;lt; 18;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Only rows where age is less than 18 are deleted from the customers table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause ensures that only specific records are affected by the deletion.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 2: Deleting All Records&lt;/h4&gt;
&lt;p&gt;If you need to clear all data from a table but keep the table structure intact, simply omit the &lt;code&gt;WHERE&lt;/code&gt; clause.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;DELETE FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Removes all rows from the customers table without deleting the table itself.&lt;/li&gt;
&lt;li&gt;The table schema remains intact, allowing new data to be inserted into the table later.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example 3: Deleting Records in a Specific Branch&lt;/h4&gt;
&lt;p&gt;When using Nessie for versioned data management, you can delete records in an isolated branch. This allows for safe experimentation without affecting the main data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;DELETE FROM lakehouse.customers
AT BRANCH development
WHERE age &amp;gt; 60;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We delete records where age is greater than 60 on the development branch.&lt;/li&gt;
&lt;li&gt;The main branch remains unaffected by this operation until the changes are merged back.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Verifying Deletions&lt;/h3&gt;
&lt;p&gt;To confirm that records were successfully deleted, run a &lt;code&gt;SELECT&lt;/code&gt; query on the table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM lakehouse.customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will display the remaining records, allowing you to verify that the desired rows were removed.&lt;/p&gt;
&lt;h4&gt;Important Notes on Deleting Records&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Transactional Deletions:&lt;/strong&gt; With Iceberg’s support for ACID compliance, deletions are transactional, ensuring consistency and reliability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control with Branches:&lt;/strong&gt; Using Nessie’s branching capabilities, you can isolate deletions in specific branches, allowing safe experimentation.&lt;/li&gt;
&lt;/ul&gt;
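&lt;p&gt;Because Iceberg deletions are transactional, each &lt;code&gt;DELETE&lt;/code&gt; produces a new table snapshot rather than destroying data in place. Assuming your Dremio version exposes Iceberg metadata functions, you can sketch a check of the snapshot history like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Inspect snapshot history; the delete should appear as the newest entry
SELECT * FROM TABLE(table_history(&apos;lakehouse.customers&apos;));
&lt;/code&gt;&lt;/pre&gt;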
&lt;p&gt;The &lt;code&gt;DELETE&lt;/code&gt; command in Dremio provides a straightforward way to remove unwanted data from your Apache Iceberg tables. This completes the basics of SQL operations with Apache Iceberg and Dremio, empowering you to handle data from creation to deletion with ease.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We explored the essentials of SQL operations using Apache Iceberg and Dremio. By combining Dremio’s powerful query engine with Apache Iceberg’s robust data management capabilities, you can efficiently handle large datasets, support schema evolution, and take advantage of advanced features like time-travel and branching. Here’s a quick recap of what we covered:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What is SQL, Apache Iceberg, and Dremio&lt;/strong&gt;: We introduced the importance of SQL, Apache Iceberg as a data lakehouse table format, and Dremio as a platform that enhances querying capabilities in a data lakehouse environment.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Setting Up an Environment with Dremio, Nessie, and MinIO&lt;/strong&gt;: We configured a local environment using Docker Compose, allowing us to work with Dremio, Nessie for version control, and MinIO for S3-compatible storage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accessing Dremio and Connecting Nessie&lt;/strong&gt;: We connected Dremio to Nessie and MinIO, providing a foundation for managing and querying data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Create Tables with SQL&lt;/strong&gt;: Using the &lt;code&gt;CREATE TABLE&lt;/code&gt; command, we created Apache Iceberg tables, including partitioned tables for optimized performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Insert into Tables with SQL&lt;/strong&gt;: We populated our tables using the &lt;code&gt;INSERT INTO&lt;/code&gt; command, demonstrating single and batch inserts.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Query Tables with SQL&lt;/strong&gt;: With &lt;code&gt;SELECT&lt;/code&gt; queries, we retrieved data, applied filters, grouped results, and explored Iceberg’s time-travel capabilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Update Records with SQL&lt;/strong&gt;: We used the &lt;code&gt;UPDATE&lt;/code&gt; command to modify specific records based on conditions, showing how to evolve data as needs change.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Alter a Table with SQL&lt;/strong&gt;: Using &lt;code&gt;ALTER TABLE&lt;/code&gt;, we modified the structure of our tables, adding, dropping, and modifying columns as our data needs evolved.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How to Delete Records with SQL&lt;/strong&gt;: Finally, we covered the &lt;code&gt;DELETE&lt;/code&gt; command, enabling record removal based on conditions and managing data cleanly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;With these SQL basics under your belt, here are a few ways to continue expanding your skills with Apache Iceberg and Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Explore More SQL Functions&lt;/strong&gt;: Dive deeper into SQL functions supported by Dremio to handle more complex analytical tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Experiment with Data Branching and Merging&lt;/strong&gt;: Use Nessie’s branching and merging capabilities for safe experimentation, making it easier to test changes without affecting production data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Leverage Dremio Reflections&lt;/strong&gt;: Learn about Dremio’s Reflections feature to accelerate queries and enhance performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale to the Cloud&lt;/strong&gt;: Consider deploying Dremio and Iceberg in a cloud environment for greater scalability and to integrate with larger data sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By mastering these core SQL operations, you’re well-prepared to build, maintain, and analyze data in a modern data lakehouse architecture. Whether you’re managing structured or unstructured data, Dremio and Apache Iceberg offer the tools you need for efficient, flexible, and high-performance data workflows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introtosql&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Dremio, Apache Iceberg and their role in AI-Ready Data</title><link>https://iceberglakehouse.com/posts/2024-11-Dremio-and-AI-Ready-Data/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-Dremio-and-AI-Ready-Data/</guid><description>- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-fo...</description><pubDate>Tue, 05 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AI models, whether for machine learning or deep learning, require vast amounts of data to train, validate, and test. But not just any data will do—this data must be accessible, scalable, and optimized for efficient processing. This is where the concept of &amp;quot;AI-ready data&amp;quot; comes into play.&lt;/p&gt;
&lt;p&gt;&amp;quot;AI-ready data&amp;quot; refers to data that meets specific criteria to support the demands of AI development: it must be easily accessible across tools and environments, scalable to large volumes, and governed to ensure compliance. Meeting these criteria can be challenging, especially given the complexity of modern data landscapes that span data lakes, databases, warehouses, and more.&lt;/p&gt;
&lt;p&gt;Let&apos;s explore the critical roles Dremio and Apache Iceberg play in making data AI-ready. By leveraging these tools, data teams can prepare, manage, and optimize structured data to meet the demands of AI workloads, helping organizations scale their AI development efficiently.&lt;/p&gt;
&lt;h2&gt;What is AI-Ready Data?&lt;/h2&gt;
&lt;p&gt;For data to be truly AI-ready, it must meet several key requirements. Here’s a look at the core attributes of AI-ready data and why each is essential in AI development:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accessibility&lt;/strong&gt;: Data should be accessible from various environments and applications. AI models often rely on multiple data sources, and having data that’s readily accessible without extensive ETL (Extract, Transform, Load) processes saves time and resources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: AI workloads are typically data-intensive. To scale, data must be stored in formats that allow for efficient retrieval and processing at scale, without performance bottlenecks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Transformability&lt;/strong&gt;: AI models often require data in a particular structure or with certain attributes. AI-ready data should support complex transformations to fit the needs of different models, whether it’s feature engineering, data normalization, or other preprocessing steps.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Governance&lt;/strong&gt;: Ensuring compliance is crucial, especially when working with sensitive data. Governance controls, such as access rules and audit trails, ensure that data usage aligns with privacy policies and regulatory requirements. Governance is also important for model accuracy—making sure the model isn’t trained on irrelevant or unauthorized data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Preparing data that meets these criteria can be difficult, particularly when handling vast amounts of structured and unstructured data across multiple systems. However, with tools like Apache Iceberg and Dremio, data teams can address these challenges and streamline structured data preparation for AI workloads.&lt;/p&gt;
&lt;h2&gt;How Apache Iceberg Enables AI-Ready Structured Data&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is a powerful open table format designed for large-scale, structured data in data lakes. Its unique capabilities help make data AI-ready by ensuring accessibility, scalability, and flexibility in data management. Here’s how Iceberg supports the requirements of AI-ready data:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accessible, Transformable Data at Scale&lt;/strong&gt;: Apache Iceberg enables large-scale structured data to be easily accessed and transformed within data lakes, ensuring that data can be queried and modified without the complexities typically associated with data lake storage. Iceberg’s robust schema evolution and versioning features allow data to stay accessible and flexible, accommodating changing requirements for AI models.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Historical Data Benchmarking with Time Travel&lt;/strong&gt;: Iceberg’s time-travel functionality allows data teams to query historical versions of data, making it possible to benchmark models against different points in time. This is invaluable for training models on data snapshots from various periods, allowing comparison and validation with past data states.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition Evolution for Data Optimization&lt;/strong&gt;: Iceberg’s partition evolution feature enables experimentation with partitioning strategies, helping data teams optimize how data is organized and retrieved. Optimized partitioning allows for faster data access and retrieval, which can reduce model training time and improve overall efficiency.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;With these features, Apache Iceberg helps maintain structured data that’s accessible, transformable, and optimized, creating a robust foundation for AI workloads in data lakes.&lt;/p&gt;
&lt;h2&gt;How Dremio Empowers AI-Ready Data Management&lt;/h2&gt;
&lt;p&gt;Dremio provides a unified data platform that enhances the management and accessibility of data, making it an ideal tool for preparing AI-ready data. Here are some of the ways Dremio’s features support AI development:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;First-Class Support for Apache Iceberg&lt;/strong&gt;: Dremio integrates seamlessly with Apache Iceberg, allowing users to manage and query Iceberg tables without complex configurations. This makes it easier for data teams to leverage Iceberg’s capabilities directly within Dremio.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Federation Across Multiple Sources&lt;/strong&gt;: Dremio enables federated queries across databases, data warehouses, data lakes, and lakehouse catalogs, providing a unified view of disparate data sources. This removes data silos and allows AI models to access and utilize data from a variety of sources without moving or duplicating data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Curated Views for Simplified Data Wrangling&lt;/strong&gt;: Dremio allows users to create curated views on top of multiple data sources, simplifying data wrangling and transformation. These views provide a streamlined view of the data, making it easier to prepare data for AI without extensive data processing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integrated Catalog with Versioning&lt;/strong&gt;: Dremio’s integrated catalog supports versioning with multi-table branching, merging, and tagging. This allows data teams to create replicable data snapshots and zero-copy experimental environments, making it easy to experiment, tag datasets, and manage different versions of data used for AI development.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Arrow Flight for Fast Data Access&lt;/strong&gt;: Dremio supports Apache Arrow Flight, a high-performance protocol that allows data to be pulled from Dremio at speeds much faster than traditional JDBC. This significantly accelerates data retrieval for model training, reducing overall model development time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Comprehensive SQL Functions for Data Wrangling&lt;/strong&gt;: Dremio provides a rich set of SQL functions that help data teams perform complex transformations and data wrangling tasks, making it efficient to prepare data for AI workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Granular Access Controls&lt;/strong&gt;: Dremio offers role-based, row-based, and column-based access controls, ensuring that only authorized data is used for model training. This helps maintain compliance and prevents models from training on sensitive or unauthorized data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Acceleration with Data Reflections&lt;/strong&gt;: Dremio’s data reflections feature enables efficient query acceleration by creating optimized representations of datasets, tailored for specific types of queries. Data reflections reduce the need to repeatedly process raw data, instead offering pre-aggregated or pre-sorted versions that speed up query performance. For AI workloads, this translates to faster data retrieval, especially when models require frequent access to large or complex datasets, significantly reducing wait times during model training and experimentation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By combining data federation, powerful data wrangling tools, integrated catalog management, and high-performance data access, Dremio empowers teams to manage data effectively for AI, supporting a seamless flow from raw data to AI-ready datasets.&lt;/p&gt;
&lt;h2&gt;Use Cases: Dremio and Apache Iceberg for AI Workloads&lt;/h2&gt;
&lt;p&gt;Let’s look at some practical scenarios where Dremio and Apache Iceberg streamline data preparation for AI workloads, showcasing how they help overcome common challenges in AI development:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Training Models on Historical Data Snapshots&lt;/strong&gt;: With Iceberg’s time-travel capabilities, data teams can train models on historical snapshots, enabling AI models to learn from data as it existed in different periods. This is particularly useful for time-sensitive applications, such as financial forecasting or customer behavior analysis, where benchmarking against historical trends is essential.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Experimenting with Data Optimization for Faster Model Training&lt;/strong&gt;: Iceberg’s partition evolution and Dremio’s curated views allow data teams to experiment with data layouts and transformations. By optimizing data partitioning, models can retrieve data faster, resulting in more efficient model training and faster experimentation cycles.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Creating Zero-Copy Experimental Environments&lt;/strong&gt;: With Dremio’s integrated catalog versioning, data teams can create isolated, zero-copy environments to test AI models on different datasets or data versions without affecting the original data. This enables rapid prototyping and experimentation, allowing data scientists to try different approaches and configurations safely and efficiently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unified Access to Diverse Data Sources for AI Development&lt;/strong&gt;: Dremio’s federated query capabilities enable AI models to access data across multiple sources, such as relational databases, data warehouses, and data lakes. This allows data scientists to bring together diverse datasets without moving or duplicating data, providing a more comprehensive training set for their models.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ensuring Compliance with Fine-Grained Access Controls&lt;/strong&gt;: Dremio’s role-based, row-based, and column-based access controls ensure that AI models train only on permissible data. This level of data governance is crucial for models that must meet regulatory standards, such as those in healthcare, finance, or other highly regulated industries.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Having access to &amp;quot;AI-ready data&amp;quot; is paramount for developing models that are accurate, efficient, and compliant. Dremio and Apache Iceberg are instrumental in creating a robust foundation for AI workloads, making it easy to access, transform, and manage large-scale structured data.&lt;/p&gt;
&lt;p&gt;With Iceberg, data teams gain control over data management at scale, leveraging features like time travel and partition evolution to keep data organized and optimized. Dremio complements this with seamless Iceberg integration, federated data access, and powerful data wrangling capabilities, enabling a smooth path from raw data to AI-ready datasets.&lt;/p&gt;
&lt;p&gt;Together, Dremio and Apache Iceberg provide an end-to-end solution that empowers data teams to meet the demands of modern AI. Whether you’re building models on historical data, experimenting with data partitions, or ensuring compliance with strict governance rules, Dremio and Iceberg offer the tools you need to manage and optimize data, setting the stage for successful AI development.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dremioaireadydata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Introduction to Cargo and cargo.toml</title><link>https://iceberglakehouse.com/posts/2024-11-rust-cargo-toml/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-rust-cargo-toml/</guid><description>
When working with Rust, Cargo is your go-to tool for managing dependencies, building, and running your projects. Acting as Rust&apos;s package manager and...</description><pubDate>Tue, 05 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When working with Rust, Cargo is your go-to tool for managing dependencies, building, and running your projects. Acting as Rust&apos;s package manager and build system, Cargo simplifies a lot of the heavy lifting in a project’s lifecycle. Central to this is the &lt;code&gt;cargo.toml&lt;/code&gt; file, which is at the heart of every Cargo-managed Rust project.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;Cargo.toml&lt;/code&gt; file serves as the project&apos;s configuration file, defining essential details like metadata, dependencies, and optional features. This file not only controls which libraries your project depends on but also provides configurations for different build profiles, conditional compilation features, and workspace settings.&lt;/p&gt;
&lt;p&gt;Understanding &lt;code&gt;Cargo.toml&lt;/code&gt; is crucial for managing dependencies efficiently, setting up multiple crates within a workspace, and optimizing your project&apos;s build performance. In this guide, we’ll explore how &lt;code&gt;Cargo.toml&lt;/code&gt; is structured, how to add dependencies, define build configurations, and make the most of this file to manage your Rust projects effectively.&lt;/p&gt;
&lt;h2&gt;Structure of the &lt;code&gt;Cargo.toml&lt;/code&gt; File&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;Cargo.toml&lt;/code&gt; file is organized into multiple sections, each serving a specific purpose in configuring various aspects of a Rust project. Let’s break down the key sections you’ll encounter:&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[package]&lt;/code&gt;: General Project Metadata&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;[package]&lt;/code&gt; section contains metadata about your Rust project. It includes fields like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;name&lt;/code&gt;: The name of your package, which should be unique if you’re publishing to crates.io.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;version&lt;/code&gt;: The version of your project, following Semantic Versioning (e.g., &lt;code&gt;1.0.0&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;authors&lt;/code&gt;: Your name or the names of the contributors (optional).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;edition&lt;/code&gt;: Specifies the Rust edition you’re using, such as &lt;code&gt;2018&lt;/code&gt; or &lt;code&gt;2021&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;my_project&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
authors = [&amp;quot;Alex Merced &amp;lt;alex@example.com&amp;gt;&amp;quot;]
edition = &amp;quot;2021&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;code&gt;[dependencies]&lt;/code&gt;: Managing Project Dependencies&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;[dependencies]&lt;/code&gt; section lists the external libraries your project relies on. For each dependency, you specify the name and version, and Cargo will automatically download and manage these dependencies.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = &amp;quot;1.0&amp;quot;
reqwest = { version = &amp;quot;0.11&amp;quot;, features = [&amp;quot;json&amp;quot;] }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example includes serde with a version constraint and reqwest with specific features enabled.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[dev-dependencies]&lt;/code&gt;: Development-Only Dependencies&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;[dev-dependencies]&lt;/code&gt; works like &lt;code&gt;[dependencies]&lt;/code&gt; but is only used for development or testing. For example, if you need a library solely for testing, you can add it here, and it won’t be included in the final build.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dev-dependencies]
rand = &amp;quot;0.8&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;&lt;code&gt;[features]&lt;/code&gt;: Defining Optional Features&lt;/h3&gt;
&lt;p&gt;Features allow you to conditionally include dependencies or enable specific parts of your project. They’re useful for creating optional functionality and reducing bloat in builds.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[features]
default = [&amp;quot;json_support&amp;quot;]
json_support = [&amp;quot;serde&amp;quot;, &amp;quot;serde_json&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the &lt;code&gt;json_support&lt;/code&gt; feature adds &lt;code&gt;serde&lt;/code&gt; and &lt;code&gt;serde_json&lt;/code&gt; libraries, and it’s included by default.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[profile]&lt;/code&gt;: Configurations for Build Profiles&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;[profile]&lt;/code&gt; section allows customization of build settings for different profiles, such as dev for development and release for optimized production builds. Adjusting these settings helps optimize for speed, size, or other factors based on your environment.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.release]
opt-level = 3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the opt-level for release builds is set to 3, the highest optimization level.&lt;/p&gt;
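&lt;p&gt;These settings take effect automatically when you build with the matching profile:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build            # uses the dev profile
cargo build --release  # uses the release profile shown above
&lt;/code&gt;&lt;/pre&gt;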
&lt;p&gt;These sections provide a foundational understanding of cargo.toml. In the following sections, we’ll dive into more details on each and show how to use them effectively.&lt;/p&gt;
&lt;h2&gt;Configuring Project Metadata&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;[package]&lt;/code&gt; section of &lt;code&gt;cargo.toml&lt;/code&gt; provides essential metadata about your project, which can be useful for project organization, publishing, and versioning. Let’s explore the common fields used within this section and their purposes:&lt;/p&gt;
&lt;h3&gt;Key Fields in &lt;code&gt;[package]&lt;/code&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;name&lt;/code&gt;&lt;/strong&gt;: The name of your project, which should be unique if you plan to publish to crates.io. This name is how users will identify and include your crate as a dependency.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;name = &amp;quot;my_project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;version&lt;/code&gt;&lt;/strong&gt;: Specifies the current version of your project. Cargo follows Semantic Versioning, so use a version format like 0.1.0 or 1.0.0. This field is especially important for tracking releases.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;  version = &amp;quot;0.1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;authors&lt;/code&gt;&lt;/strong&gt;: An optional list of contributors’ names or emails. Although it’s not mandatory, adding authors can help document who has worked on the project.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;authors = [&amp;quot;Alex Merced &amp;lt;alex@example.com&amp;gt;&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;edition&lt;/code&gt;&lt;/strong&gt;: Specifies the Rust edition your project is based on. The most common editions are 2018 and 2021. This setting ensures compatibility with language features specific to each edition.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;edition = &amp;quot;2021&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;description&lt;/code&gt;&lt;/strong&gt;: A short description of your project, which is optional but useful if you plan to publish your crate. It gives users a quick idea of what your project does.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;description = &amp;quot;A simple Rust project demonstrating cargo.toml usage&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;license:&lt;/code&gt;&lt;/strong&gt; Defines the license under which your project is distributed. Common choices include MIT, Apache-2.0, or GPL-3.0. Licensing helps clarify legal use for other developers and users.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;license = &amp;quot;MIT&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;repository:&lt;/code&gt;&lt;/strong&gt; A link to the project’s repository (e.g., GitHub). Providing this link is helpful for users who want to see the source code or contribute.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;repository = &amp;quot;https://github.com/alexmerced/my_project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;documentation:&lt;/code&gt;&lt;/strong&gt; A URL linking to the project’s documentation. This is especially useful if you’ve hosted API docs, like those generated by cargo doc, on platforms such as docs.rs.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;documentation = &amp;quot;https://docs.rs/my_project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example &lt;code&gt;[package]&lt;/code&gt; Section&lt;/h3&gt;
&lt;p&gt;Here’s an example that combines these fields to form a complete &lt;code&gt;[package]&lt;/code&gt; configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;my_project&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
authors = [&amp;quot;Alex Merced &amp;lt;alex@example.com&amp;gt;&amp;quot;]
edition = &amp;quot;2021&amp;quot;
description = &amp;quot;A simple Rust project demonstrating cargo.toml usage&amp;quot;
license = &amp;quot;MIT&amp;quot;
repository = &amp;quot;https://github.com/alexmerced/my_project&amp;quot;
documentation = &amp;quot;https://docs.rs/my_project&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup makes your project easier to understand, document, and share. With a well-configured &lt;code&gt;[package]&lt;/code&gt; section, your project gains a professional touch, preparing it for development, collaboration, or even public release on crates.io.&lt;/p&gt;
&lt;h2&gt;Adding and Managing Dependencies&lt;/h2&gt;
&lt;p&gt;Dependencies are a core aspect of any Rust project, enabling you to reuse code and leverage external libraries. The &lt;code&gt;[dependencies]&lt;/code&gt; section of &lt;code&gt;cargo.toml&lt;/code&gt; lets you specify which libraries (or &amp;quot;crates&amp;quot;) your project requires and manages them efficiently.&lt;/p&gt;
&lt;h3&gt;Basic Dependency Syntax&lt;/h3&gt;
&lt;p&gt;To add a dependency, simply specify the crate name and version in the &lt;code&gt;[dependencies]&lt;/code&gt; section. Cargo will automatically fetch and compile it for you.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = &amp;quot;1.0&amp;quot;  # Add Serde library for serialization/deserialization
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the serde crate will be resolved to the newest published version compatible with &lt;code&gt;1.0&lt;/code&gt;. Cargo&apos;s version requirements follow Semantic Versioning, so a bare &lt;code&gt;1.0&lt;/code&gt; accepts any version from &lt;code&gt;1.0.0&lt;/code&gt; up to, but not including, &lt;code&gt;2.0.0&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Specifying Dependency Versions&lt;/h3&gt;
&lt;p&gt;You can control the version of each dependency by using different version specifiers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Exact Version&lt;/strong&gt;: Only uses this exact version.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;serde = &amp;quot;=1.0.104&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Caret (&lt;code&gt;^&lt;/code&gt;)&lt;/strong&gt;: Allows updates within the same major version (default behavior).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;serde = &amp;quot;^1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tilde (&lt;code&gt;~&lt;/code&gt;)&lt;/strong&gt;: Allows patch-level updates within the same minor version (here, any version from &lt;code&gt;1.0.104&lt;/code&gt; up to, but not including, &lt;code&gt;1.1.0&lt;/code&gt;).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;serde = &amp;quot;~1.0.104&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Wildcard (&lt;code&gt;*&lt;/code&gt;)&lt;/strong&gt;: Accepts any version, which can lead to unpredictable changes in your project.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;serde = &amp;quot;*&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Using Features with Dependencies&lt;/h3&gt;
&lt;p&gt;Some crates offer optional features that you can enable in cargo.toml. For instance, the reqwest crate has features for JSON support. You can enable these by specifying them within the dependency configuration.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
reqwest = { version = &amp;quot;0.11&amp;quot;, features = [&amp;quot;json&amp;quot;] }
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Adding Git Dependencies&lt;/h3&gt;
&lt;p&gt;Cargo supports dependencies directly from Git repositories, allowing you to include unreleased versions or custom forks. You can also specify a branch, tag, or commit to ensure consistency.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
my_crate = { git = &amp;quot;https://github.com/user/my_crate.git&amp;quot;, branch = &amp;quot;main&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Path Dependencies for Local Crates&lt;/h3&gt;
&lt;p&gt;If you have a local crate you want to use as a dependency, specify its path. This is useful for working on related crates without publishing them.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
my_local_crate = { path = &amp;quot;../my_local_crate&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
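&lt;p&gt;Note that crates.io ignores &lt;code&gt;path&lt;/code&gt; keys, so if you plan to publish, specify a &lt;code&gt;version&lt;/code&gt; alongside the path. Cargo uses the local path during development and the version requirement once the crate is published:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
my_local_crate = { path = &amp;quot;../my_local_crate&amp;quot;, version = &amp;quot;0.1&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;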
&lt;h3&gt;Dev-Only Dependencies&lt;/h3&gt;
&lt;p&gt;Dependencies in the &lt;code&gt;[dev-dependencies]&lt;/code&gt; section are only used for development (e.g., testing frameworks) and will not be included in the final build. This helps keep production builds smaller and faster.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dev-dependencies]
rand = &amp;quot;0.8&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Optional Dependencies&lt;/h3&gt;
&lt;p&gt;Optional dependencies are declared with &lt;code&gt;optional = true&lt;/code&gt; and are only compiled when a feature listed in &lt;code&gt;[features]&lt;/code&gt; pulls them in. This lets users activate extra functionality on demand, reducing bloat.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde_json = { version = &amp;quot;1.0&amp;quot;, optional = true }

[features]
default = []
json_support = [&amp;quot;serde_json&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, you can enable &lt;code&gt;json_support&lt;/code&gt; by using &lt;code&gt;cargo build --features &amp;quot;json_support&amp;quot;&lt;/code&gt;, adding the functionality only when needed.&lt;/p&gt;
&lt;h3&gt;Example of a Complete &lt;code&gt;[dependencies]&lt;/code&gt; Section&lt;/h3&gt;
&lt;p&gt;Here’s a &lt;code&gt;[dependencies]&lt;/code&gt; section showcasing different types of dependencies:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = &amp;quot;1.0&amp;quot;  # Standard dependency
rand = { version = &amp;quot;0.8&amp;quot;, features = [&amp;quot;small_rng&amp;quot;] }  # Dependency with features
my_crate = { git = &amp;quot;https://github.com/user/my_crate.git&amp;quot;, branch = &amp;quot;main&amp;quot; }  # Git dependency
serde_json = { version = &amp;quot;1.0&amp;quot;, optional = true }  # Optional dependency

[dev-dependencies]
mockito = &amp;quot;0.29&amp;quot;  # Dev-only dependency

[features]
default = []
json_support = [&amp;quot;serde_json&amp;quot;]  # Feature for optional dependency
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup provides flexibility for managing dependencies based on your project’s requirements. By organizing dependencies in this way, you gain control over your project’s footprint, allowing for efficient, maintainable, and optimized builds.&lt;/p&gt;
&lt;h2&gt;Using Features for Conditional Compilation&lt;/h2&gt;
&lt;p&gt;Features in &lt;code&gt;cargo.toml&lt;/code&gt; allow you to enable or disable certain functionalities within your project based on conditional dependencies. This is particularly useful when you want to offer optional components or modularize your code for different use cases. By using feature flags, you can control which parts of your codebase get compiled, helping to keep the build lightweight and efficient.&lt;/p&gt;
&lt;h3&gt;Defining Features in &lt;code&gt;cargo.toml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;To define features, add them under the &lt;code&gt;[features]&lt;/code&gt; section in &lt;code&gt;cargo.toml&lt;/code&gt;. Each feature is a list of dependencies or other features that should be enabled when the feature itself is activated.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[features]
default = [&amp;quot;json_support&amp;quot;]  # Sets `json_support` as the default feature
json_support = [&amp;quot;serde&amp;quot;, &amp;quot;serde_json&amp;quot;]  # Enables Serde and Serde JSON support
async = [&amp;quot;tokio&amp;quot;]  # Adds async functionality with Tokio
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The default feature includes &lt;code&gt;json_support&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;json_support&lt;/code&gt; feature enables both &lt;code&gt;serde&lt;/code&gt; and &lt;code&gt;serde_json&lt;/code&gt; libraries.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;async&lt;/code&gt; feature brings in tokio for asynchronous programming.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Enabling Features at Build Time&lt;/h3&gt;
&lt;p&gt;To compile with a specific feature, use the &lt;code&gt;--features&lt;/code&gt; flag when running Cargo commands, like &lt;code&gt;cargo build&lt;/code&gt;. For example, to enable the &lt;code&gt;async&lt;/code&gt; feature, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --features &amp;quot;async&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your default feature is defined, it will be activated by default unless you specify &lt;code&gt;--no-default-features&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --no-default-features --features &amp;quot;async&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Using Feature Flags in Code&lt;/h3&gt;
&lt;p&gt;In your Rust code, you can use the &lt;code&gt;cfg&lt;/code&gt; attribute to conditionally include code based on active features. This keeps the codebase modular and lets you add or remove functionality based on build requirements.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;#[cfg(feature = &amp;quot;async&amp;quot;)]
async fn async_function() {
    // Async function logic
}

#[cfg(not(feature = &amp;quot;async&amp;quot;))]
fn async_function() {
    // Non-async fallback logic
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, &lt;code&gt;async_function&lt;/code&gt; behaves differently depending on whether the &lt;code&gt;async&lt;/code&gt; feature is enabled.&lt;/p&gt;
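&lt;p&gt;For branching inside a function body rather than at the item level, the &lt;code&gt;cfg!&lt;/code&gt; macro evaluates the same conditions to a compile-time boolean. A minimal sketch (note that, unlike &lt;code&gt;#[cfg]&lt;/code&gt;, both branches must still compile):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;fn describe_mode() -&amp;gt; &amp;amp;&apos;static str {
    // cfg! expands to true or false at compile time based on enabled features
    if cfg!(feature = &amp;quot;async&amp;quot;) {
        &amp;quot;async support enabled&amp;quot;
    } else {
        &amp;quot;async support disabled&amp;quot;
    }
}
&lt;/code&gt;&lt;/pre&gt;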
&lt;h3&gt;Combining Multiple Features&lt;/h3&gt;
&lt;p&gt;Sometimes, you might want a feature that only enables certain functionality if multiple other features are active. You can achieve this by combining features in the &lt;code&gt;[features]&lt;/code&gt; section.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[features]
default = []
full = [&amp;quot;json_support&amp;quot;, &amp;quot;async&amp;quot;]  # Combines `json_support` and `async`
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this configuration, enabling the &lt;code&gt;full&lt;/code&gt; feature will activate both &lt;code&gt;json_support&lt;/code&gt; and &lt;code&gt;async&lt;/code&gt; simultaneously.&lt;/p&gt;
&lt;h3&gt;Practical Example of Feature Flags&lt;/h3&gt;
&lt;p&gt;Suppose you’re building a library that has JSON support and async capabilities as optional features. Here’s how your cargo.toml might look:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = { version = &amp;quot;1.0&amp;quot;, optional = true }
serde_json = { version = &amp;quot;1.0&amp;quot;, optional = true }
tokio = { version = &amp;quot;1.0&amp;quot;, optional = true }

[features]
default = []
json_support = [&amp;quot;serde&amp;quot;, &amp;quot;serde_json&amp;quot;]
async = [&amp;quot;tokio&amp;quot;]
full = [&amp;quot;json_support&amp;quot;, &amp;quot;async&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;json_support&lt;/code&gt; feature enables &lt;code&gt;serde&lt;/code&gt; and &lt;code&gt;serde_json&lt;/code&gt; for JSON handling.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;async&lt;/code&gt; feature enables &lt;code&gt;tokio&lt;/code&gt; for asynchronous programming.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;full&lt;/code&gt; feature enables both &lt;code&gt;json_support&lt;/code&gt; and &lt;code&gt;async&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To use only JSON support, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --features &amp;quot;json_support&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or to use everything with the full feature:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --features &amp;quot;full&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits of Using Features&lt;/h3&gt;
&lt;p&gt;Using feature flags in cargo.toml can make your project more flexible and modular:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduce Bloat&lt;/strong&gt;: Only compile what’s necessary for each use case.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improved Compile Times&lt;/strong&gt;: Faster compilation when unused features are disabled.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Targeted Functionality&lt;/strong&gt;: Offer a single codebase with multiple configurations, making your library or application more adaptable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With feature flags, cargo.toml enables conditional compilation that fits various project requirements and user preferences, optimizing both development and runtime performance.&lt;/p&gt;
&lt;h2&gt;Configuring Build Profiles&lt;/h2&gt;
&lt;p&gt;Cargo provides different build profiles to optimize your project based on specific needs, such as development or production. These profiles let you adjust settings like optimization levels, debug symbols, and other compiler flags. The main profiles in &lt;code&gt;cargo.toml&lt;/code&gt; are &lt;code&gt;dev&lt;/code&gt;, &lt;code&gt;release&lt;/code&gt;, and custom profiles you can define as needed.&lt;/p&gt;
&lt;h3&gt;Common Build Profiles&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;dev&lt;/code&gt;&lt;/strong&gt;: This is the default profile for development builds, which prioritizes compile speed over runtime performance. It includes debug information but does not heavily optimize the code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;release&lt;/code&gt;&lt;/strong&gt;: The release profile is optimized for performance and typically used for production builds. It enables higher levels of optimization but takes longer to compile.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Configuring Profiles in &lt;code&gt;cargo.toml&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;You can customize each profile by defining them in the &lt;code&gt;[profile.*]&lt;/code&gt; sections of &lt;code&gt;cargo.toml&lt;/code&gt;. Each profile has various settings that control the build process:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;opt-level&lt;/code&gt;&lt;/strong&gt;: Controls the optimization level, with values from 0 (no optimization) to 3 (maximum optimization).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;debug&lt;/code&gt;&lt;/strong&gt;: Controls the inclusion of debug symbols, helpful for debugging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;lto&lt;/code&gt;&lt;/strong&gt;: Enables Link-Time Optimization, which can reduce binary size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;panic&lt;/code&gt;&lt;/strong&gt;: Determines how panics are handled (&lt;code&gt;unwind&lt;/code&gt; or &lt;code&gt;abort&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Customizing the &lt;code&gt;dev&lt;/code&gt; Profile&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;dev&lt;/code&gt; profile is ideal for development, focusing on quick compile times and ease of debugging. You can also raise &lt;code&gt;opt-level&lt;/code&gt; slightly (e.g., to 1) if you want better runtime performance while testing.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.dev]
opt-level = 0  # No optimization for fast compile times
debug = true   # Include debug symbols
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, no optimization is applied to keep build times short, and debug symbols are included to aid debugging.&lt;/p&gt;
&lt;h3&gt;Customizing the release Profile&lt;/h3&gt;
&lt;p&gt;The release profile is typically used for production builds, prioritizing runtime performance through higher optimization levels. This can make your application faster and reduce binary size, but it comes with longer compile times.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.release]
opt-level = 3    # Maximum optimization for performance
debug = false    # Exclude debug symbols for smaller binary size
lto = true       # Link-Time Optimization for further size reduction
panic = &amp;quot;abort&amp;quot;  # Use `abort` to reduce binary size further
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The opt-level of 3 maximizes performance.&lt;/li&gt;
&lt;li&gt;debug is set to false to exclude debug symbols, keeping the binary smaller.&lt;/li&gt;
&lt;li&gt;lto enables Link-Time Optimization to further reduce the binary size.&lt;/li&gt;
&lt;li&gt;panic = &amp;quot;abort&amp;quot; changes the panic strategy to abort, which can further reduce binary size.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Defining Custom Profiles&lt;/h3&gt;
&lt;p&gt;Beyond &lt;code&gt;dev&lt;/code&gt; and &lt;code&gt;release&lt;/code&gt;, Cargo ships built-in &lt;code&gt;test&lt;/code&gt; and &lt;code&gt;bench&lt;/code&gt; profiles, and you can define entirely custom profiles (which must name a base profile via an &lt;code&gt;inherits&lt;/code&gt; key). For instance, the &lt;code&gt;bench&lt;/code&gt; profile can be tuned for performance testing.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.bench]
opt-level = 3
debug = false
overflow-checks = false  # Disable overflow checks for benchmarking
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This bench profile maximizes performance by disabling overflow checks and excluding debug symbols, making it suitable for benchmarking.&lt;/p&gt;
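&lt;p&gt;Since &lt;code&gt;bench&lt;/code&gt; is one of Cargo’s built-in profiles, it is picked up automatically when you run benchmarks:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo bench  # compiles with the [profile.bench] settings above
&lt;/code&gt;&lt;/pre&gt;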
&lt;h3&gt;Example of a Complete Profile Configuration&lt;/h3&gt;
&lt;p&gt;Here’s an example configuration that customizes both &lt;code&gt;dev&lt;/code&gt; and &lt;code&gt;release&lt;/code&gt; profiles while adding settings for the &lt;code&gt;bench&lt;/code&gt; profile:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[profile.dev]
opt-level = 1       # Low-level optimization for faster dev builds
debug = true        # Include debug symbols
overflow-checks = true

[profile.release]
opt-level = 3       # Max optimization for production
debug = false       # Exclude debug symbols
lto = &amp;quot;fat&amp;quot;         # Enable Link-Time Optimization
panic = &amp;quot;abort&amp;quot;     # Use abort for panics

[profile.bench]
opt-level = 3       # High optimization for benchmarks
debug = false       # Exclude debug symbols for smaller binary
overflow-checks = false  # Disable overflow checks to reduce overhead
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Choosing the Right Profile&lt;/h3&gt;
&lt;p&gt;When building, Cargo automatically selects the &lt;code&gt;dev&lt;/code&gt; profile for &lt;code&gt;cargo build&lt;/code&gt; and the &lt;code&gt;release&lt;/code&gt; profile for &lt;code&gt;cargo build --release&lt;/code&gt;. You can also select other profiles explicitly with the &lt;code&gt;--profile&lt;/code&gt; flag:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --profile bench
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits of Profile Customization&lt;/h3&gt;
&lt;p&gt;Customizing profiles in cargo.toml helps you optimize your project based on your current needs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Development Efficiency&lt;/strong&gt;: Faster builds with the dev profile keep your development loop quick.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Production Performance&lt;/strong&gt;: release profile optimizations ensure your app runs efficiently in production.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Targeted Tuning&lt;/strong&gt;: Custom profiles allow you to fine-tune settings for testing, benchmarking, or any other specialized needs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Configuring build profiles is a powerful way to control the balance between performance, debugging, and compile time, giving you a flexible workflow from development to production.&lt;/p&gt;
&lt;h2&gt;Workspace and Sub-Crate Configurations&lt;/h2&gt;
&lt;p&gt;In Rust, a workspace allows you to manage multiple related packages (or &amp;quot;crates&amp;quot;) within a single project directory, sharing common dependencies and build output. Workspaces are helpful when you want to organize large projects into smaller, modular crates that can be built, tested, and developed together. This setup is especially valuable for monorepo-style projects, where all related crates live in a single repository.&lt;/p&gt;
&lt;h3&gt;Setting Up a Workspace&lt;/h3&gt;
&lt;p&gt;To create a workspace, start by defining a &lt;code&gt;[workspace]&lt;/code&gt; section in the root &lt;code&gt;cargo.toml&lt;/code&gt; file. In this section, you’ll specify which directories contain the member crates of the workspace.&lt;/p&gt;
&lt;p&gt;For example, in the root &lt;code&gt;cargo.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace]
members = [&amp;quot;crate_a&amp;quot;, &amp;quot;crate_b&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup indicates that there are two crates within the workspace: &lt;code&gt;crate_a&lt;/code&gt; and &lt;code&gt;crate_b&lt;/code&gt;, located in directories named &lt;code&gt;crate_a&lt;/code&gt; and &lt;code&gt;crate_b&lt;/code&gt; within the project root.&lt;/p&gt;
&lt;h3&gt;Creating Sub-Crates&lt;/h3&gt;
&lt;p&gt;Each member of the workspace (sub-crate) needs its own &lt;code&gt;cargo.toml&lt;/code&gt; file, where you define the specific dependencies and settings for that crate. Each crate in a workspace functions as an independent Rust package but shares common build output and dependencies with the other workspace members.&lt;/p&gt;
&lt;p&gt;For example, the &lt;code&gt;cargo.toml&lt;/code&gt; for &lt;code&gt;crate_a&lt;/code&gt; might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;crate_a&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
edition = &amp;quot;2021&amp;quot;

[dependencies]
serde = &amp;quot;1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And &lt;code&gt;crate_b&lt;/code&gt;’s &lt;code&gt;cargo.toml&lt;/code&gt; could be:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;crate_b&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
edition = &amp;quot;2021&amp;quot;

[dependencies]
rand = &amp;quot;0.8&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Sharing Dependencies Across Crates&lt;/h3&gt;
&lt;p&gt;One of the advantages of a workspace is that it lets crates share dependency versions, reducing duplication and ensuring consistency. You can declare shared dependencies once in the root &lt;code&gt;Cargo.toml&lt;/code&gt; under &lt;code&gt;[workspace.dependencies]&lt;/code&gt;, so each version is pinned in a single place rather than repeated in every sub-crate.&lt;/p&gt;
&lt;p&gt;For example, you can add a shared dependency like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace.dependencies]
serde = &amp;quot;1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now any workspace member can use &lt;code&gt;serde&lt;/code&gt; by opting in with &lt;code&gt;serde = { workspace = true }&lt;/code&gt; in its own &lt;code&gt;[dependencies]&lt;/code&gt; section; the version is inherited from the root, so it never has to be repeated.&lt;/p&gt;
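&lt;p&gt;Each member then opts in to the shared entry by setting &lt;code&gt;workspace = true&lt;/code&gt; in its own &lt;code&gt;Cargo.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = { workspace = true }
&lt;/code&gt;&lt;/pre&gt;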
&lt;h3&gt;Inter-Crate Dependencies&lt;/h3&gt;
&lt;p&gt;In many cases, one crate in a workspace will depend on another crate in the same workspace. To specify such a dependency, reference the other crate by name in the cargo.toml file, and Cargo will understand that it refers to a member of the workspace.&lt;/p&gt;
&lt;p&gt;For example, if &lt;code&gt;crate_b&lt;/code&gt; depends on &lt;code&gt;crate_a&lt;/code&gt;, you would add this to &lt;code&gt;crate_b&lt;/code&gt;&apos;s &lt;code&gt;cargo.toml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
crate_a = { path = &amp;quot;../crate_a&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Cargo will recognize &lt;code&gt;crate_a&lt;/code&gt; as part of the workspace and handle the dependency locally.&lt;/p&gt;
&lt;h3&gt;Managing Workspace Configuration&lt;/h3&gt;
&lt;p&gt;You can also set configurations that apply across the whole workspace, such as build profiles, in the root &lt;code&gt;Cargo.toml&lt;/code&gt;. In a workspace, Cargo only honors profile settings from the root manifest, so this is the place to configure builds for all members.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace]
members = [&amp;quot;crate_a&amp;quot;, &amp;quot;crate_b&amp;quot;]

[profile.release]
opt-level = 3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, all crates in the workspace will be compiled with optimization level 3 for release builds, improving runtime performance.&lt;/p&gt;
&lt;h3&gt;Example Project Structure&lt;/h3&gt;
&lt;p&gt;Here’s how a workspace project might look in your file system:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;my_workspace/
├── Cargo.toml           # Root workspace configuration
├── crate_a/
│   └── Cargo.toml       # crate_a configuration
├── crate_b/
│   └── Cargo.toml       # crate_b configuration
└── target/              # Shared build output directory
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this structure, all build output will be stored in a single target/ directory, reducing redundancy and speeding up compilation when multiple crates share dependencies.&lt;/p&gt;
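&lt;p&gt;From the workspace root, Cargo commands can target the whole workspace or a single member:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo build --workspace   # build every member crate
cargo build -p crate_a    # build only crate_a
cargo test -p crate_b     # run only crate_b&apos;s tests
&lt;/code&gt;&lt;/pre&gt;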
&lt;h3&gt;Benefits of Using Workspaces&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dependency Management&lt;/strong&gt;: Avoid duplicating dependencies by sharing them across crates.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build Efficiency&lt;/strong&gt;: Workspace members share a single target/ directory, reducing compilation time and storage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Modularity&lt;/strong&gt;: Break down complex projects into modular crates that can be developed and tested independently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control&lt;/strong&gt;: Simplifies managing versioning within related packages, especially useful for large projects.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By setting up a workspace, you can streamline your project structure, reduce duplication, and make your Rust project more modular and scalable, all while keeping related packages tightly integrated.&lt;/p&gt;
&lt;h2&gt;Advanced Configuration Options&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;Cargo.toml&lt;/code&gt; file provides several advanced options that let you further customize and fine-tune your Rust project. These configurations are useful for handling edge cases, managing dependencies in complex projects, and adding metadata to your package. Let’s explore some of these advanced options.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[patch]&lt;/code&gt;: Overriding Dependencies&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;[patch]&lt;/code&gt; section allows you to override dependencies across your project. This is helpful if you need to fix a bug in an external crate or use a custom version of a dependency without waiting for an official release. By specifying &lt;code&gt;[patch]&lt;/code&gt;, you can tell Cargo to use a different source for a specific dependency across the entire workspace.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[patch.crates-io]
serde = { git = &amp;quot;https://github.com/your-fork/serde.git&amp;quot;, branch = &amp;quot;fix-branch&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, all references to serde in the project will use the specified Git repository instead of crates.io.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[replace]&lt;/code&gt;: Replacing Dependencies&lt;/h3&gt;
&lt;p&gt;Similar to &lt;code&gt;[patch]&lt;/code&gt;, the &lt;code&gt;[replace]&lt;/code&gt; section lets you swap out a specific version of a dependency. However, it is more restrictive and has been deprecated in favor of &lt;code&gt;[patch]&lt;/code&gt;; it should be used cautiously, if at all, because it can lead to version conflicts.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[replace]
&amp;quot;rand:0.8.3&amp;quot; = { path = &amp;quot;local_path_to_rand&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the &lt;code&gt;rand&lt;/code&gt; version 0.8.3 dependency is replaced by a local path, allowing you to work with a local copy.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[build-dependencies]&lt;/code&gt;: Dependencies for Build Scripts&lt;/h3&gt;
&lt;p&gt;Sometimes, a Rust project needs a custom build script (e.g., build.rs) to generate or process files before compilation. The &lt;code&gt;[build-dependencies]&lt;/code&gt; section is used to specify dependencies required only by the build script, avoiding unnecessary dependencies in the final build.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[build-dependencies]
cc = &amp;quot;1.0&amp;quot;  # Compiler tool for building C dependencies
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the &lt;code&gt;cc&lt;/code&gt; crate is available only to the &lt;code&gt;build.rs&lt;/code&gt; script, allowing you to compile native code or other build-specific tasks.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[badges]&lt;/code&gt;: Adding Metadata for Continuous Integration (CI)&lt;/h3&gt;
&lt;p&gt;Badges provide a way to display status information, such as build status, for your project. The &lt;code&gt;[badges]&lt;/code&gt; section lets you define these directly in &lt;code&gt;Cargo.toml&lt;/code&gt;. Note that crates.io has since deprecated this section and no longer displays badges from it; the common practice today is to add badges directly to your README.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[badges]
travis-ci = { repository = &amp;quot;user/my_project&amp;quot; }
github-actions = { repository = &amp;quot;user/my_project&amp;quot;, branch = &amp;quot;main&amp;quot;, workflow = &amp;quot;CI&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, badges for Travis CI and GitHub Actions are configured, displaying their status on platforms that support badges.&lt;/p&gt;
&lt;h3&gt;&lt;code&gt;[package.metadata]&lt;/code&gt;: Custom Metadata&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;[package.metadata]&lt;/code&gt; section allows you to add custom fields that are not processed by Cargo itself but can be used by external tools. This is useful for plugins or scripts that require information beyond the default Cargo configuration.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package.metadata]
documentation_url = &amp;quot;https://docs.rs/my_project&amp;quot;
custom_key = &amp;quot;custom_value&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;External tools can read these values to provide custom functionality for your project.&lt;/p&gt;
&lt;h3&gt;Defining &lt;code&gt;build.rs&lt;/code&gt; Scripts&lt;/h3&gt;
&lt;p&gt;If your project requires dynamic configuration, you can create a &lt;code&gt;build.rs&lt;/code&gt; file, which Cargo automatically compiles and runs before building your project. The &lt;code&gt;build.rs&lt;/code&gt; file can generate code, compile additional resources, or link native libraries. In &lt;code&gt;Cargo.toml&lt;/code&gt;, dependencies for this script should be listed under &lt;code&gt;[build-dependencies]&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Example &lt;code&gt;build.rs&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-rust&quot;&gt;fn main() {
    println!(&amp;quot;cargo:rustc-link-lib=static=foo&amp;quot;);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example tells Cargo to link a static library named &lt;code&gt;foo&lt;/code&gt; into your project. The build script can also read environment variables that Cargo sets (such as the target platform), allowing your build process to adapt to different platforms.&lt;/p&gt;
&lt;h3&gt;Using &lt;code&gt;[workspace.dependencies]&lt;/code&gt; for Shared Dependencies&lt;/h3&gt;
&lt;p&gt;In a workspace, you may want all crates to use the same version of a shared dependency. You can declare such dependencies in the &lt;code&gt;[workspace.dependencies]&lt;/code&gt; section of the root manifest (supported since Cargo 1.64), and workspace members can then inherit them.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace.dependencies]
serde = &amp;quot;1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setting simplifies dependency management across a workspace and ensures that each crate uses the same version of &lt;code&gt;serde&lt;/code&gt;, helping to avoid conflicts and maintain consistency.&lt;/p&gt;
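&lt;p&gt;Each member crate then opts in to the shared entry with &lt;code&gt;workspace = true&lt;/code&gt; (this inheritance requires Cargo 1.64 or newer). A member-side sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;# crate_a/Cargo.toml
[dependencies]
serde = { workspace = true }
&lt;/code&gt;&lt;/pre&gt;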
&lt;h3&gt;Example of Advanced Cargo.toml Configuration&lt;/h3&gt;
&lt;p&gt;Here’s an example that brings together some of these advanced options:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[package]
name = &amp;quot;my_project&amp;quot;
version = &amp;quot;0.1.0&amp;quot;
edition = &amp;quot;2021&amp;quot;

[dependencies]
serde = &amp;quot;1.0&amp;quot;

[build-dependencies]
cc = &amp;quot;1.0&amp;quot;

[patch.crates-io]
serde = { git = &amp;quot;https://github.com/your-fork/serde.git&amp;quot;, branch = &amp;quot;fix-branch&amp;quot; }

[badges]
github-actions = { repository = &amp;quot;user/my_project&amp;quot;, branch = &amp;quot;main&amp;quot;, workflow = &amp;quot;CI&amp;quot; }

[package.metadata]
custom_field = &amp;quot;This is a custom metadata field&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Benefits of Using Advanced Configurations&lt;/h3&gt;
&lt;p&gt;These advanced configuration options provide a wide range of tools to tailor &lt;code&gt;Cargo.toml&lt;/code&gt; to your project’s specific requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dependency Control&lt;/strong&gt;: Patch or replace dependencies to use the exact version or source you need.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Build Flexibility&lt;/strong&gt;: Add custom build scripts or compile native dependencies with &lt;code&gt;[build-dependencies]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhanced Documentation&lt;/strong&gt;: Use badges to make the project status visible on supported platforms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom Metadata&lt;/strong&gt;: Store additional project-specific information for tools or scripts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With these configurations, &lt;code&gt;Cargo.toml&lt;/code&gt; becomes a powerful and flexible tool for managing Rust projects, accommodating both simple setups and complex requirements.&lt;/p&gt;
&lt;h2&gt;Troubleshooting and Best Practices&lt;/h2&gt;
&lt;p&gt;Working with &lt;code&gt;Cargo.toml&lt;/code&gt; can be straightforward, but as your project grows, you might encounter common issues. Here are some troubleshooting tips and best practices to help you manage your &lt;code&gt;Cargo.toml&lt;/code&gt; effectively.&lt;/p&gt;
&lt;h3&gt;Common Errors and Solutions&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dependency Version Conflicts&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When multiple crates depend on different versions of the same dependency, Cargo may not be able to resolve the conflict, leading to a build failure.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Review and align the version requirements across crates where possible. To pin an exact version, use an equality requirement in &lt;code&gt;[dependencies]&lt;/code&gt;. Note that &lt;code&gt;[patch]&lt;/code&gt; entries must point to an alternate source such as a &lt;code&gt;git&lt;/code&gt; or &lt;code&gt;path&lt;/code&gt; dependency; a bare version string is not valid there.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = &amp;quot;=1.0.104&amp;quot;  # pin an exact version
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Missing or Unsupported Features&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you attempt to enable a feature that doesn’t exist or isn’t compatible with a dependency, Cargo will return an error.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Double-check the available features for each dependency in the documentation. Ensure that you’re spelling the feature name correctly and that it’s supported in the specified version.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Invalid cargo.toml Syntax&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Sometimes, simple syntax errors in &lt;code&gt;Cargo.toml&lt;/code&gt;, like a missing bracket or quote, can cause parsing issues.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Carefully check your syntax, especially after making edits. Cargo reports the offending line and column when it fails to parse the manifest, which usually pinpoints the problem; a TOML-aware editor extension can also catch issues early.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Feature Flag Conflicts&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Occasionally, enabling multiple features that depend on conflicting dependencies or configurations can lead to errors.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Use Cargo’s conditional compilation to define feature flags carefully. Make sure dependencies don’t conflict, and test combinations of features if your project has multiple optional features.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Circular Dependencies&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Circular dependencies can happen if crates in a workspace depend on each other in a loop.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Reevaluate the dependency structure of your crates. Consider refactoring shared code into a separate crate that both depend on, rather than forming a circular chain.&lt;/p&gt;
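&lt;p&gt;For example, if &lt;code&gt;crate_a&lt;/code&gt; and &lt;code&gt;crate_b&lt;/code&gt; each need the other&apos;s helpers, the shared code can move into a hypothetical &lt;code&gt;crate_common&lt;/code&gt; that both depend on, so all dependency edges point one way:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;# In both crate_a/Cargo.toml and crate_b/Cargo.toml
[dependencies]
crate_common = { path = &amp;quot;../crate_common&amp;quot; }
&lt;/code&gt;&lt;/pre&gt;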
&lt;h3&gt;Best Practices for Managing Cargo.toml&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Use Semantic Versioning Thoughtfully&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When specifying dependency versions, follow semantic versioning principles. A plain requirement like &lt;code&gt;&amp;quot;1.2.3&amp;quot;&lt;/code&gt; is shorthand for the caret requirement &lt;code&gt;&amp;quot;^1.2.3&amp;quot;&lt;/code&gt;, which allows any semver-compatible update (up to, but not including, &lt;code&gt;2.0.0&lt;/code&gt;). Use a tilde requirement like &lt;code&gt;&amp;quot;~1.2.3&amp;quot;&lt;/code&gt; to allow only patch updates, or an exact requirement like &lt;code&gt;&amp;quot;=1.2.3&amp;quot;&lt;/code&gt; when you need to pin a version.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Leverage Workspaces for Large Projects&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you have a large project with multiple related components, consider organizing it into a workspace. This allows you to manage dependencies centrally, share a build directory, and simplify testing across modules.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Define Meaningful Features&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Use features to modularize your project and enable or disable components based on project needs. Avoid adding too many features that create complex interdependencies, as this can complicate both code and dependency management.&lt;/p&gt;
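&lt;p&gt;As a sketch, a manifest might gate optional JSON support behind a feature flag (the feature name here is illustrative; the &lt;code&gt;dep:&lt;/code&gt; syntax requires Cargo 1.60 or newer):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[features]
default = []
json = [&amp;quot;dep:serde_json&amp;quot;]

[dependencies]
serde_json = { version = &amp;quot;1.0&amp;quot;, optional = true }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Users then enable it explicitly with &lt;code&gt;cargo build --features json&lt;/code&gt;.&lt;/p&gt;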
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Group Dependencies by Purpose&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Organize dependencies based on their purpose, such as &lt;code&gt;[dependencies]&lt;/code&gt; for core libraries, &lt;code&gt;[dev-dependencies]&lt;/code&gt; for testing tools, and &lt;code&gt;[build-dependencies]&lt;/code&gt; for build scripts. This structure helps keep your project organized and reduces unnecessary bloat in production builds.&lt;/p&gt;
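&lt;p&gt;A minimal sketch of this grouping (the crate choices are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[dependencies]
serde = &amp;quot;1.0&amp;quot;        # core library, shipped in production builds

[dev-dependencies]
criterion = &amp;quot;0.5&amp;quot;     # benchmarks and tests only

[build-dependencies]
cc = &amp;quot;1.0&amp;quot;            # used only by build.rs
&lt;/code&gt;&lt;/pre&gt;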
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Keep &lt;code&gt;cargo.toml&lt;/code&gt; Clean and Well-Documented&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Use comments to explain any non-standard configurations or complex dependency requirements. This makes it easier for other contributors to understand your &lt;code&gt;cargo.toml&lt;/code&gt; file and for you to maintain it over time.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;# This dependency is only needed for JSON support
serde_json = { version = &amp;quot;1.0&amp;quot;, optional = true }
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;6&quot;&gt;
&lt;li&gt;&lt;strong&gt;Use &lt;code&gt;[workspace.dependencies]&lt;/code&gt; for Consistency&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In workspaces, declare shared dependencies in &lt;code&gt;[workspace.dependencies]&lt;/code&gt; to ensure all crates use the same version. This reduces version conflicts and keeps dependency management consistent across crates.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[workspace.dependencies]
serde = &amp;quot;1.0&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;7&quot;&gt;
&lt;li&gt;&lt;strong&gt;Regularly Update Dependencies&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Rust’s ecosystem evolves quickly, and keeping dependencies up to date ensures you benefit from the latest features, bug fixes, and performance improvements. Run &lt;code&gt;cargo update&lt;/code&gt; to update your &lt;code&gt;Cargo.lock&lt;/code&gt; file to the latest versions allowed by your version requirements.&lt;/p&gt;
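&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo update            # update all dependencies within their version requirements
cargo update -p serde   # update only the serde entry in Cargo.lock
&lt;/code&gt;&lt;/pre&gt;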
&lt;ol start=&quot;8&quot;&gt;
&lt;li&gt;&lt;strong&gt;Automate Testing Across Configurations&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If your project uses multiple features, test the relevant feature combinations to ensure compatibility. You can set up continuous integration (CI) workflows to automate this process, making sure your code works across all supported configurations.&lt;/p&gt;
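&lt;p&gt;A CI job might exercise the main configurations like this (the &lt;code&gt;json&lt;/code&gt; feature name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cargo test --no-default-features   # baseline with no optional features
cargo test --all-features          # every feature enabled
cargo test --features json         # a specific combination
&lt;/code&gt;&lt;/pre&gt;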
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Managing dependencies and configurations with &lt;code&gt;Cargo.toml&lt;/code&gt; is a powerful way to structure your Rust projects. By following best practices and knowing how to troubleshoot common issues, you can maintain a clean, efficient, and resilient setup. Taking time to organize your &lt;code&gt;Cargo.toml&lt;/code&gt; file thoughtfully will pay off as your project grows, making it easier to manage and scale in the long run.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Leveraging Python&apos;s Pattern Matching and Comprehensions for Data Analytics</title><link>https://iceberglakehouse.com/posts/2024-11-Python-Analytics-Pattern-Matching/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-11-Python-Analytics-Pattern-Matching/</guid><description>
- [Blog: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-f...</description><pubDate>Fri, 01 Nov 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Python stands out as a powerful and versatile tool. Known for its simplicity and readability, Python provides an array of built-in features that make it an ideal language for data manipulation, analysis, and visualization. Among these features, two capabilities—pattern matching and comprehensions—offer significant advantages for transforming and structuring data efficiently.&lt;/p&gt;
&lt;p&gt;Pattern matching, introduced in Python 3.10, allows for more intuitive and readable conditional logic by enabling the matching of complex data structures with minimal code. This feature is particularly useful in data analytics when dealing with diverse data formats, nested structures, or when applying multiple conditional transformations. On the other hand, comprehensions (list, set, and dictionary comprehensions) allow for concise, readable expressions that can filter, transform, and aggregate data on the fly, making repetitive data tasks faster and less error-prone.&lt;/p&gt;
&lt;p&gt;Let&apos;s explore how these two features can help data analysts and engineers write cleaner, faster, and more readable code. We’ll dive into practical examples of how pattern matching and comprehensions can be applied to streamline data processing, showing how they simplify complex tasks and optimize data workflows. By the end, you&apos;ll have a clearer understanding of how these Python features can enhance your data analytics toolkit.&lt;/p&gt;
&lt;h2&gt;Understanding Pattern Matching in Python&lt;/h2&gt;
&lt;p&gt;Pattern matching, introduced with the &lt;code&gt;match&lt;/code&gt; and &lt;code&gt;case&lt;/code&gt; syntax in Python 3.10 (PEP 634), enables cleaner and more readable conditional logic, particularly when handling complex data structures. Unlike traditional &lt;code&gt;if-else&lt;/code&gt; chains, pattern matching lets you define specific patterns that Python will match against, simplifying code that deals with various data formats and nested structures.&lt;/p&gt;
&lt;p&gt;With pattern matching, data analysts can write expressive code to handle different data transformations and formats with minimal boilerplate. For instance, when working with datasets that contain multiple types of values—like dictionaries, nested lists, or JSON objects—pattern matching can help categorize, transform, or validate data based on structure and content.&lt;/p&gt;
&lt;h3&gt;Pattern Matching Use Cases in Data Analytics&lt;/h3&gt;
&lt;p&gt;Here are a few ways pattern matching can benefit data analytics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Transformation&lt;/strong&gt;: In data workflows, datasets often contain mixed or nested data types. Pattern matching can identify specific structures within a dataset and apply transformations based on those structures, simplifying tasks like type conversions or string manipulations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Handling Nested Data&lt;/strong&gt;: JSON files and nested dictionaries are common in data analytics. Pattern matching enables intuitive unpacking and restructuring of these nested formats, making it easier to extract insights from deeply nested data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Type Checking and Filtering&lt;/strong&gt;: When cleaning data, it’s essential to handle various data types accurately. Pattern matching can be used to check for certain types (e.g., &lt;code&gt;str&lt;/code&gt;, &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;list&lt;/code&gt;) within a dataset, making it easy to filter out unwanted types or process each type differently for validation and transformation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Practical Applications of Pattern Matching&lt;/h2&gt;
&lt;p&gt;Pattern matching is not only a powerful concept but also extremely practical in real-world data analytics workflows. By matching specific data structures and patterns, it allows analysts to write concise code for tasks like cleaning, categorizing, and transforming data. Let’s explore a few common applications where pattern matching can simplify data processing.&lt;/p&gt;
&lt;h3&gt;Example 1: Data Cleaning with Pattern Matching&lt;/h3&gt;
&lt;p&gt;One of the first steps in any data analytics project is data cleaning. This often involves handling missing values, type mismatches, and incorrect formats. Using pattern matching, you can match specific patterns in your dataset to clean or transform the data accordingly.&lt;/p&gt;
&lt;p&gt;For example, let’s say you have a dataset where certain entries may contain &lt;code&gt;None&lt;/code&gt; values, incorrect date formats, or unexpected data types. Pattern matching enables you to handle each case concisely:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def clean_entry(entry):
    match entry:
        case None:
            return &amp;quot;Missing&amp;quot;
        case str(date) if date.isdigit() and len(date) == 6:
            return f&amp;quot;20{date[:2]}-{date[2:4]}-{date[4:]}&amp;quot;  # Convert YYMMDD to YYYY-MM-DD
        case int(value):
            return float(value)  # Convert integers to floats
        case _:
            return entry  # Keep other cases as-is
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, pattern matching simplifies handling different data cases in a single function, reducing the need for multiple if-elif checks.&lt;/p&gt;
&lt;h3&gt;Example 2: Categorizing Data&lt;/h3&gt;
&lt;p&gt;Another useful application of pattern matching is in data categorization. Suppose you have a dataset where each record has a set of attributes that can help classify the data into categories, such as product type, risk level, or customer segment. Pattern matching allows you to classify records based on attribute patterns easily.&lt;/p&gt;
&lt;p&gt;For instance, if you want to categorize customer data based on their spending patterns, you could use pattern matching to define these categories:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def categorize_customer(spending):
    match spending:
        case {&amp;quot;amount&amp;quot;: amount} if amount &amp;gt; 1000:
            return &amp;quot;High spender&amp;quot;
        case {&amp;quot;amount&amp;quot;: amount} if 500 &amp;lt; amount &amp;lt;= 1000:
            return &amp;quot;Medium spender&amp;quot;
        case {&amp;quot;amount&amp;quot;: amount} if amount &amp;lt;= 500:
            return &amp;quot;Low spender&amp;quot;
        case _:
            return &amp;quot;Unknown category&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This approach lets you apply rules-based categorization quickly, making your code more modular and readable.&lt;/p&gt;
&lt;h3&gt;Example 3: Mapping JSON to DataFrames&lt;/h3&gt;
&lt;p&gt;JSON data, often nested and hierarchical, can be challenging to work with directly. Pattern matching makes it easy to traverse and reshape JSON structures, allowing for direct mapping of data into pandas DataFrames. Consider the following example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pandas as pd

def json_to_dataframe(json_data):
    rows = []
    for entry in json_data:
        match entry:
            case {&amp;quot;id&amp;quot;: id, &amp;quot;attributes&amp;quot;: {&amp;quot;name&amp;quot;: name, &amp;quot;value&amp;quot;: value}}:
                rows.append({&amp;quot;ID&amp;quot;: id, &amp;quot;Name&amp;quot;: name, &amp;quot;Value&amp;quot;: value})
            case {&amp;quot;id&amp;quot;: id, &amp;quot;name&amp;quot;: name}:
                rows.append({&amp;quot;ID&amp;quot;: id, &amp;quot;Name&amp;quot;: name, &amp;quot;Value&amp;quot;: None})
            case _:
                pass  # Ignore entries that don&apos;t match any pattern
    return pd.DataFrame(rows)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function processes JSON entries according to specific patterns and then converts them into a structured DataFrame. Pattern matching ensures only relevant data is extracted, saving time on manual transformations.&lt;/p&gt;
&lt;p&gt;In these examples, pattern matching streamlines data cleaning, categorization, and transformation tasks, making it a valuable tool for any data analyst or engineer. In the next section, we’ll explore comprehensions and how they can further simplify data manipulation tasks.&lt;/p&gt;
&lt;h2&gt;Using List, Set, and Dictionary Comprehensions&lt;/h2&gt;
&lt;p&gt;Comprehensions are one of Python’s most powerful features, allowing for concise, readable expressions that streamline data processing tasks. List, set, and dictionary comprehensions enable analysts to quickly filter, transform, and aggregate data, all within a single line of code. When dealing with large datasets or repetitive transformations, comprehensions can significantly reduce the amount of code you write, making it easier to read and maintain.&lt;/p&gt;
&lt;h3&gt;Use Cases of Comprehensions in Data Analytics&lt;/h3&gt;
&lt;p&gt;Below are some common applications of comprehensions that can greatly enhance your data manipulation workflows.&lt;/p&gt;
&lt;h3&gt;Data Filtering&lt;/h3&gt;
&lt;p&gt;Data filtering is a common task in analytics, especially when removing outliers or isolating records that meet specific criteria. List comprehensions offer a simple way to filter data efficiently. Suppose you have a list of transaction amounts and want to isolate transactions over $500:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;transactions = [100, 250, 600, 1200, 300]
high_value_transactions = [t for t in transactions if t &amp;gt; 500]
# Output: [600, 1200]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This one-liner achieves in a single step what would require several lines of code with a traditional loop. Comprehensions make it easy to quickly filter data without adding much complexity.&lt;/p&gt;
&lt;h3&gt;Data Transformation&lt;/h3&gt;
&lt;p&gt;Transforming data, such as changing formats or applying functions to each element, is another common need. Let’s say you have a list of prices in USD and want to convert them to euros at a rate of 1 USD = 0.85 EUR. List comprehensions allow you to apply the conversion effortlessly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;prices_usd = [100, 200, 300]
prices_eur = [price * 0.85 for price in prices_usd]
# Output: [85.0, 170.0, 255.0]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This method is not only concise but also efficient, making it ideal for quick transformations across entire datasets.&lt;/p&gt;
&lt;h3&gt;Dictionary Aggregations&lt;/h3&gt;
&lt;p&gt;Comprehensions are also highly effective for aggregating data into dictionaries, which can be helpful for categorizing data or creating quick summaries. For instance, suppose you have a list of tuples containing product names and their sales. You could use a dictionary comprehension to aggregate these into a dictionary format:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;sales_data = [(&amp;quot;Product A&amp;quot;, 30), (&amp;quot;Product B&amp;quot;, 45), (&amp;quot;Product A&amp;quot;, 25)]
sales_summary = {product: sum(sale for p, sale in sales_data if p == product) for product, _ in sales_data}
# Output: {&apos;Product A&apos;: 55, &apos;Product B&apos;: 45}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This comprehension aggregates sales by product in a single expression. Note that the inner &lt;code&gt;sum()&lt;/code&gt; rescans the full list once per product, so for large datasets a single explicit loop (or &lt;code&gt;collections.Counter&lt;/code&gt;) is more efficient.&lt;/p&gt;
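&lt;p&gt;For larger datasets, the same summary can be built in a single pass with &lt;code&gt;collections.Counter&lt;/code&gt;:&lt;/p&gt;

```python
from collections import Counter

sales_data = [("Product A", 30), ("Product B", 45), ("Product A", 25)]

# Accumulate totals in one pass over the data
sales_summary = Counter()
for product, sale in sales_data:
    sales_summary[product] += sale

print(dict(sales_summary))  # {'Product A': 55, 'Product B': 45}
```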
&lt;h3&gt;Set Comprehensions for Unique Values&lt;/h3&gt;
&lt;p&gt;If you need to extract unique values from a dataset, set comprehensions provide a quick and clean solution. Imagine you have a dataset with duplicate entries and want a list of unique customer IDs:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;customer_ids = [101, 102, 103, 101, 104, 102]
unique_ids = {id for id in customer_ids}
# Output: {101, 102, 103, 104}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This set comprehension removes duplicates automatically, ensuring that each ID appears only once in the output. For a straight copy like this, &lt;code&gt;set(customer_ids)&lt;/code&gt; is equivalent; set comprehensions become more useful when you also transform or filter the values.&lt;/p&gt;
&lt;h3&gt;Nested Comprehensions for Complex Transformations&lt;/h3&gt;
&lt;p&gt;In some cases, datasets may contain nested structures that require multiple levels of transformation. Nested comprehensions enable you to flatten these structures or apply transformations at each level. For instance, if you have a list of lists representing survey responses and want to normalize the data, you could use nested comprehensions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;responses = [[5, 4, 3], [3, 5, 4], [4, 4, 5]]
normalized_responses = [[score / 5 for score in response] for response in responses]
# Output: [[1.0, 0.8, 0.6], [0.6, 1.0, 0.8], [0.8, 0.8, 1.0]]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example applies a transformation to each individual score within the nested lists, enabling a consistent normalization across all responses.&lt;/p&gt;
&lt;p&gt;Comprehensions are powerful tools in any data analyst&apos;s toolkit, providing a quick way to handle repetitive data transformations, filter data, and create summary statistics. In the next section, we’ll explore how to combine pattern matching and comprehensions for even more effective data manipulation workflows.&lt;/p&gt;
&lt;h2&gt;Advanced Examples Combining Pattern Matching and Comprehensions&lt;/h2&gt;
&lt;p&gt;When used together, pattern matching and comprehensions enable even more powerful data manipulation workflows, allowing you to handle complex transformations, analyze nested data structures, and apply conditional logic in a concise, readable way. In this section, we’ll explore some advanced examples that showcase the synergy between these two features.&lt;/p&gt;
&lt;h3&gt;Complex Data Transformations&lt;/h3&gt;
&lt;p&gt;Suppose you have a dataset with different types of records, and you want to perform different transformations based on each record type. By combining pattern matching and comprehensions, you can efficiently categorize and transform each entry in one step.&lt;/p&gt;
&lt;p&gt;For instance, imagine a dataset of mixed records where each entry can be a number, a list of numbers, or a dictionary with numerical values. Using conditional expressions inside a comprehension, you can process this dataset in a single expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;data = [5, [2, 3, 4], {&amp;quot;value&amp;quot;: 10}, 8, {&amp;quot;value&amp;quot;: 7}, [6, 9]]
transformed_data = [
    value * 2 if isinstance(value, int)
    else [x * 2 for x in value] if isinstance(value, list)
    else value[&amp;quot;value&amp;quot;] * 2 if isinstance(value, dict)
    else value
    for value in data
]
# Output: [10, [4, 6, 8], 20, 16, 14, [12, 18]]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, each type of entry is handled differently using conditional expressions and comprehensions, allowing you to transform mixed data types cleanly.&lt;/p&gt;
&lt;h3&gt;Nested Data Manipulation&lt;/h3&gt;
&lt;p&gt;When dealing with deeply nested data structures like JSON files, combining pattern matching and nested comprehensions can simplify data extraction and transformation. Imagine a dataset where each entry is a nested dictionary containing information about users, including their hobbies. You want to extract and flatten these hobbies for analysis.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;users = [
    {&amp;quot;id&amp;quot;: 1, &amp;quot;info&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;Alice&amp;quot;, &amp;quot;hobbies&amp;quot;: [&amp;quot;reading&amp;quot;, &amp;quot;hiking&amp;quot;]}},
    {&amp;quot;id&amp;quot;: 2, &amp;quot;info&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;Bob&amp;quot;, &amp;quot;hobbies&amp;quot;: [&amp;quot;cycling&amp;quot;]}},
    {&amp;quot;id&amp;quot;: 3, &amp;quot;info&amp;quot;: {&amp;quot;name&amp;quot;: &amp;quot;Charlie&amp;quot;, &amp;quot;hobbies&amp;quot;: [&amp;quot;music&amp;quot;, &amp;quot;swimming&amp;quot;]}}
]
hobbies_list = [hobby for user in users for hobby in user[&amp;quot;info&amp;quot;][&amp;quot;hobbies&amp;quot;]]
# Output: [&apos;reading&apos;, &apos;hiking&apos;, &apos;cycling&apos;, &apos;music&apos;, &apos;swimming&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we use nested comprehensions to access each user’s hobbies directly, extracting and flattening them into a single list. Combining comprehensions with structured data extraction saves time and simplifies code readability.&lt;/p&gt;
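If the double `for` clause looks opaque, it reads left to right exactly like the equivalent nested loops:

```python
users = [
    {"id": 1, "info": {"name": "Alice", "hobbies": ["reading", "hiking"]}},
    {"id": 2, "info": {"name": "Bob", "hobbies": ["cycling"]}},
    {"id": 3, "info": {"name": "Charlie", "hobbies": ["music", "swimming"]}},
]

# The double `for` clause in the comprehension desugars to these nested loops
hobbies_list = []
for user in users:
    for hobby in user["info"]["hobbies"]:
        hobbies_list.append(hobby)

print(hobbies_list)  # ['reading', 'hiking', 'cycling', 'music', 'swimming']
```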
&lt;h3&gt;Applying Conditional Transformations with Minimal Code&lt;/h3&gt;
&lt;p&gt;Sometimes, you may want to apply transformations conditionally, based on data patterns. Let’s say you have a dataset of transactions where each transaction has an amount and a type. Using pattern matching with comprehensions, you can easily apply different transformations based on transaction type.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;transactions = [
    {&amp;quot;type&amp;quot;: &amp;quot;credit&amp;quot;, &amp;quot;amount&amp;quot;: 100},
    {&amp;quot;type&amp;quot;: &amp;quot;debit&amp;quot;, &amp;quot;amount&amp;quot;: 50},
    {&amp;quot;type&amp;quot;: &amp;quot;credit&amp;quot;, &amp;quot;amount&amp;quot;: 200},
    {&amp;quot;type&amp;quot;: &amp;quot;debit&amp;quot;, &amp;quot;amount&amp;quot;: 75}
]
processed_transactions = [
    transaction[&amp;quot;amount&amp;quot;] * 1.05 if transaction[&amp;quot;type&amp;quot;] == &amp;quot;credit&amp;quot; else 
    transaction[&amp;quot;amount&amp;quot;] * 0.95 
    for transaction in transactions
]
# Output: [105.0, 47.5, 210.0, 71.25]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, credits are increased by 5%, while debits are reduced by 5%. By combining pattern matching logic with comprehensions, you can apply these conditional transformations in a single step, creating a clean, readable transformation pipeline.&lt;/p&gt;
&lt;h3&gt;Summary Statistics Based on Pattern Matches&lt;/h3&gt;
&lt;p&gt;In certain scenarios, you may need to compute statistics based on patterns within your data. Suppose you have a log of events, each with a different status, and you want to calculate the count of each status type. Using pattern matching along with dictionary comprehensions, you can efficiently create a summary of each event type.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;events = [
    {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;failure&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;pending&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;success&amp;quot;},
    {&amp;quot;status&amp;quot;: &amp;quot;failure&amp;quot;}
]

status_counts = {
    status: sum(1 for event in events if event[&amp;quot;status&amp;quot;] == status)
    for status in {event[&amp;quot;status&amp;quot;] for event in events}
}
# Output: {&apos;success&apos;: 3, &apos;failure&apos;: 2, &apos;pending&apos;: 1}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we use a set comprehension to collect unique statuses from the event log. Then, with a dictionary comprehension, we count occurrences of each status type by matching patterns within the dataset. This approach is concise and leverages both comprehensions and pattern-based logic to produce a summary efficiently.&lt;/p&gt;
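For plain frequency counts, the standard library's `collections.Counter` replaces the two comprehensions with a single pass over the data; a minimal alternative sketch:

```python
from collections import Counter

events = [
    {"status": "success"},
    {"status": "failure"},
    {"status": "success"},
    {"status": "pending"},
    {"status": "success"},
    {"status": "failure"},
]

# Counter tallies each status in one pass over a generator expression
status_counts = dict(Counter(event["status"] for event in events))
print(status_counts)  # {'success': 3, 'failure': 2, 'pending': 1}
```

The comprehension version scans the event list once per unique status, while `Counter` scans it exactly once, which matters on large logs.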
&lt;h2&gt;Performance Considerations&lt;/h2&gt;
&lt;p&gt;While pattern matching and comprehensions bring efficiency and readability to data processing tasks, it’s essential to consider their performance impact, especially when working with large datasets. Understanding when and how to use these features can help you write optimal code that balances readability with speed.&lt;/p&gt;
&lt;h3&gt;Efficiency of Comprehensions&lt;/h3&gt;
&lt;p&gt;List, set, and dictionary comprehensions are generally faster than traditional loops, as they are optimized at the Python interpreter level. However, when working with very large datasets, you may encounter memory limitations since comprehensions create an entire data structure in memory. In such cases, generator expressions (using parentheses instead of square brackets) can be a memory-efficient alternative, especially when iterating over large data without needing to store all elements at once.&lt;/p&gt;
&lt;p&gt;Example with generator expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;large_dataset = range(1_000_000)
# Only processes items one by one, conserving memory
squared_data = (x**2 for x in large_dataset if x % 2 == 0)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using a generator here allows you to process each element on-the-fly without creating a large list in memory, making it ideal for massive datasets.&lt;/p&gt;
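Aggregation functions can consume such a generator directly, pulling one element at a time so the full list never exists in memory; note that a generator is exhausted after a single pass:

```python
large_dataset = range(1_000_000)
squared_data = (x**2 for x in large_dataset if x % 2 == 0)

# sum() pulls one element at a time; no million-item list is built
total = sum(squared_data)

# A second pass yields nothing: the generator is already exhausted
assert sum(squared_data) == 0
```

If you need to iterate more than once, either rebuild the generator or fall back to a list comprehension.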
&lt;h3&gt;Pattern Matching in Large Datasets&lt;/h3&gt;
&lt;p&gt;Pattern matching is efficient for conditional branching and handling different data structures, but with complex nested data or highly conditional patterns, performance can be impacted. In these cases, try to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Simplify Patterns&lt;/strong&gt;: Use minimal and specific patterns for matches rather than broad cases, as fewer branches improve matching speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Avoid Deep Nesting&lt;/strong&gt;: Deeply nested patterns can increase matching complexity. When dealing with deeply structured data, consider preprocessing it into a flatter structure if possible.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batch Processing&lt;/strong&gt;: If you need to match patterns across a large dataset, consider processing data in batches. This approach can prevent excessive memory usage and improve cache efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Pattern matching is a valuable tool when handling diverse data structures or multiple conditional cases. However, for simpler conditional logic, traditional &lt;code&gt;if-elif&lt;/code&gt; statements may offer better performance. By keeping patterns straightforward and using batch processing when necessary, you can leverage pattern matching effectively even in large datasets.&lt;/p&gt;
&lt;h3&gt;Choosing Between Pattern Matching and Traditional Methods&lt;/h3&gt;
&lt;p&gt;Pattern matching is powerful, but it’s not always the most efficient choice. In scenarios where simple conditionals (&lt;code&gt;if-elif&lt;/code&gt; statements) suffice, traditional methods may be faster due to less overhead. Use pattern matching when you need to handle multiple cases or work with nested structures, but keep simpler constructs for straightforward conditions to maintain speed.&lt;/p&gt;
&lt;h3&gt;Combining Features for Optimal Performance&lt;/h3&gt;
&lt;p&gt;When combining comprehensions and pattern matching, remember:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Limit Data Structure Size&lt;/strong&gt;: Avoid creating large intermediate data structures with comprehensions if they’re not necessary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Leverage Generators for Streaming Data&lt;/strong&gt;: When processing large datasets with pattern matching, use generators within comprehensions or directly in your pattern-matching logic for memory-efficient processing.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Pattern matching and comprehensions are powerful features for writing clear and efficient code, but they require mindful usage in performance-critical applications. By understanding how to use these features effectively, data analysts and engineers can maximize their utility while keeping code performance optimal.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Python’s pattern matching and comprehension features provide an efficient way to handle complex data transformations, conditional logic, and data filtering. By leveraging these tools, data analysts and engineers can write cleaner, more concise code that is not only easier to read but also faster to execute in many cases. Pattern matching simplifies handling diverse data structures and nested formats, making it ideal for working with JSON files, dictionaries, and mixed-type records. Meanwhile, comprehensions streamline filtering, transformation, and aggregation tasks, all within single-line expressions.&lt;/p&gt;
&lt;p&gt;When used together, these features enable powerful data manipulation workflows, allowing you to handle large datasets with complex structures or conditional needs effectively. However, as with any tool, it’s essential to consider performance and memory implications, especially when working with very large datasets. By incorporating strategies like generator expressions and batch processing, you can keep your pattern matching and comprehension workflows both fast and memory-efficient.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Blog: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-in-depth-exploration-on-the-world-of-data-lakehouse-catalogs-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pymatching&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Lakehouse Catalog Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Hands-on with Apache Iceberg &amp; Dremio on Your Laptop within 10 Minutes</title><link>https://iceberglakehouse.com/posts/2024-10-hands-on-with-iceberg-dremio-laptop/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-hands-on-with-iceberg-dremio-laptop/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_...</description><pubDate>Thu, 31 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberggov&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberggov&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Efficiently managing and analyzing data is essential for business success, and the data lakehouse architecture is leading the way in making this easier and more cost-effective. By combining the flexibility of data lakes with the structured performance of data warehouses, lakehouses offer a powerful solution for data storage, querying, and governance.&lt;/p&gt;
&lt;p&gt;For this hands-on guide, we’ll dive into setting up a data lakehouse on your own laptop in just ten minutes using &lt;strong&gt;Dremio&lt;/strong&gt;, &lt;strong&gt;Nessie&lt;/strong&gt;, and &lt;strong&gt;Apache Iceberg&lt;/strong&gt;. This setup will enable you to perform analytics on your data seamlessly and leverage a versioned, Git-like approach to data management with pre-configured storage buckets for simplicity.&lt;/p&gt;
&lt;h3&gt;Tools We’ll Use:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio&lt;/strong&gt;: A lakehouse platform that organizes, documents, and queries data from databases, data warehouses, data lakes, and lakehouse catalogs in a unified semantic layer, providing seamless access to data for analytics and reporting.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nessie&lt;/strong&gt;: A transactional catalog that enables Git-like branching and merging capabilities for data, allowing for easier experimentation and version control.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;: A data lakehouse table format that turns your data lake into an ACID-compliant structure, supporting operations like time travel, schema evolution, and advanced partitioning.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By the end of this tutorial, you’ll be ready to set up a local lakehouse environment quickly, complete with sample data to explore. Let’s get started and see how easy it can be to work with Dremio and Apache Iceberg on your laptop!&lt;/p&gt;
&lt;h2&gt;Environment Setup&lt;/h2&gt;
&lt;p&gt;Before diving into the data lakehouse setup, let’s ensure your environment is ready. We’ll use &lt;strong&gt;Docker&lt;/strong&gt;, a tool that allows you to run applications in isolated environments called &amp;quot;containers.&amp;quot; If you’re new to Docker, don’t worry—this guide will walk you through each step!&lt;/p&gt;
&lt;h3&gt;Step 1: Install Docker&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Download Docker&lt;/strong&gt;: Go to &lt;a href=&quot;https://www.docker.com/products/docker-desktop/&quot;&gt;docker.com&lt;/a&gt; and download Docker Desktop for your operating system (Windows, macOS, or Linux).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Install Docker&lt;/strong&gt;: Follow the installation instructions for your operating system. This will include some on-screen prompts to complete the installation process.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Verify Installation&lt;/strong&gt;: After installing Docker, open a terminal (Command Prompt, PowerShell, or a terminal app on Linux/macOS) and type:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker --version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command should display the version number if Docker is successfully installed.&lt;/p&gt;
&lt;p&gt;Once Docker is installed and running, you’ll have the core tool needed to set up our data lakehouse.&lt;/p&gt;
&lt;h3&gt;Step 2: Create a Docker Compose File&lt;/h3&gt;
&lt;p&gt;With Docker installed, let’s move on to Docker Compose, a tool that helps you define and manage multiple containers with a single configuration file. We’ll use it to set up and start Dremio, Nessie, and MinIO (an S3-compatible storage solution). Docker Compose will also automatically create the storage &amp;quot;buckets&amp;quot; needed in MinIO, so you won’t need to configure them manually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open a Text Editor:&lt;/strong&gt; Open any text editor (like VS Code, Notepad, or Sublime Text) and create a new file called docker-compose.yml in a new, empty folder. This file will contain all the configuration needed to launch our environment.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Add the Docker Compose Configuration:&lt;/strong&gt; Copy the following code and paste it into the docker-compose.yml file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &amp;quot;3&amp;quot;

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      - iceberg
    ports:
      - 19120:19120
  # MinIO Storage Server
  ## Creates two buckets named lakehouse and lake
  ## tail -f /dev/null is to keep the container running
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    networks:
      - iceberg
    ports:
      - 9001:9001
      - 9000:9000
    command: [&amp;quot;server&amp;quot;, &amp;quot;/data&amp;quot;, &amp;quot;--console-address&amp;quot;, &amp;quot;:9001&amp;quot;]
    entrypoint: &amp;gt;
      /bin/sh -c &amp;quot;
      minio server /data --console-address &apos;:9001&apos; &amp;amp;
      sleep 5 &amp;amp;&amp;amp;
      mc alias set myminio http://localhost:9000 admin password &amp;amp;&amp;amp;
      mc mb myminio/lakehouse &amp;amp;&amp;amp;
      mc mb myminio/lake &amp;amp;&amp;amp;
      tail -f /dev/null
      &amp;quot;
  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      - iceberg

networks:
  iceberg:
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Explanation of the Code:&lt;/h3&gt;
&lt;p&gt;This file defines three services:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;nessie (the catalog)&lt;/li&gt;
&lt;li&gt;minio (the storage server)&lt;/li&gt;
&lt;li&gt;dremio (the query engine)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each service has specific network settings, ports, and configurations to allow them to communicate with each other.&lt;/p&gt;
&lt;h3&gt;Step 3: Start Your Environment&lt;/h3&gt;
&lt;p&gt;With your docker-compose.yml file saved, it’s time to start your data lakehouse environment!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Open a Terminal:&lt;/strong&gt; Navigate to the folder where you saved the docker-compose.yml file.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Run Docker Compose:&lt;/strong&gt; In your terminal, type:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command tells Docker to start each of the services defined in docker-compose.yml and run them in the background (detached mode, enabled by the &lt;code&gt;-d&lt;/code&gt; flag).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Wait for Setup to Complete:&lt;/strong&gt; It may take a few minutes for all services to start. You’ll see a lot of text in your terminal as each service starts up. When you see lines indicating that each service is &amp;quot;running,&amp;quot; the setup is complete.&lt;/p&gt;
&lt;h3&gt;Step 4: Verify Each Service is Running&lt;/h3&gt;
&lt;p&gt;Now that the environment is up, let’s verify that each service is accessible:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio:&lt;/strong&gt; Open a web browser and go to http://localhost:9047. You should see a Dremio login screen.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MinIO:&lt;/strong&gt; In a new browser tab, go to http://localhost:9001. Log in with the username admin and password password. You should see the MinIO console, where you can view storage &amp;quot;buckets.&amp;quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 5: Optional - Shutting Down the Environment&lt;/h3&gt;
&lt;p&gt;When you’re done with the setup and want to stop the services, simply open a terminal in the same folder where you created the &lt;code&gt;docker-compose.yml&lt;/code&gt; file and run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose down -v
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command stops and removes all containers, and the &lt;code&gt;-v&lt;/code&gt; flag also removes their associated volumes, so you can start completely fresh next time.&lt;/p&gt;
&lt;p&gt;Congratulations! You now have a fully functional data lakehouse environment running on your laptop. In the next section, we’ll connect Dremio to Nessie and MinIO and start creating and querying tables.&lt;/p&gt;
&lt;h2&gt;Getting Started with Dremio: Connecting the Nessie and MinIO Sources&lt;/h2&gt;
&lt;p&gt;Now that Dremio is up and running, let&apos;s connect it to our MinIO buckets, &lt;code&gt;lakehouse&lt;/code&gt; and &lt;code&gt;lake&lt;/code&gt;, which will act as the main data sources in our local lakehouse environment. This section will guide you through connecting both the Nessie catalog (using the &lt;code&gt;lakehouse&lt;/code&gt; bucket) and a general S3-like data lake connection (using the &lt;code&gt;lake&lt;/code&gt; bucket) in Dremio.&lt;/p&gt;
&lt;h3&gt;Step 1: Adding the Nessie Source in Dremio&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open Dremio&lt;/strong&gt;: In your web browser, navigate to &lt;a href=&quot;http://localhost:9047&quot;&gt;http://localhost:9047&lt;/a&gt; to access the Dremio UI.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add the Nessie Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Click on the &lt;strong&gt;&amp;quot;Add Source&amp;quot;&lt;/strong&gt; button in the bottom left corner of the Dremio interface.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Nessie&lt;/strong&gt; from the list of available sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure the Nessie Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’ll need to fill out both the &lt;strong&gt;General&lt;/strong&gt; and &lt;strong&gt;Storage&lt;/strong&gt; settings as follows:&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;General Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Set the source name to &lt;code&gt;lakehouse&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Endpoint URL&lt;/strong&gt;: Enter the Nessie API endpoint URL:&lt;pre&gt;&lt;code&gt;http://nessie:19120/api/v2
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: Select &lt;strong&gt;None&lt;/strong&gt; (no additional credentials are required).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Storage Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Set to &lt;code&gt;admin&lt;/code&gt; (MinIO username).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Set to &lt;code&gt;password&lt;/code&gt; (MinIO password).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Set to &lt;code&gt;lakehouse&lt;/code&gt; (this is the bucket where our Iceberg tables will be stored).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.path.style.access&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.endpoint&lt;/strong&gt;: Set to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dremio.s3.compat&lt;/strong&gt;: Set to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option since we’re running Nessie locally on HTTP.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;: Once all settings are configured, click &lt;strong&gt;Save&lt;/strong&gt;. The &lt;code&gt;lakehouse&lt;/code&gt; source will now be connected in Dremio, allowing you to browse and query tables stored in the Nessie catalog.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Adding MinIO as an S3 Source in Dremio (Data Lake Connection)&lt;/h3&gt;
&lt;p&gt;In addition to Nessie, we’ll set up a general-purpose data lake connection using the &lt;code&gt;lake&lt;/code&gt; bucket in MinIO. This bucket stores data that isn’t managed as Iceberg tables, making it suitable for raw data and other file types. If you want to upload CSV, JSON, XLS, or Parquet files, you can place them in the &amp;quot;lake&amp;quot; bucket and browse them from this source in Dremio.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add an S3 Source&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;&amp;quot;Add Source&amp;quot;&lt;/strong&gt; button again and select &lt;strong&gt;S3&lt;/strong&gt; from the list of sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure the S3 Source for MinIO&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the following settings to connect the &lt;code&gt;lake&lt;/code&gt; bucket as a secondary source.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;General Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Set the source name to &lt;code&gt;lake&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials&lt;/strong&gt;: Choose &lt;strong&gt;AWS access key&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Set to &lt;code&gt;admin&lt;/code&gt; (MinIO username).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Set to &lt;code&gt;password&lt;/code&gt; (MinIO password).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option since MinIO is running locally.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Advanced Options&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enable Compatibility Mode&lt;/strong&gt;: Set to &lt;code&gt;true&lt;/code&gt; to ensure compatibility with MinIO.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Set to &lt;code&gt;/lake&lt;/code&gt; (the bucket name for general storage).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.path.style.access&lt;/strong&gt;: Set this to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fs.s3a.endpoint&lt;/strong&gt;: Set to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;: After filling out the configuration, click &lt;strong&gt;Save&lt;/strong&gt;. The &lt;code&gt;lake&lt;/code&gt; bucket is now accessible in Dremio, and you can query the raw data stored in this bucket.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;With both sources connected, you now have access to structured, versioned data in the &lt;code&gt;lakehouse&lt;/code&gt; bucket and general-purpose data in the &lt;code&gt;lake&lt;/code&gt; bucket. In the next section, we’ll explore creating and querying Apache Iceberg tables in Dremio to see how easy it is to get started with data lakehouse workflows.&lt;/p&gt;
&lt;h2&gt;Running Transactions on Apache Iceberg Tables and Inspecting the Storage&lt;/h2&gt;
&lt;p&gt;With our environment set up and sources connected, we’re ready to perform some transactions on an Apache Iceberg table in Dremio. After creating and inserting data, we’ll inspect MinIO to see how Dremio stores files in the &lt;code&gt;lakehouse&lt;/code&gt; bucket. Additionally, we’ll make a &lt;code&gt;curl&lt;/code&gt; request to Nessie to check the catalog state, confirming our transactions.&lt;/p&gt;
&lt;h3&gt;Step 1: Creating an Iceberg Table in Dremio&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open the SQL Editor&lt;/strong&gt; in Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the Dremio UI, select &lt;strong&gt;SQL Runner&lt;/strong&gt; from the menu on the left.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Set the Context to Nessie&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the SQL editor, click on &lt;strong&gt;Context&lt;/strong&gt; (top right of the editor) and set it to our Nessie source, &lt;code&gt;lakehouse&lt;/code&gt;. If you skip this step, you&apos;ll need to use fully qualified table names in your queries, such as &lt;code&gt;lakehouse.customers&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create an Iceberg Table&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following SQL to create a new table named &lt;code&gt;customers&lt;/code&gt; in the &lt;code&gt;lakehouse&lt;/code&gt; bucket:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE customers (
  id INT,
  first_name VARCHAR,
  last_name VARCHAR,
  age INT
) PARTITION BY (truncate(1, last_name));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This SQL creates an Apache Iceberg table with a partition on the first letter of &lt;code&gt;last_name&lt;/code&gt;. The partitioning is handled by Apache Iceberg’s &lt;strong&gt;Hidden Partitioning&lt;/strong&gt; feature, which allows for advanced partitioning without additional columns in the schema.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Insert Data into the Table&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Now, add some sample data to the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO customers (id, first_name, last_name, age) VALUES
(1, &apos;John&apos;, &apos;Doe&apos;, 28),
(2, &apos;Jane&apos;, &apos;Smith&apos;, 34),
(3, &apos;Alice&apos;, &apos;Johnson&apos;, 22),
(4, &apos;Bob&apos;, &apos;Williams&apos;, 45),
(5, &apos;Charlie&apos;, &apos;Brown&apos;, 30);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This will insert five records into the &lt;code&gt;customers&lt;/code&gt; table, each automatically stored and partitioned in the &lt;code&gt;lakehouse&lt;/code&gt; bucket.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
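Conceptually, Iceberg's `truncate` transform on strings keeps the first N characters of the value, so `truncate(1, last_name)` buckets rows by the first letter of the last name. A plain-Python sketch of the idea (our illustration, not Dremio's or Iceberg's actual implementation):

```python
def truncate_string(width, value):
    # Iceberg's truncate transform on strings keeps the first `width` characters
    return value[:width]

last_names = ["Doe", "Smith", "Johnson", "Williams", "Brown"]
partitions = {truncate_string(1, name) for name in last_names}
print(sorted(partitions))  # ['B', 'D', 'J', 'S', 'W']
```

Because the transform is recorded in table metadata, queries filtering on `last_name` can prune partitions without the reader ever seeing a separate partition column.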
&lt;h3&gt;Step 2: Inspecting Files in MinIO&lt;/h3&gt;
&lt;p&gt;With data inserted into the &lt;code&gt;customers&lt;/code&gt; table, let’s take a look at MinIO to verify the files were created as expected.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open MinIO&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Go to &lt;a href=&quot;http://localhost:9001&quot;&gt;http://localhost:9001&lt;/a&gt; in your browser, and log in with:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Username&lt;/strong&gt;: &lt;code&gt;admin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Password&lt;/strong&gt;: &lt;code&gt;password&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Navigate to the &lt;code&gt;lakehouse&lt;/code&gt; Bucket&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;From the MinIO dashboard, click on &lt;strong&gt;Buckets&lt;/strong&gt; and select the &lt;code&gt;lakehouse&lt;/code&gt; bucket.&lt;/li&gt;
&lt;li&gt;Inside the &lt;code&gt;lakehouse&lt;/code&gt; bucket, you should see a directory for the &lt;code&gt;customers&lt;/code&gt; table.&lt;/li&gt;
&lt;li&gt;Browse through the folders to locate the partitioned files based on the &lt;code&gt;last_name&lt;/code&gt; column. You’ll find subfolders that store the data by partition, along with metadata files that track the state of the table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This inspection verifies that Dremio is writing data to the &lt;code&gt;lakehouse&lt;/code&gt; bucket in Apache Iceberg format, which organizes the data into Parquet files and metadata files.&lt;/p&gt;
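&lt;p&gt;As a rough sketch, the layout inside the bucket looks something like this (illustrative only; exact folder and file names are generated, vary by engine, and will differ in your bucket):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;lakehouse/
  customers/                 -- table folder (name may include a generated suffix)
    metadata/                -- *.metadata.json, snap-*.avro manifest lists, *.avro manifests
    last_name_trunc=D/       -- Parquet data files for one partition value
    last_name_trunc=S/
    ...
&lt;/code&gt;&lt;/pre&gt;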
&lt;h3&gt;Step 3: Checking the State of the Nessie Catalog with &lt;code&gt;curl&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Now, let’s make a &lt;code&gt;curl&lt;/code&gt; request to the Nessie catalog to confirm that the &lt;code&gt;customers&lt;/code&gt; table was created successfully and that its metadata is stored correctly.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open a Terminal&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In your terminal, run the following command to view the contents of the main branch in Nessie:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/main/entries&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command retrieves a list of all entries (tables) in the &lt;code&gt;main&lt;/code&gt; branch of the Nessie catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Review the Response&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The JSON response will contain details about the &lt;code&gt;customers&lt;/code&gt; table. You should see an entry indicating the presence of &lt;code&gt;customers&lt;/code&gt; in the catalog, confirming that the table is tracked in Nessie.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inspect Specific Commit History (Optional)&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To view the commit history for transactions on this branch, you can run:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/main/history&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command shows a log of all changes made on the &lt;code&gt;main&lt;/code&gt; branch, providing a Git-like commit history for your data transactions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
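&lt;p&gt;For reference, the entries response is shaped roughly like the following (an illustrative sketch; the exact fields, values, and ordering will differ in your environment):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;entries&amp;quot;: [
    {
      &amp;quot;name&amp;quot;: { &amp;quot;elements&amp;quot;: [&amp;quot;customers&amp;quot;] },
      &amp;quot;type&amp;quot;: &amp;quot;ICEBERG_TABLE&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;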
&lt;h3&gt;Next Steps&lt;/h3&gt;
&lt;p&gt;Now that you have verified your transactions and inspected the storage, you can confidently work with Apache Iceberg tables in Dremio, knowing that both the data and metadata are tracked in the Nessie catalog and accessible in MinIO. In the next section, we’ll explore making additional table modifications, like updating partitioning rules, and see how Apache Iceberg handles these changes seamlessly.&lt;/p&gt;
&lt;h2&gt;Modifying the Apache Iceberg Table Schema and Partitioning&lt;/h2&gt;
&lt;p&gt;With our initial &lt;code&gt;customers&lt;/code&gt; table set up in Dremio, we can take advantage of Apache Iceberg’s flexibility to make schema and partition modifications without requiring a data rewrite. In this section, we’ll add a new column to the table, adjust partitioning, and observe how these changes reflect in MinIO and the Nessie catalog.&lt;/p&gt;
&lt;h3&gt;Step 1: Adding a New Column&lt;/h3&gt;
&lt;p&gt;Suppose we want to add a new column to store customer email addresses. We can easily update the table schema with the following &lt;code&gt;ALTER TABLE&lt;/code&gt; statement:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open the SQL Editor&lt;/strong&gt; in Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Navigate back to the &lt;strong&gt;SQL Runner&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add the Column&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following SQL to add an &lt;code&gt;email&lt;/code&gt; column to the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE customers
ADD COLUMNS (email VARCHAR);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command adds the &lt;code&gt;email&lt;/code&gt; column to the existing table without affecting the existing data.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify the Column Addition&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;After running the command, you can confirm the addition by querying the &lt;code&gt;customers&lt;/code&gt; table in Dremio:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;You’ll see an &lt;code&gt;email&lt;/code&gt; column now appears, ready for data to be added.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Updating Partitioning Rules&lt;/h3&gt;
&lt;p&gt;Iceberg allows for flexible partitioning rules through &lt;strong&gt;Partition Evolution&lt;/strong&gt;, meaning we can change how data is partitioned without rewriting all existing data. Let’s add a new partition rule that organizes data based on the first letter of the &lt;code&gt;first_name&lt;/code&gt; as well.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a Partition Field&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To partition data by the first letter of &lt;code&gt;first_name&lt;/code&gt;, use the following SQL:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE customers
ADD PARTITION FIELD truncate(1, first_name);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command instructs Iceberg to partition any new data by both the first letters of &lt;code&gt;last_name&lt;/code&gt; and &lt;code&gt;first_name&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Insert Additional Data to Test the New Partitioning&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Let’s insert some more records to see how the new partition structure organizes the data:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO customers (id, first_name, last_name, age, email) VALUES
(6, &apos;Emily&apos;, &apos;Adams&apos;, 29, &apos;emily.adams@example.com&apos;),
(7, &apos;Frank&apos;, &apos;Baker&apos;, 35, &apos;frank.baker@example.com&apos;),
(8, &apos;Grace&apos;, &apos;Clark&apos;, 41, &apos;grace.clark@example.com&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This data will be partitioned according to both &lt;code&gt;first_name&lt;/code&gt; and &lt;code&gt;last_name&lt;/code&gt;, following the new rules we set.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Inspect the New Partitions in MinIO&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open MinIO&lt;/strong&gt; and navigate to the &lt;code&gt;lakehouse&lt;/code&gt; bucket:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Go to &lt;a href=&quot;http://localhost:9001&quot;&gt;http://localhost:9001&lt;/a&gt;, and log in with:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Username&lt;/strong&gt;: &lt;code&gt;admin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Password&lt;/strong&gt;: &lt;code&gt;password&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Locate the Updated &lt;code&gt;customers&lt;/code&gt; Folder&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Within the &lt;code&gt;lakehouse&lt;/code&gt; bucket, locate the &lt;code&gt;customers&lt;/code&gt; table folder.&lt;/li&gt;
&lt;li&gt;Open the folder structure to view the newly created subfolders, representing the partitioning by &lt;code&gt;last_name&lt;/code&gt; and &lt;code&gt;first_name&lt;/code&gt; that we configured. You should see the additional folders and Parquet files for each new partition based on &lt;code&gt;first_name&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 4: Confirm the Changes in Nessie with &lt;code&gt;curl&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Finally, let’s make a &lt;code&gt;curl&lt;/code&gt; request to the Nessie catalog to verify that the schema and partitioning changes are recorded in the catalog’s metadata.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open a Terminal&lt;/strong&gt; and run the following command to view the recent commits:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/main/history&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This will return a JSON response listing the recent commits to the &lt;code&gt;main&lt;/code&gt; branch, including the schema and partitioning updates.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;We’ve successfully modified the schema and partitioning of an Apache Iceberg table in Dremio, and we can observe these changes directly in MinIO’s file structure and the Nessie catalog’s metadata. This example demonstrates Iceberg’s flexibility in managing evolving schemas and partitioning strategies in real time, without downtime or data rewrites. In the next section, we’ll explore how to use Iceberg’s version control capabilities for branching and merging datasets within the Nessie catalog.&lt;/p&gt;
&lt;h2&gt;Branching and Merging with Nessie: Version Control for Data&lt;/h2&gt;
&lt;p&gt;One of the powerful features of using Nessie with Apache Iceberg is its Git-like branching and merging functionality. Branching allows you to create isolated environments for data modifications, which can then be merged back into the main branch once verified. This section will walk you through creating a branch, performing data modifications within that branch, and then merging those changes back to the main branch.&lt;/p&gt;
&lt;h3&gt;Step 1: Creating a Branch&lt;/h3&gt;
&lt;p&gt;Let’s start by creating a new branch in Nessie. This branch will allow us to perform data transactions without impacting the main data branch, ideal for testing and experimenting.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open the SQL Editor&lt;/strong&gt; in Dremio.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a New Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following SQL to create a new branch named &lt;code&gt;development&lt;/code&gt; in the &lt;code&gt;lakehouse&lt;/code&gt; catalog:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE BRANCH development IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command creates a new branch in the Nessie catalog, providing an isolated environment for data changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Switch to the Development Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Now, set the context to the &lt;code&gt;development&lt;/code&gt; branch, either with the context selector or by running the following SQL before your queries, so that any changes affect only this branch:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;USE BRANCH development IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Performing Data Modifications on the Branch&lt;/h3&gt;
&lt;p&gt;With the &lt;code&gt;development&lt;/code&gt; branch active, let’s modify the &lt;code&gt;customers&lt;/code&gt; table by adding new data. This data will remain isolated on the &lt;code&gt;development&lt;/code&gt; branch until we choose to merge it back to &lt;code&gt;main&lt;/code&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Insert Additional Records&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following SQL to add new entries to the &lt;code&gt;customers&lt;/code&gt; table (remember to set the context to &lt;code&gt;development&lt;/code&gt; first, via the context selector or a &lt;code&gt;USE BRANCH&lt;/code&gt; statement):&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO customers (id, first_name, last_name, age, email) VALUES
(9, &apos;Holly&apos;, &apos;Grant&apos;, 31, &apos;holly.grant@example.com&apos;),
(10, &apos;Ian&apos;, &apos;Young&apos;, 27, &apos;ian.young@example.com&apos;),
(11, &apos;Jack&apos;, &apos;Diaz&apos;, 39, &apos;jack.diaz@example.com&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;These records are added to the &lt;code&gt;customers&lt;/code&gt; table on the &lt;code&gt;development&lt;/code&gt; branch only, meaning they won’t affect the main branch until merged.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify the Records in the Development Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can verify the new records by running:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers AT BRANCH development;
SELECT * FROM customers AT BRANCH main;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Running both queries lets you compare the branches: the newly inserted records appear in the &lt;code&gt;development&lt;/code&gt; results but not yet in &lt;code&gt;main&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Merging Changes Back to the Main Branch&lt;/h3&gt;
&lt;p&gt;Once satisfied with the changes in &lt;code&gt;development&lt;/code&gt;, we can merge the &lt;code&gt;development&lt;/code&gt; branch back into &lt;code&gt;main&lt;/code&gt;, making these records available to all users accessing the main branch.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Switch to the Main Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, change the context back to the &lt;code&gt;main&lt;/code&gt; branch:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;USE BRANCH main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Merge the Development Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Now, merge the &lt;code&gt;development&lt;/code&gt; branch into &lt;code&gt;main&lt;/code&gt; using the following SQL:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE BRANCH development INTO main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This command brings all changes from &lt;code&gt;development&lt;/code&gt; into &lt;code&gt;main&lt;/code&gt;, adding the new records to the main version of the &lt;code&gt;customers&lt;/code&gt; table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify the Merge&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To confirm the records are now in &lt;code&gt;main&lt;/code&gt;, run:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers AT BRANCH main;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;You should see all records, including those added on the &lt;code&gt;development&lt;/code&gt; branch, now present in the &lt;code&gt;main&lt;/code&gt; branch.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 4: Verifying the Branching Activity in Nessie with &lt;code&gt;curl&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;You can use &lt;code&gt;curl&lt;/code&gt; commands to check the branch status and view commit logs in Nessie, providing additional validation of the branching and merging activity.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;List Branches&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following &lt;code&gt;curl&lt;/code&gt; command to list all branches in the &lt;code&gt;lakehouse&lt;/code&gt; catalog:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;The response will include the &lt;code&gt;main&lt;/code&gt; and &lt;code&gt;development&lt;/code&gt; branches, confirming the branch creation.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check the Commit Log&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;To view a log of commits, including the merge from &lt;code&gt;development&lt;/code&gt; to &lt;code&gt;main&lt;/code&gt;, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/main/history&amp;quot;

curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/development/history&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This log will show each commit, giving you a clear view of data versioning over time.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Branching and merging in Nessie allows you to safely experiment with data modifications in an isolated environment, integrating those changes back into the main dataset only when ready. This workflow is invaluable for testing data updates, creating data snapshots, or managing changes for compliance purposes. In the next section, we’ll explore how to use Nessie tags to mark important states in your data, further enhancing data version control.&lt;/p&gt;
&lt;h2&gt;Tagging Important States with Nessie: Creating Data Snapshots&lt;/h2&gt;
&lt;p&gt;In addition to branching, Nessie also offers the ability to tag specific states of your data, making it easy to create snapshots at critical moments. Tags allow you to mark key data versions—such as a quarterly report cutoff or pre-migration data state—so you can refer back to them if needed.&lt;/p&gt;
&lt;p&gt;In this section, we’ll walk through creating tags in Nessie to capture the current state of the data and explore how to use tags for historical analysis or recovery.&lt;/p&gt;
&lt;h3&gt;Step 1: Creating a Tag&lt;/h3&gt;
&lt;p&gt;Let’s create a tag on the &lt;code&gt;main&lt;/code&gt; branch to mark an important point in the dataset, such as the completion of initial data loading. This tag will serve as a snapshot that we can return to if necessary.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open the SQL Editor&lt;/strong&gt; in Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create a Tag&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following SQL command to create a tag called &lt;code&gt;initial_load&lt;/code&gt; on the &lt;code&gt;main&lt;/code&gt; branch:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TAG initial_load AT BRANCH main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This tag captures the state of all tables in the &lt;code&gt;lakehouse&lt;/code&gt; catalog on the &lt;code&gt;main&lt;/code&gt; branch at this exact moment.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Modifying the Data on the Main Branch&lt;/h3&gt;
&lt;p&gt;To understand the usefulness of tags, let’s make a few changes to the &lt;code&gt;customers&lt;/code&gt; table on the &lt;code&gt;main&lt;/code&gt; branch. Later, we can use the tag to compare or even restore to the original dataset state if needed.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Insert Additional Records&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add some new data to the &lt;code&gt;customers&lt;/code&gt; table to simulate further data processing:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO customers (id, first_name, last_name, age, email) VALUES
(12, &apos;Kate&apos;, &apos;Morgan&apos;, 45, &apos;kate.morgan@example.com&apos;),
(13, &apos;Luke&apos;, &apos;Rogers&apos;, 33, &apos;luke.rogers@example.com&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify Changes&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following query to confirm that the new records have been added:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Accessing Data from a Specific Tag&lt;/h3&gt;
&lt;p&gt;Tags in Nessie allow you to view the dataset as it was at the time the tag was created. To access the data at the &lt;code&gt;initial_load&lt;/code&gt; state, we can specify the tag as the reference point in our queries.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query the Data Using the Tag&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use the following SQL command to switch to the &lt;code&gt;initial_load&lt;/code&gt; tag and view the dataset as it was at that point:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;USE TAG initial_load IN lakehouse;
SELECT * FROM customers;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This query will display the &lt;code&gt;customers&lt;/code&gt; table as it was when the &lt;code&gt;initial_load&lt;/code&gt; tag was created, without the new records that were added afterward.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Return to the Main Branch&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Once you are done exploring the &lt;code&gt;initial_load&lt;/code&gt; state, switch back to the &lt;code&gt;main&lt;/code&gt; branch to continue working with the latest data:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;USE BRANCH main IN lakehouse;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 4: Verifying the Tag Creation with &lt;code&gt;curl&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;To verify the tag’s existence in the Nessie catalog, we can make a &lt;code&gt;curl&lt;/code&gt; request to list all tags, including &lt;code&gt;initial_load&lt;/code&gt;.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;List Tags&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the following &lt;code&gt;curl&lt;/code&gt; command to retrieve all references in the &lt;code&gt;lakehouse&lt;/code&gt; catalog, tags included:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;The JSON response lists every reference; the entries with type &lt;code&gt;TAG&lt;/code&gt; will include the &lt;code&gt;initial_load&lt;/code&gt; tag you created.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Review Tag Details&lt;/strong&gt; (Optional):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To get detailed information about the &lt;code&gt;initial_load&lt;/code&gt; tag, including the commit hash it points to, you can fetch the reference by name:&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v2/trees/initial_load&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Tags in Nessie provide a reliable way to snapshot important states of your data. By creating tags at critical points, you can easily access previous states of your data, helping to support data auditing, historical reporting, and data recovery. In the next section, we’ll cover querying the Apache Iceberg Metadata tables.&lt;/p&gt;
&lt;h2&gt;Exploring Iceberg Metadata Tables in Dremio&lt;/h2&gt;
&lt;p&gt;Iceberg metadata tables offer insights into the underlying structure and evolution of your data. These tables contain information about data files, snapshots, partition details, and more, allowing you to track changes, troubleshoot issues, and optimize queries. Dremio makes querying Iceberg metadata simple, giving you valuable context on your data lakehouse.&lt;/p&gt;
&lt;p&gt;In this section, we’ll explore the following Iceberg metadata tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;table_files&lt;/code&gt;: Lists data files and their statistics.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table_history&lt;/code&gt;: Displays historical snapshots.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table_manifests&lt;/code&gt;: Shows metadata about manifest files.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table_partitions&lt;/code&gt;: Provides details on partitions.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;table_snapshot&lt;/code&gt;: Shows information on each snapshot.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 1: Querying Data File Metadata with &lt;code&gt;table_files&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;table_files&lt;/code&gt; metadata table provides details on each data file in the table, such as the file path, size, record count, and more. This is useful for understanding storage distribution and optimizing queries.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query the Data Files&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following SQL command to retrieve data file information for the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_files(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;You’ll see results with columns like &lt;code&gt;file_path&lt;/code&gt;, &lt;code&gt;file_size_in_bytes&lt;/code&gt;, &lt;code&gt;record_count&lt;/code&gt;, and more, giving insights into each file&apos;s specifics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Exploring Table History with &lt;code&gt;table_history&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Iceberg tracks the history of a table’s snapshots, which allows you to review past states or even perform time-travel queries. The &lt;code&gt;table_history&lt;/code&gt; table displays each snapshot’s ID and timestamp.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query the Table History&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Use the following SQL to retrieve the history of the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_history(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;This query will return a list of snapshots, showing when each snapshot was created (&lt;code&gt;made_current_at&lt;/code&gt;), the &lt;code&gt;snapshot_id&lt;/code&gt;, and any &lt;code&gt;parent_id&lt;/code&gt; linking to previous snapshots.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Analyzing Manifests with &lt;code&gt;table_manifests&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;Manifest files are metadata files in Iceberg that track changes in data files. The &lt;code&gt;table_manifests&lt;/code&gt; table lets you inspect details like the number of files added or removed per snapshot, helping you monitor data evolution and resource usage.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query Manifest Metadata&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following SQL to view manifest metadata for the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_manifests(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;The results will include fields like &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;added_data_files_count&lt;/code&gt;, and &lt;code&gt;deleted_data_files_count&lt;/code&gt;, which show how each manifest contributes to the table’s state.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 4: Reviewing Partition Information with &lt;code&gt;table_partitions&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;table_partitions&lt;/code&gt; table provides details on each partition in the table, including the number of records and files in each partition. This helps with understanding how data is distributed across partitions and can be used to fine-tune partitioning strategies.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query Partition Statistics&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following query to get partition statistics for the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_partitions(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;You’ll see fields such as &lt;code&gt;partition&lt;/code&gt;, &lt;code&gt;record_count&lt;/code&gt;, and &lt;code&gt;file_count&lt;/code&gt;, which show the breakdown of data across partitions, helping identify skewed partitions or performance bottlenecks.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 5: Examining Snapshots with &lt;code&gt;table_snapshot&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;table_snapshot&lt;/code&gt; table provides a summary of each snapshot, including the operation (e.g., &lt;code&gt;append&lt;/code&gt;, &lt;code&gt;overwrite&lt;/code&gt;), the commit timestamp, and any manifest files associated with the snapshot.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query Snapshot Information&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Run the following SQL to see snapshot details for the &lt;code&gt;customers&lt;/code&gt; table:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM TABLE(table_snapshot(&apos;customers&apos;));
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;The result will include fields like &lt;code&gt;committed_at&lt;/code&gt;, &lt;code&gt;operation&lt;/code&gt;, and &lt;code&gt;summary&lt;/code&gt;, providing a high-level view of each snapshot and its impact on the table.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Using Metadata for Time-Travel Queries&lt;/h3&gt;
&lt;p&gt;The Iceberg metadata tables also support time-travel queries, enabling you to query the data as it was at a specific snapshot or timestamp. This can be especially useful for auditing, troubleshooting, or recreating analysis from past data states.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Perform a Time-Travel Query&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Suppose you want to view the data in the &lt;code&gt;customers&lt;/code&gt; table at a specific snapshot. First, retrieve the &lt;code&gt;snapshot_id&lt;/code&gt; using the &lt;code&gt;table_history&lt;/code&gt; or &lt;code&gt;table_snapshot&lt;/code&gt; table.&lt;/li&gt;
&lt;li&gt;Then, run a query like the following to access data at that snapshot:&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * FROM customers AT SNAPSHOT &apos;&amp;lt;snapshot_id&amp;gt;&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;&amp;lt;snapshot_id&amp;gt;&lt;/code&gt; with the ID from the metadata tables to view the data as it was at that specific point.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Iceberg metadata tables in Dremio provide a wealth of information on table structure, partitioning, and versioning. These tables are essential for monitoring table evolution, diagnosing performance issues, and executing advanced analytics tasks like time travel.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Congratulations! You’ve just set up a powerful data lakehouse environment on your laptop with Apache Iceberg, Dremio, and Nessie, and explored hands-on techniques for managing and analyzing data. By leveraging the strengths of these open-source tools, you now have the flexibility of data lakes with the performance and reliability of data warehouses—right on your local machine.&lt;/p&gt;
&lt;p&gt;From creating and querying Iceberg tables to managing branches and snapshots with Nessie’s Git-like controls, you’ve seen how this stack can simplify complex data workflows. Using Dremio’s intuitive interface, you connected sources, ran queries, explored metadata, and learned how to use Iceberg&apos;s versioning and partitioning capabilities for powerful insights. Iceberg metadata tables also provide detailed information on data structure, making it easy to track changes, optimize storage, and even run time-travel queries.&lt;/p&gt;
&lt;p&gt;This hands-on setup is just the beginning. As your data grows, you can explore Dremio’s cloud deployment options and advanced features like reflections and incremental refreshes for scaling analytics. By mastering this foundational environment, you’re well-prepared to build efficient, scalable data lakehouse solutions that balance data accessibility, cost savings, and performance.&lt;/p&gt;
&lt;p&gt;If you enjoyed this experience, consider diving deeper into Dremio Cloud or &lt;a href=&quot;https://www.dremio.com/blog/evaluating-dremio-deploying-a-single-node-instance-on-a-vm/?utm_source=ev_externalblog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=handson10minutes&amp;amp;utm_content=alexmerced&quot;&gt;exploring further capabilities with Iceberg and Nessie by deploying a self-managed single node instance&lt;/a&gt;. Happy querying!&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Modeling - Entities and Events</title><link>https://iceberglakehouse.com/posts/2024-10-data-modeling-events-entities/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-data-modeling-events-entities/</guid><description>
Structuring data thoughtfully is critical for both operational efficiency and analytical value. Data modeling helps us define the relationships, cons...</description><pubDate>Wed, 30 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Structuring data thoughtfully is critical for both operational efficiency and analytical value. Data modeling helps us define the relationships, constraints, and organization of data within our systems. One of the key decisions in data modeling is choosing between modeling for events or entities. Both approaches offer unique insights, but deciding when to use each can make or break the effectiveness of a data platform.&lt;/p&gt;
&lt;p&gt;In this blog, we’ll explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The core differences between events and entities in data modeling&lt;/li&gt;
&lt;li&gt;When to model for events versus entities&lt;/li&gt;
&lt;li&gt;Practical considerations and tips for structuring both event and entity models&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What are Events and Entities in Data Modeling?&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Entities&lt;/strong&gt; are the core objects or concepts we want to capture in a data model, such as “customer,” “product,” or “order.” Entities generally have attributes that describe their current state, and they’re often represented by records in databases, forming the foundation for operational data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Events&lt;/strong&gt; are records of actions or changes that occur over time, such as “customer purchases product,” “order is shipped,” or “user clicks on ad.” Events capture a point-in-time action or change and are typically structured with attributes that describe the context, like a timestamp, user ID, and details of the interaction.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;When to Model for Entities&lt;/h2&gt;
&lt;p&gt;Entity-based modeling is common for systems that need to manage the current state of real-world objects. Think of it as a way to describe &amp;quot;what exists&amp;quot; at any given time. Here are some scenarios when entity modeling works well:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Operational Reporting&lt;/strong&gt;: When you need a snapshot of the current state, such as an inventory of products or a list of active users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Master Data Management (MDM)&lt;/strong&gt;: For centralizing important business data, like customers, products, and vendors, ensuring consistent information across the organization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Relational Data&lt;/strong&gt;: When it’s essential to maintain relationships between entities, such as the connection between customers and orders, entity modeling helps define and enforce these relationships through foreign keys or join tables.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Design Considerations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unique Identifiers&lt;/strong&gt;: Use primary keys to ensure each entity has a unique identifier, supporting reliable lookups and references.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Attribute Consistency&lt;/strong&gt;: Define data types and constraints for each attribute to ensure data integrity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explicit Relationships&lt;/strong&gt;: Use foreign keys or association tables to explicitly model relationships between entities, making it easier to query connected data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By focusing on current states and clearly defined relationships, entity modeling enables consistent, reliable data management for applications and reporting.&lt;/p&gt;
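&lt;p&gt;As a rough sketch of these considerations in SQL (table and column names are purely illustrative), an entity model might look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Unique identifiers: every entity gets a primary key
CREATE TABLE customers (
  customer_id BIGINT PRIMARY KEY,
  name        VARCHAR(200) NOT NULL,   -- attribute consistency: explicit types and constraints
  email       VARCHAR(320) NOT NULL
);

-- Explicit relationships: a foreign key ties orders back to customers
CREATE TABLE orders (
  order_id    BIGINT PRIMARY KEY,
  customer_id BIGINT NOT NULL REFERENCES customers (customer_id),
  status      VARCHAR(50)    NOT NULL,
  total       DECIMAL(12, 2) NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;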
&lt;h2&gt;When to Model for Events&lt;/h2&gt;
&lt;p&gt;Event-based modeling is beneficial when you need to track activities over time. Events provide a record of actions and changes, allowing for deeper insights into patterns, trends, and user behaviors. Here are some scenarios when event modeling works well:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Customer Journey Tracking&lt;/strong&gt;: By recording each action a customer takes—such as logging in, browsing products, or making a purchase—you can build a comprehensive view of their journey and behavior patterns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Analytics&lt;/strong&gt;: In scenarios like fraud detection or monitoring application performance, a continuous stream of events allows for timely insights and anomaly detection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;System Monitoring&lt;/strong&gt;: Capturing logs, metrics, and performance indicators from systems helps in monitoring health, diagnosing issues, and improving performance through historical trends.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Design Considerations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Timestamps&lt;/strong&gt;: Each event should have a timestamp to establish when the action occurred, which is critical for sequencing and time-based analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unique Event IDs&lt;/strong&gt;: Use unique IDs to avoid duplicates and ensure traceability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contextual Attributes&lt;/strong&gt;: Include relevant attributes, such as user or session IDs, to tie events back to the entities involved, enriching the analysis with contextual data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Event modeling enables a time-series approach, capturing the &amp;quot;when&amp;quot; and &amp;quot;what happened,&amp;quot; allowing businesses to understand user behavior and trends in a dynamic, ongoing way.&lt;/p&gt;
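&lt;p&gt;A minimal event table reflecting these considerations might look like the following (names are illustrative, and exact types vary by engine):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE user_events (
  event_id   VARCHAR(36) PRIMARY KEY,  -- unique event ID for de-duplication and traceability
  event_ts   TIMESTAMP NOT NULL,       -- when the action occurred
  event_type VARCHAR(50) NOT NULL,     -- e.g. login, add_to_cart, purchase
  user_id    BIGINT NOT NULL,          -- contextual attribute linking the event to an entity
  session_id VARCHAR(64)               -- additional context for sessionized analysis
);
&lt;/code&gt;&lt;/pre&gt;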
&lt;h2&gt;Modeling Events vs. Entities: Key Differences&lt;/h2&gt;
&lt;p&gt;Understanding the core differences between event and entity modeling can help clarify when to use each approach. While entities capture the current state of key objects, events capture the actions that affect those objects over time. Here’s a quick comparison:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Entity Model&lt;/th&gt;
&lt;th&gt;Event Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Describe current state of objects&lt;/td&gt;
&lt;td&gt;Capture actions or changes over time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical Attributes&lt;/td&gt;
&lt;td&gt;Static (e.g., name, type, category)&lt;/td&gt;
&lt;td&gt;Dynamic (e.g., timestamp, event type, status)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Granularity&lt;/td&gt;
&lt;td&gt;One row per entity&lt;/td&gt;
&lt;td&gt;Multiple rows per entity, one per event&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Example Use Case&lt;/td&gt;
&lt;td&gt;Product catalog, customer list&lt;/td&gt;
&lt;td&gt;Clickstream, transaction history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema Evolution&lt;/td&gt;
&lt;td&gt;Slow-changing, handles updates infrequently&lt;/td&gt;
&lt;td&gt;Flexible, new event types can be added easily&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;By differentiating between the stable attributes of entities and the dynamic, timestamped nature of events, you can create a model that reflects both the current state and the historical actions within your data ecosystem. This approach supports a more comprehensive analysis, enabling better decision-making and richer insights.&lt;/p&gt;
&lt;h2&gt;Blending Events and Entities for Comprehensive Analysis&lt;/h2&gt;
&lt;p&gt;In many systems, combining event and entity models provides a more complete picture of both the current state and historical actions. For instance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;E-commerce Analytics&lt;/strong&gt;: Track events like “user clicks,” “adds to cart,” and “makes a purchase” while also modeling entities like “user,” “product,” and “order.” Together, these models offer insights into customer behavior and product popularity.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Behavior Analysis&lt;/strong&gt;: In social media platforms, users are entities, while their actions (such as likes, comments, and shares) are events. Combining these perspectives enables understanding of both user attributes and engagement patterns.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Approach to Combined Modeling&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Star Schema&lt;/strong&gt;: Use a star schema with entities as dimensions and events as fact tables to simplify relational analysis. Entities serve as the dimensions describing core objects, while events are stored in a central fact table to represent actions over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Layered Storage in Data Lakehouses&lt;/strong&gt;: For a data lakehouse, consider storing events as time-series data and entities as slowly changing dimensions. This setup allows flexible querying and joins as needed, balancing real-time and historical analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By blending event and entity models, you can leverage the strengths of each: entities for understanding the present and events for tracking change, creating a more robust foundation for both operational and analytical use cases.&lt;/p&gt;
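&lt;p&gt;Sketched in SQL (table and column names are illustrative), a star-schema query joins the central event fact table to its entity dimensions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- fact_purchases: one row per purchase event
-- dim_customers / dim_products: one row per entity
SELECT
  c.segment,
  p.category,
  COUNT(*)      AS purchases,
  SUM(f.amount) AS revenue
FROM fact_purchases f
JOIN dim_customers c ON f.customer_id = c.customer_id
JOIN dim_products  p ON f.product_id  = p.product_id
WHERE f.event_ts &gt;= DATE &apos;2024-10-01&apos;
GROUP BY c.segment, p.category;
&lt;/code&gt;&lt;/pre&gt;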
&lt;h2&gt;Practical Tips for Event and Entity Modeling&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Define Clear Boundaries&lt;/strong&gt;: Distinguish between data that represents &amp;quot;what exists&amp;quot; (entities) and data that represents &amp;quot;what happens&amp;quot; (events). For instance, customer information belongs to an entity model, while purchase transactions are better suited to an event model.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Schema-On-Read for Events&lt;/strong&gt;: Event data often benefits from a schema-on-read approach, especially in data lakes, where schemas are applied at query time. This flexibility allows you to adjust schema requirements as new events or attributes are introduced.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition and Index Event Data&lt;/strong&gt;: As event data grows rapidly, partitioning by time (such as by day or month) and indexing on frequently queried fields (like timestamps or user IDs) can significantly improve query performance, particularly for time-series analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consider Data Retention Policies&lt;/strong&gt;: Define how long you need to retain event versus entity data. Events can accumulate quickly and might only need to be stored for a set period, whereas entities may require long-term storage for operational consistency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Handle Schema Evolution Carefully&lt;/strong&gt;: Plan for schema evolution in both event and entity models to avoid compatibility issues. This is especially important when adding or modifying attributes over time, ensuring consistency in historical and current data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;By applying these tips, you can build data models that are flexible, efficient, and scalable, supporting both immediate and future analytics needs.&lt;/p&gt;
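&lt;p&gt;As one example of the partitioning tip above, Apache Iceberg&apos;s hidden-partitioning transforms make time-based partitioning of event data declarative (Spark SQL syntax; the table and column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE lakehouse.user_events (
  event_id   STRING,
  event_ts   TIMESTAMP,
  event_type STRING,
  user_id    BIGINT
)
USING iceberg
PARTITIONED BY (days(event_ts));  -- partition event data by day for time-series queries
&lt;/code&gt;&lt;/pre&gt;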
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Both events and entities have unique roles in data modeling, and understanding when to use each is crucial for building effective data platforms. Entity models help capture the current state of essential business objects, while event models record the actions and changes that occur over time. Together, they enable a more comprehensive view of both the &amp;quot;what&amp;quot; and the &amp;quot;when&amp;quot; of your data, supporting a range of use cases from real-time analytics to historical trend analysis.&lt;/p&gt;
&lt;p&gt;In many cases, a hybrid approach that combines events and entities will offer the most value, providing a snapshot of the present state alongside a timeline of interactions. This dual perspective not only strengthens operational reporting but also deepens insights into user behaviors and business processes.&lt;/p&gt;
&lt;p&gt;By understanding these fundamental modeling strategies and applying best practices, you can design a data model that is both adaptable and insightful—one that meets the analytical needs of today and scales with the demands of tomorrow.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;author&gt;Alex Merced&lt;/author&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;All About Parquet Part 01 - An Introduction&lt;/title&gt;&lt;link&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-01/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-01/&lt;/guid&gt;&lt;description&gt;
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Managing and processing large datasets efficiently is crucial for many organizations. One of the key factors in data efficiency is the format in which data is stored and retrieved. Among the numerous file formats available, &lt;strong&gt;Apache Parquet&lt;/strong&gt; has emerged as a popular choice, particularly in big data and cloud-based environments. But what exactly is the Parquet file format, and why is it so widely adopted? In this post, we’ll introduce you to the key concepts behind Parquet, its structure, and why it has become a go-to solution for data engineers and analysts alike.&lt;/p&gt;
&lt;h2&gt;What is Parquet?&lt;/h2&gt;
&lt;p&gt;Parquet is an &lt;strong&gt;open-source, columnar storage file format&lt;/strong&gt; designed for efficient data storage and retrieval. Unlike row-based formats (like CSV or JSON), Parquet organizes data by columns rather than rows, making it highly efficient for analytical workloads. Parquet is supported by a wide range of processing engines, such as Apache Spark, Dremio, and Presto, and it works seamlessly with cloud platforms like AWS S3, Google Cloud Storage, and Azure.&lt;/p&gt;
&lt;h2&gt;Why Use Parquet?&lt;/h2&gt;
&lt;p&gt;The design of Parquet provides several key benefits that make it ideal for large-scale data processing:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficient Compression&lt;/strong&gt;&lt;br&gt;
Parquet’s columnar format allows for highly efficient compression. Since data is stored by column, similar values are grouped together, making compression algorithms far more effective compared to row-based formats. This can significantly reduce the storage footprint of your datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Faster Queries&lt;/strong&gt;&lt;br&gt;
Columnar storage enables faster query execution for analytical workloads. When executing a query, Parquet allows data processing engines to scan only the columns relevant to the query, rather than reading the entire dataset. This reduces the amount of data that needs to be read, resulting in faster query times.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;&lt;br&gt;
Parquet supports schema evolution, which means you can modify the structure of your data (e.g., adding or removing columns) without breaking existing applications. This flexibility is particularly useful in dynamic environments where data structures evolve over time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cross-Platform Compatibility&lt;/strong&gt;&lt;br&gt;
Parquet is compatible with multiple languages and tools, including Python, Java, C++, and many data processing frameworks. This makes it an excellent choice for multi-tool environments where data needs to be processed by different systems.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
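&lt;p&gt;For example, adding a column to a Parquet-backed table in Spark SQL is a metadata-level change (table and column names are illustrative); previously written files remain readable, and the new column simply reads as NULL for older rows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE sales ADD COLUMNS (discount_code STRING);
&lt;/code&gt;&lt;/pre&gt;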
&lt;h2&gt;The Difference Between Row-Based and Columnar Formats&lt;/h2&gt;
&lt;p&gt;To fully understand the benefits of Parquet, it&apos;s essential to grasp the distinction between row-based and columnar file formats.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Row-based formats&lt;/strong&gt; store all the fields of a record together in sequence. Formats like CSV or JSON are row-based. These are suitable for transactional systems where entire rows need to be read and written frequently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Columnar formats&lt;/strong&gt;, like Parquet, store each column of a dataset together. This approach is advantageous for analytical workloads, where operations like aggregations or filters are performed on individual columns.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, in a dataset with millions of rows and many columns, if you only need to perform analysis on one or two columns, Parquet allows you to read just those columns, avoiding the need to scan the entire dataset.&lt;/p&gt;
&lt;h2&gt;Key Features of Parquet&lt;/h2&gt;
&lt;p&gt;Parquet is packed with features that make it well-suited for a wide range of data use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Columnar Storage&lt;/strong&gt;: As mentioned, the format stores data column-wise, making it ideal for read-heavy, analytical queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient Compression&lt;/strong&gt;: Parquet supports multiple compression algorithms (Snappy, Gzip, Brotli) that significantly reduce data size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Splittable Files&lt;/strong&gt;: Parquet files are splittable, meaning large files can be divided into smaller chunks for parallel processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rich Data Types&lt;/strong&gt;: Parquet supports complex nested data types, such as arrays, structs, and maps, allowing for flexible schema designs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;When to Use Parquet&lt;/h2&gt;
&lt;p&gt;Parquet is an excellent choice for scenarios where:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You have large datasets that need to be processed for analytics.&lt;/li&gt;
&lt;li&gt;Your queries often target specific columns in a dataset rather than entire rows.&lt;/li&gt;
&lt;li&gt;You need efficient compression to reduce storage costs.&lt;/li&gt;
&lt;li&gt;You&apos;re working in a distributed data environment, such as Hadoop, Spark, or cloud-based data lakes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, Parquet may not be ideal for small, frequent updates or transactional systems where row-based formats are more suitable.&lt;/p&gt;
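&lt;p&gt;As a quick sketch in Spark SQL (table and column names are illustrative), writing a table as Parquet and running a column-targeted aggregate looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Store the data in Parquet
CREATE TABLE sales_parquet
USING parquet
AS SELECT * FROM sales_staging;

-- Analytical query touching only two columns: the engine reads
-- just those column chunks, not the whole dataset
SELECT region, SUM(amount) AS total_sales
FROM sales_parquet
GROUP BY region;
&lt;/code&gt;&lt;/pre&gt;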
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The Apache Parquet file format is a powerful tool for efficiently storing and querying large datasets. With its columnar storage design, Parquet provides superior compression, faster query execution, and flexibility through schema evolution. These advantages make it a preferred choice for big data processing and cloud environments.&lt;/p&gt;
&lt;p&gt;In the upcoming parts of this blog series, we’ll dive deeper into Parquet’s architecture, how it handles compression, encoding, and how you can work with Parquet in various tools like Python, Spark, and Dremio.&lt;/p&gt;
&lt;p&gt;Stay tuned for the next post in this series: &lt;strong&gt;Parquet&apos;s Columnar Storage Model&lt;/strong&gt;.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;author&gt;Alex Merced&lt;/author&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;All About Parquet Part 02 - Parquet&apos;s Columnar Storage Model&lt;/title&gt;&lt;link&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-02/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-02/&lt;/guid&gt;&lt;description&gt;
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the first post of this series, we introduced the Apache Parquet file format and touched upon one of its key features—columnar storage. Now, we’ll take a deeper dive into what this columnar storage model is, how it works, and why it’s so efficient for big data analytics. Understanding Parquet&apos;s columnar architecture is key to leveraging its full potential in optimizing data storage and query performance.&lt;/p&gt;
&lt;h2&gt;What is Columnar Storage?&lt;/h2&gt;
&lt;p&gt;Columnar storage means that instead of storing rows of data together, the data for each column is stored separately. This might seem counterintuitive at first, but it has major benefits for certain types of workloads, particularly those where you’re analyzing or aggregating specific columns rather than accessing entire rows.&lt;/p&gt;
&lt;p&gt;In a row-based format like CSV or JSON, data is written and read one row at a time. Each row stores all fields together in sequence. On the other hand, in a columnar format like Parquet, all values for a single column are stored together. For instance, if you have a dataset with columns for &lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Age&lt;/code&gt;, and &lt;code&gt;Salary&lt;/code&gt;, all the values for the &lt;code&gt;Name&lt;/code&gt; column are stored in one block, all the values for the &lt;code&gt;Age&lt;/code&gt; column are stored in another, and so on.&lt;/p&gt;
&lt;h2&gt;Why is Columnar Storage Efficient?&lt;/h2&gt;
&lt;p&gt;The efficiency of columnar storage becomes clear when we consider the type of operations typically performed on large datasets in analytics. Let’s break down the advantages.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Faster Query Performance&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Columnar storage shines when your queries focus on a subset of columns. For example, if you want to calculate the average salary of employees in a large dataset, Parquet allows you to scan just the &lt;code&gt;Salary&lt;/code&gt; column without reading the entire dataset.&lt;/p&gt;
&lt;p&gt;In a row-based format, even though you&apos;re only interested in one column, the system has to read all the data in every row to retrieve the values for that column. This results in a lot of unnecessary I/O operations, slowing down query performance. With Parquet, only the columns you need are read, making queries significantly faster.&lt;/p&gt;
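&lt;p&gt;In SQL terms, the salary example above is simply (table and column names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Row-based format: every row is read in full just to extract Salary
-- Parquet: only the Salary column chunks are scanned
SELECT AVG(Salary) AS avg_salary
FROM employees;
&lt;/code&gt;&lt;/pre&gt;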
&lt;h3&gt;2. &lt;strong&gt;Better Compression&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Parquet&apos;s columnar structure also improves compression. Since similar data types are stored together, compression algorithms can be applied more effectively. For example, if a column contains repeated values or data that follows a consistent pattern (such as dates or integers), it can be compressed more efficiently.&lt;/p&gt;
&lt;p&gt;By grouping similar values together, columnar formats enable algorithms like &lt;strong&gt;dictionary encoding&lt;/strong&gt; or &lt;strong&gt;run-length encoding&lt;/strong&gt; to achieve high compression ratios. This leads to smaller file sizes, which means reduced storage costs and faster data transfers.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Efficient Aggregation&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Columnar storage is ideal for aggregation queries, such as calculating sums, averages, or counts. These types of operations often focus on specific columns. With Parquet, only the relevant columns need to be read into memory, which not only improves query speed but also reduces the overall resource usage.&lt;/p&gt;
&lt;h3&gt;4. &lt;strong&gt;Batch Processing and Parallelization&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Another benefit of Parquet’s columnar model is that it enables better parallel processing. Since columns are stored independently, data processing engines like Apache Spark can read different columns in parallel, further speeding up query execution. This makes Parquet a great fit for distributed computing environments, where parallelism is key to achieving high performance.&lt;/p&gt;
&lt;h2&gt;How Parquet Organizes Data&lt;/h2&gt;
&lt;p&gt;Understanding how Parquet organizes data internally can help you fine-tune how you store and query your datasets.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Columns and Row Groups&lt;/strong&gt;: Parquet organizes data into &lt;strong&gt;row groups&lt;/strong&gt;, which contain chunks of column data. A row group contains all the data for a subset of rows, but the data for each column is stored separately. This allows for efficient I/O when reading subsets of rows or columns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pages&lt;/strong&gt;: Within each column chunk, data is further divided into &lt;strong&gt;pages&lt;/strong&gt;. Parquet uses pages to store column data more granularly, which helps optimize compression and read performance. Each page is typically on the order of a megabyte or smaller, and Parquet stores statistics about the data in each page, making it easier to skip irrelevant pages during query execution.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Use Cases for Columnar Storage&lt;/h2&gt;
&lt;p&gt;Columnar storage formats like Parquet are most effective in the following scenarios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Analytics-Heavy Workloads&lt;/strong&gt;: If your workload involves a lot of analytical queries (e.g., calculating averages, filtering by certain columns), columnar formats will provide significant performance gains.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Big Data Environments&lt;/strong&gt;: Parquet is commonly used in distributed data environments where large datasets are stored in cloud data lakes (e.g., AWS S3, Google Cloud Storage). It works seamlessly with frameworks like Apache Spark and Presto, which are built to process data at scale.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Warehousing&lt;/strong&gt;: When designing data warehouses, storing data in Parquet allows you to run complex analytical queries efficiently while reducing storage costs due to Parquet’s high compression.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;When Not to Use Columnar Storage&lt;/h2&gt;
&lt;p&gt;While columnar storage offers significant advantages for read-heavy, analytical workloads, it may not be the best option for all use cases. For example, &lt;strong&gt;transactional systems&lt;/strong&gt; that involve frequent, small updates to data (like an online store&apos;s transaction log) may perform better with row-based formats, which are optimized for write-heavy operations. In such cases, the overhead of reading and writing data in columnar format may outweigh its benefits.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Parquet’s columnar storage model is what makes it a powerful tool for big data analytics. By organizing data by columns, Parquet allows for faster query performance, better compression, and more efficient aggregation. It’s designed to excel in environments where read-heavy workloads dominate and when your queries often target specific columns rather than entire datasets.&lt;/p&gt;
&lt;p&gt;In the next blog post, we’ll dive deeper into the &lt;strong&gt;file structure&lt;/strong&gt; of Parquet, exploring how data is organized into row groups, pages, and columns to optimize both storage and retrieval.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 3: &lt;strong&gt;Parquet File Structure: Pages, Row Groups, and Columns&lt;/strong&gt;.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;author&gt;Alex Merced&lt;/author&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;All About Parquet Part 03 - Parquet File Structure | Pages, Row Groups, and Columns&lt;/title&gt;&lt;link&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-03/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-03/&lt;/guid&gt;&lt;description&gt;
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the previous post, we explored the benefits of Parquet’s columnar storage model. Now, let’s delve deeper into the internal structure of a Parquet file. Understanding how Parquet organizes data into &lt;strong&gt;pages&lt;/strong&gt;, &lt;strong&gt;row groups&lt;/strong&gt;, and &lt;strong&gt;columns&lt;/strong&gt; will give you valuable insights into how Parquet achieves its efficiency in storage and query execution. This knowledge will also help you make informed decisions when working with Parquet files in your data pipelines.&lt;/p&gt;
&lt;h2&gt;The Hierarchical Structure of Parquet&lt;/h2&gt;
&lt;p&gt;Parquet uses a hierarchical structure to store data, consisting of three key components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Row Groups&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Columns&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pages&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These components work together to enable Parquet’s ability to store large datasets while optimizing for efficient read and write operations.&lt;/p&gt;
&lt;h3&gt;1. Row Groups&lt;/h3&gt;
&lt;p&gt;A &lt;strong&gt;row group&lt;/strong&gt; is a horizontal partition of data in a Parquet file. It contains all the column data for a subset of rows. Think of a row group as a container that holds the data for a chunk of rows. Each row group can be processed independently, allowing Parquet to perform parallel processing and read specific sections of the data without needing to load the entire dataset into memory.&lt;/p&gt;
&lt;h4&gt;Why Row Groups Matter&lt;/h4&gt;
&lt;p&gt;Row groups are crucial for performance. When querying data, especially in distributed systems like Apache Spark or Dremio, the ability to read only the row groups relevant to a query greatly improves efficiency. By splitting the dataset into row groups, Parquet minimizes the amount of data scanned during query execution, reducing both I/O and compute costs.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Row Group Size&lt;/strong&gt;: A typical row group size is set based on the expected query pattern and memory limitations of your processing engine. A smaller row group size allows for more parallelism, but increases the number of read operations. A larger row group size reduces the number of I/O operations but may increase memory usage during query execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Columns Within Row Groups&lt;/h3&gt;
&lt;p&gt;Within each row group, the data is stored column-wise. Each column in a row group is called a &lt;strong&gt;column chunk&lt;/strong&gt;. These column chunks hold the actual data values for each column in that row group.&lt;/p&gt;
&lt;p&gt;The columnar organization of data within row groups allows Parquet to take advantage of &lt;strong&gt;columnar compression&lt;/strong&gt; and query optimization techniques. As we mentioned in the previous blog, Parquet can skip reading entire columns that aren’t relevant to a query, further improving performance.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column Compression&lt;/strong&gt;: Since similar data types are stored together in a column chunk, Parquet can apply compression techniques such as &lt;strong&gt;dictionary encoding&lt;/strong&gt; or &lt;strong&gt;run-length encoding&lt;/strong&gt;, which work particularly well on columns with repeated values or patterns.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Pages: The Smallest Unit of Data&lt;/h3&gt;
&lt;p&gt;Within each column chunk, data is further divided into &lt;strong&gt;pages&lt;/strong&gt;, which are the smallest unit of data storage in Parquet. Pages help break down column chunks into more manageable sizes, making data more accessible and enabling better compression.&lt;/p&gt;
&lt;p&gt;The Parquet specification defines several page types; the ones you will encounter most often are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Pages&lt;/strong&gt;: These contain the actual values for a column.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dictionary Pages&lt;/strong&gt;: When dictionary encoding is used, these store the dictionary of unique values that the data pages reference.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Index Pages&lt;/strong&gt;: These store metadata such as min and max values for a range of data, which can be used for filtering during query execution. By storing statistics about the data, Parquet can skip reading entire pages that don’t match the query, speeding up execution.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Page Size and Its Impact&lt;/h4&gt;
&lt;p&gt;The page size in a Parquet file plays an important role in balancing read and write performance. Larger pages reduce the overhead of managing metadata but may lead to slower reads if the page contains irrelevant data. Smaller pages provide better granularity for skipping irrelevant data during queries, but they come with higher metadata overhead.&lt;/p&gt;
&lt;p&gt;By default, most Parquet writers target a data page size of about 1 MB, but this can be configured based on the specific needs of your workload.&lt;/p&gt;
&lt;h2&gt;The Role of Metadata in Parquet Files&lt;/h2&gt;
&lt;p&gt;Parquet files also store extensive metadata at multiple levels (file, row group, and page). This metadata contains useful information, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Column statistics&lt;/strong&gt;: Min, max, and null counts for each column.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compression schemes&lt;/strong&gt;: The compression algorithm used for each column chunk.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema&lt;/strong&gt;: The structure of the data, including data types and field names.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This metadata plays a crucial role in query optimization. For example, the column statistics allow query engines to skip row groups or pages that don’t contain data relevant to the query, significantly improving query performance.&lt;/p&gt;
&lt;h3&gt;File Metadata&lt;/h3&gt;
&lt;p&gt;At the file level, Parquet stores global metadata that describes the overall structure of the file, such as the number of row groups, the file schema, and encoding information for each column.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Footer&lt;/strong&gt;: Parquet stores this file-level metadata in the footer of the file. Readers parse the footer first, so they can learn the structure of the file and begin schema discovery and data exploration without scanning the entire dataset.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Row Group Metadata&lt;/h3&gt;
&lt;p&gt;Each row group also has its own metadata, which describes the columns it contains, the number of rows, and statistics for each column chunk. This enables efficient querying by allowing Parquet readers to filter out row groups that don’t meet the query conditions.&lt;/p&gt;
&lt;h2&gt;Optimizing Parquet File Structure&lt;/h2&gt;
&lt;p&gt;When working with Parquet files, optimizing the structure of your files based on the expected query patterns can lead to better performance. Here are some tips:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Row Group Size&lt;/strong&gt;: Adjust the row group size based on the memory capacity of your processing engine. If your engine has limited memory, smaller row groups might help avoid memory issues. Larger row groups can be beneficial when you need to minimize I/O operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Page Size&lt;/strong&gt;: Tuning the page size can improve compression and query performance. Smaller page sizes are better for queries that involve filters, as they allow more granular data skipping.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compression and Encoding&lt;/strong&gt;: Selecting the right compression algorithm and encoding scheme for your data type can make a significant difference in file size and query speed. For example, dictionary encoding is a good choice for columns with many repeated values.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The hierarchical structure of Parquet files—organized into row groups, columns, and pages—enables efficient storage and fast data access. By organizing data this way, Parquet minimizes unnecessary reads and maximizes the potential for parallel processing and compression.&lt;/p&gt;
&lt;p&gt;Understanding how these components interact helps you optimize your data storage and querying processes, ensuring that your data pipelines run as efficiently as possible.&lt;/p&gt;
&lt;p&gt;In the next blog post, we’ll explore &lt;strong&gt;schema evolution&lt;/strong&gt; in Parquet, diving into how Parquet handles changes in data structures over time and why this flexibility is key in dynamic data environments.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 4: &lt;strong&gt;Schema Evolution in Parquet&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 04 - Schema Evolution in Parquet</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-04/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-04/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When working with large datasets, schema changes—whether it’s adding new fields, modifying data types, or removing columns—are inevitable. This is where &lt;strong&gt;schema evolution&lt;/strong&gt; comes into play. In this post, we’ll dive into how Parquet handles schema changes and why this flexibility is essential in dynamic data environments. We’ll also explore how Parquet&apos;s schema evolution compares to other file formats and the practical implications for data engineers.&lt;/p&gt;
&lt;h2&gt;What is Schema Evolution?&lt;/h2&gt;
&lt;p&gt;In data management, a &lt;strong&gt;schema&lt;/strong&gt; defines the structure of your data, including the types, names, and organization of fields in a dataset. &lt;strong&gt;Schema evolution&lt;/strong&gt; refers to the ability to handle changes in the schema over time without breaking compatibility with the data that’s already stored. In other words, schema evolution allows you to modify the structure of your dataset without needing to rewrite or discard existing data.&lt;/p&gt;
&lt;p&gt;In Parquet, schema evolution is supported in a way that maintains backward and forward compatibility, allowing applications to continue reading data even when the schema changes. This is particularly useful in situations where data models evolve as new features are added, or as datasets are refined.&lt;/p&gt;
&lt;h2&gt;How Schema Evolution Works in Parquet&lt;/h2&gt;
&lt;p&gt;Parquet’s ability to handle schema evolution is one of its key advantages. When a Parquet file is written, the schema of the data is embedded in the file’s metadata. This schema is checked when data is read, ensuring that any discrepancies between the stored data and the expected structure are handled gracefully.&lt;/p&gt;
&lt;h3&gt;Common Schema Evolution Scenarios&lt;/h3&gt;
&lt;p&gt;Here are some common schema evolution scenarios and how Parquet handles them:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Adding New Columns&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;One of the most common schema changes is the addition of new columns to a dataset. For example, imagine you have a Parquet file that originally contains the columns &lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Age&lt;/code&gt;, and &lt;code&gt;Salary&lt;/code&gt;. Later, you decide to add a &lt;code&gt;Department&lt;/code&gt; column.&lt;/p&gt;
&lt;p&gt;In this case, Parquet handles the new column without any issues. Older Parquet files that do not have the &lt;code&gt;Department&lt;/code&gt; column will simply read that column as &lt;code&gt;null&lt;/code&gt; when queried. This is known as &lt;strong&gt;backward compatibility&lt;/strong&gt;: data written under the old schema remains readable even after the schema has been updated.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Removing Columns&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In some cases, you may want to remove a column that is no longer relevant. If you remove a column from the schema, Parquet will continue to read the old data, but the removed column will not be included in queries. This is known as &lt;strong&gt;backward compatibility&lt;/strong&gt;, meaning that even though the schema has changed, the old data can still be accessed.&lt;/p&gt;
&lt;p&gt;However, be cautious when removing columns, as some downstream applications or queries may still rely on that data. Parquet ensures that no data is lost, but the removed column will no longer appear in new data written after the schema change.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Changing Data Types&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Changing the data type of an existing column is trickier. Parquet stores a fixed physical type for each column chunk, so compatibility depends on the reading engine: many engines can promote old values at read time (for example, widening an &lt;code&gt;int&lt;/code&gt; column to a &lt;code&gt;float&lt;/code&gt; or a wider integer), but narrowing conversions are generally unsafe.&lt;/p&gt;
&lt;p&gt;While this approach preserves compatibility, it&apos;s important to note that changing data types can sometimes lead to unexpected results in queries, especially if precision is lost during conversion. It&apos;s always a good practice to carefully consider the implications of changing data types.&lt;/p&gt;
&lt;h3&gt;4. &lt;strong&gt;Renaming Columns&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Renaming a column is another common schema change. Parquet does not natively support renaming columns, but you can achieve this by adding a new column with the desired name and removing the old column. As a result, the renamed column will appear as a new addition in the schema, and older files will treat it as a missing column (reading it as &lt;code&gt;null&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;While this is not true &amp;quot;schema evolution&amp;quot; in the traditional sense, it is a common workaround in systems that rely on Parquet.&lt;/p&gt;
&lt;h3&gt;5. &lt;strong&gt;Reordering Columns&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In Parquet, the order of columns in the schema does not affect the ability to read the data. This means that if you change the order of columns, Parquet will still be able to read the file without any issues. Column order is not enforced when querying, allowing flexibility in how data is structured.&lt;/p&gt;
&lt;h2&gt;Schema Evolution in Other Formats&lt;/h2&gt;
&lt;p&gt;Compared to other file formats like CSV or Avro, Parquet’s schema evolution capabilities are particularly robust:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CSV&lt;/strong&gt;: Since CSV lacks a formal schema definition, it doesn’t support schema evolution. If the structure of your CSV file changes, you’ll need to rewrite the entire file or deal with errors when parsing the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Avro&lt;/strong&gt;: Like Parquet, Avro supports schema evolution. However, Avro focuses on row-based storage, making it more suitable for transactional systems than analytical workloads. Parquet’s columnar nature makes it more efficient for large-scale analytics, particularly when the schema evolves over time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ORC&lt;/strong&gt;: ORC, another columnar storage format, also supports schema evolution. However, Parquet is generally considered more flexible and is widely used in a variety of data processing systems.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Best Practices for Schema Evolution&lt;/h2&gt;
&lt;p&gt;Here are a few best practices to follow when working with schema evolution in Parquet:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Plan for Schema Changes Early&lt;/strong&gt;&lt;br&gt;
It’s always a good idea to anticipate potential schema changes when designing your data models. Adding new columns or changing data types is easier to manage if your data model is flexible from the start.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Nullable Fields&lt;/strong&gt;&lt;br&gt;
Adding new columns to a dataset is one of the most common schema changes. By making new fields nullable, you ensure that old data remains compatible with the updated schema.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Test Schema Changes in Staging Environments&lt;/strong&gt;&lt;br&gt;
Before deploying schema changes to production, test them in a staging environment. This allows you to catch potential issues related to backward or forward compatibility before they impact production systems.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Document Schema Changes&lt;/strong&gt;&lt;br&gt;
Keep detailed documentation of schema changes, especially if you are working in a team. This ensures that everyone understands the evolution of the data model and how to handle older versions of the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Leverage Data Catalogs&lt;/strong&gt;&lt;br&gt;
Using a data catalog or schema registry can help manage schema evolution across multiple Parquet files and datasets. Tools like Apache Hive Metastore or Nessie Catalog allow you to track schema versions and ensure compatibility.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Schema evolution is a powerful feature of the Parquet file format, enabling data engineers to adapt to changing data models without losing compatibility with existing datasets. By supporting the addition, removal, and modification of columns, Parquet provides flexibility and ensures that data remains accessible even as it evolves.&lt;/p&gt;
&lt;p&gt;Understanding how Parquet handles schema evolution allows you to build data pipelines that are resilient to change, helping you future-proof your data architecture.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore the various &lt;strong&gt;compression techniques&lt;/strong&gt; used in Parquet and how they help reduce file sizes while improving query performance.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 5: &lt;strong&gt;Compression Techniques in Parquet&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 05 - Compression Techniques in Parquet</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-05/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-05/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;One of the key benefits of using the Parquet file format is its ability to compress data efficiently, reducing storage costs while maintaining fast query performance. Parquet’s columnar storage model enables highly effective compression, as data of the same type is stored together, allowing compression algorithms to work more effectively. In this post, we’ll explore the various &lt;strong&gt;compression techniques&lt;/strong&gt; supported by Parquet, how they work, and how to choose the right one for your data.&lt;/p&gt;
&lt;h2&gt;Why Compression Matters&lt;/h2&gt;
&lt;p&gt;Compression is crucial for managing large datasets. By reducing the size of the data on disk, compression not only saves storage space but also improves query performance by reducing the amount of data that needs to be read from disk and transferred over networks.&lt;/p&gt;
&lt;p&gt;Parquet’s columnar storage format further enhances the efficiency of compression by storing similar data together, which often results in higher compression ratios than row-based formats. But not all compression algorithms are created equal—different techniques have varying impacts on file size, read/write performance, and CPU usage.&lt;/p&gt;
&lt;h2&gt;Compression Algorithms Supported by Parquet&lt;/h2&gt;
&lt;p&gt;Parquet supports several widely used compression algorithms, each with its own strengths and weaknesses. Here are the main compression options you can use when writing Parquet files:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Snappy&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Snappy&lt;/strong&gt; is one of the most popular compression algorithms used in Parquet due to its speed and reasonable compression ratio. It was developed by Google to provide a fast and lightweight compression method that is optimized for both speed and efficiency.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Fast compression and decompression, making it ideal for real-time queries and analytics workloads.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Provides a moderate compression ratio compared to other algorithms, meaning that it may not reduce file sizes as much as more aggressive compression methods.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Snappy is a good choice when you prioritize performance and need to process data quickly, especially for interactive queries where speed is more important than achieving the smallest file size.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Gzip&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Gzip&lt;/strong&gt; is a compression algorithm known for providing a high compression ratio, but it is slower than Snappy when it comes to both compressing and decompressing data. It is widely used in systems where saving storage space is a priority.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Provides better compression ratios compared to Snappy, resulting in smaller file sizes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Slower to compress and decompress data, making it less suitable for time-sensitive or interactive queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Gzip is a good option when you need to reduce storage costs significantly and query performance is less of a concern, such as for archiving data or when working with large, infrequently accessed datasets.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Brotli&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Brotli&lt;/strong&gt; is a newer compression algorithm developed by Google that offers even higher compression ratios than Gzip, with faster decompression. It is increasingly used in scenarios where both file size and decompression speed are important.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Higher compression ratios than Gzip and better decompression speed, making it a good balance between file size reduction and read performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Slower to compress data compared to Snappy or Gzip, but faster to decompress than Gzip.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Brotli is an excellent choice for compressing large datasets where both read performance and storage efficiency are important, such as in data lakes or cloud storage systems.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Zstandard (ZSTD)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Zstandard (ZSTD)&lt;/strong&gt; is a modern compression algorithm that provides high compression ratios with fast decompression speeds. ZSTD has gained popularity in recent years due to its versatility and ability to be tuned for both speed and compression ratio.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Provides a very good balance between compression speed, decompression speed, and file size reduction. ZSTD can be adjusted to favor either speed or compression ratio based on specific requirements.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Requires more configuration compared to simpler algorithms like Snappy or Gzip.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: ZSTD is ideal for scenarios where you need high compression ratios and fast decompression, such as for optimizing storage in data lakes while maintaining fast query performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;LZO&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;LZO&lt;/strong&gt; is another lightweight compression algorithm that focuses on fast decompression and is often used in real-time processing systems. However, it generally provides lower compression ratios compared to other algorithms like Gzip or Brotli.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Very fast decompression, making it suitable for real-time analytics and streaming data processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Lower compression ratios, which can result in larger file sizes compared to other algorithms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: LZO is a good choice when you need extremely fast data access and compression is less of a concern, such as in streaming applications or low-latency analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Choosing the Right Compression Algorithm&lt;/h2&gt;
&lt;p&gt;Selecting the right compression algorithm for your Parquet files depends on your specific use case and the balance you want to achieve between compression efficiency and performance. Here are some considerations to help guide your decision:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Speed vs. File Size&lt;/strong&gt;: If your workload requires fast query performance, prioritize algorithms like Snappy or ZSTD that decompress quickly, even if they provide slightly larger file sizes. If storage space is more important, algorithms like Gzip or Brotli may be better suited due to their higher compression ratios.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Type and Repetition&lt;/strong&gt;: Some compression algorithms work better on certain data types. For example, dictionary encoding combined with Gzip or Brotli works well on columns with many repeated values. Snappy or LZO might be better for columns with highly variable data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage Costs&lt;/strong&gt;: For workloads where storage costs are a primary concern (e.g., archiving large datasets), Gzip and Brotli will provide the smallest file sizes, which can lead to significant cost savings in cloud storage environments.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real-Time Processing&lt;/strong&gt;: For real-time analytics or systems where low-latency access to data is critical, Snappy or LZO should be the preferred options due to their fast decompression speeds.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Combining Compression with Encoding&lt;/h2&gt;
&lt;p&gt;In addition to choosing a compression algorithm, Parquet allows you to pair compression with various encoding techniques, such as &lt;strong&gt;dictionary encoding&lt;/strong&gt; or &lt;strong&gt;run-length encoding (RLE)&lt;/strong&gt;. This combination can further optimize storage efficiency, especially for columns with repetitive values.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dictionary Encoding&lt;/strong&gt;: Works well with columns that contain many repeated values, like categorical data. Pairing dictionary encoding with Gzip or ZSTD can lead to significant reductions in file size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run-Length Encoding (RLE)&lt;/strong&gt;: This encoding is particularly useful for columns with consecutive repeated values, such as timestamps or sequences. Combining RLE with a high-compression algorithm like Brotli can achieve very high compression ratios.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Compression is a critical aspect of managing large datasets, and Parquet’s support for multiple compression algorithms allows you to optimize your data storage and processing based on the specific needs of your workload. Whether you prioritize query performance with Snappy or aim for maximum storage efficiency with Gzip or Brotli, Parquet’s flexibility ensures that you can strike the right balance between speed and file size.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore &lt;strong&gt;encoding techniques&lt;/strong&gt; in Parquet, diving deeper into how encoding works and how it complements compression for efficient data storage.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 6: &lt;strong&gt;Encoding in Parquet: Optimizing for Storage&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 06 - Encoding in Parquet | Optimizing for Storage</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-06/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-06/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the last blog, we explored the various compression techniques supported by Parquet to reduce file size and improve query performance. But compression alone isn’t enough to maximize storage efficiency. Parquet also utilizes &lt;strong&gt;encoding techniques&lt;/strong&gt; to further optimize how data is stored, especially for columns with repetitive or predictable patterns. In this post, we’ll dive into how encoding works in Parquet, the different types of encoding it supports, and how to use them to reduce storage footprint while maintaining performance.&lt;/p&gt;
&lt;h2&gt;What is Encoding in Parquet?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Encoding&lt;/strong&gt; is the process of transforming data into a more efficient format to save space without losing information. In Parquet, encoding is applied to column data before compression. While compression algorithms focus on reducing redundancy at the byte level, encoding techniques work on the logical structure of the data, particularly for columns with repeating or predictable values.&lt;/p&gt;
&lt;p&gt;By using encoding in combination with compression, Parquet achieves smaller file sizes and faster query performance. The choice of encoding is determined by the characteristics of the data in each column. Let’s take a look at the most common encoding techniques used in Parquet.&lt;/p&gt;
&lt;h2&gt;Types of Encoding in Parquet&lt;/h2&gt;
&lt;p&gt;Parquet supports several encoding techniques, each designed for specific types of data patterns. Here are the most commonly used ones:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Dictionary Encoding&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Dictionary encoding&lt;/strong&gt; is one of the most effective techniques for columns that contain repeated values. It works by creating a dictionary of unique values and then replacing each value in the column with a reference to the dictionary. This significantly reduces the amount of data stored, especially for categorical data.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: For a column that contains many repeated values (e.g., a &amp;quot;Department&amp;quot; column with repeated entries like &amp;quot;Sales,&amp;quot; &amp;quot;Marketing,&amp;quot; etc.), Parquet creates a dictionary of these unique values. Each value in the original column is then replaced with a small integer that refers to its position in the dictionary. The dictionary itself is stored once per column, making it very efficient.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Dictionary encoding is highly effective for columns with a limited number of unique values (e.g., categorical data, zip codes, or status flags).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Reduces storage size significantly for columns with repeated values, especially when paired with compression algorithms like Gzip or Brotli.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: May not be as effective for columns with a high number of unique values.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. &lt;strong&gt;Run-Length Encoding (RLE)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Run-Length Encoding (RLE)&lt;/strong&gt; is another powerful technique for compressing columns with consecutive repeating values. It works by storing the value once along with the number of times it repeats, instead of storing the repeated value multiple times.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: If a column contains long sequences of the same value (e.g., a &amp;quot;Status&amp;quot; column where many consecutive rows have the status &amp;quot;Active&amp;quot;), RLE stores the value once and records the number of times it repeats, rather than writing the value for each row. For example, instead of storing &amp;quot;Active&amp;quot; 100 times, RLE stores &amp;quot;Active: 100&amp;quot;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: RLE is ideal for columns with consecutive repeated values, such as status flags, binary values, or sorted columns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Very effective at reducing file size for columns with repeated or sorted data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Less effective on columns with highly variable data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. &lt;strong&gt;Bit-Packing&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Bit-packing&lt;/strong&gt; is an encoding technique that reduces the number of bits used to store small integers. Instead of storing each integer as a fixed-size 32-bit or 64-bit value, bit-packing uses only as many bits as are needed to represent the largest value in the group, so every value fits in the same compact width. This is particularly useful for columns that contain small integers, such as IDs or categorical data with a limited number of categories.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: If a column contains small integer values (e.g., a column with values ranging from 0 to 10), Parquet will use only 4 bits per value instead of 32 or 64 bits. This greatly reduces the amount of space required to store the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Bit-packing is effective for columns containing small integer values, such as IDs, ratings, or categorical data with a limited range of possible values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Reduces the number of bits used for small integers, leading to smaller file sizes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Less effective for columns with large integer values or wide ranges of possible values.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. &lt;strong&gt;Delta Encoding&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Delta encoding&lt;/strong&gt; is used to store differences between consecutive values rather than storing the full values themselves. This works well for columns where values are close together or follow a predictable pattern, such as timestamps, IDs, or monotonically increasing numbers.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: Instead of storing the full value for each row, delta encoding stores the difference between each consecutive value and the previous one. For example, if a timestamp column contains values like 10, 12, 14, 16, delta encoding would store 10, 2, 2, 2, where each subsequent value is the difference from the previous one.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Delta encoding is effective for columns with ordered or predictable data patterns, such as timestamps, sequence numbers, or sorted columns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Greatly reduces file size for columns with predictable patterns or ordered values.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Less effective for columns with random or unordered data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. &lt;strong&gt;Plain Encoding&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Plain encoding&lt;/strong&gt; is Parquet&apos;s baseline encoding and serves as the fallback for columns where no other encoding is more effective. It simply stores the values as they are, without any additional transformation.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How It Works&lt;/strong&gt;: For columns where values vary greatly or where no pattern is detectable, plain encoding stores the values as-is. This encoding method is often used for strings, floating-point numbers, and other complex data types that do not benefit from the other encoding techniques.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Case&lt;/strong&gt;: Plain encoding is used for columns where no significant reduction in size can be achieved through other encoding methods.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;: Simple and effective when no patterns or repetition exist in the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;: Offers no additional compression or size reduction.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Combining Encoding with Compression&lt;/h2&gt;
&lt;p&gt;The true power of Parquet comes from combining encoding with compression. For example, using &lt;strong&gt;dictionary encoding&lt;/strong&gt; for a column with many repeated values, followed by &lt;strong&gt;Gzip&lt;/strong&gt; compression, can lead to significant reductions in file size. Similarly, &lt;strong&gt;run-length encoding&lt;/strong&gt; paired with &lt;strong&gt;ZSTD&lt;/strong&gt; compression works well for columns with repeated sequences.&lt;/p&gt;
&lt;p&gt;Here are some common pairings of encoding and compression techniques:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dictionary Encoding + Gzip&lt;/strong&gt;: Effective for categorical data or columns with repeated values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run-Length Encoding + Brotli&lt;/strong&gt;: Works well for sorted or repeating columns, such as status flags or binary values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta Encoding + ZSTD&lt;/strong&gt;: Ideal for columns with ordered values, like timestamps or sequence numbers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Optimizing Encoding for Performance&lt;/h2&gt;
&lt;p&gt;While encoding can reduce file size, it’s important to balance encoding choices with query performance. Certain encoding techniques, such as dictionary encoding, can improve query speed by reducing the amount of data that needs to be scanned. However, overly aggressive encoding can sometimes lead to slower read performance if it adds too much complexity to the decoding process.&lt;/p&gt;
&lt;p&gt;Here are some tips for optimizing encoding in Parquet:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Test Encoding with Your Queries&lt;/strong&gt;: Different workloads may benefit from different encoding techniques. Test how your queries perform with various encoding options to find the best balance between file size and performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Statistics to Skip Data&lt;/strong&gt;: Parquet files store column-level statistics (such as min/max values) that can help query engines skip irrelevant data. Pairing encoding with Parquet’s built-in statistics allows for faster query execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Leverage Columnar Design&lt;/strong&gt;: Since Parquet stores data column-wise, different columns can use different encoding techniques based on their data patterns. Optimize encoding for each column based on its characteristics.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Encoding is a powerful tool for optimizing storage and performance in Parquet files. By choosing the right encoding technique for each column, you can reduce file size while maintaining fast query performance. Whether you’re working with categorical data, ordered values, or repeated patterns, Parquet’s flexible encoding options allow you to tailor your data storage to fit your workload’s specific needs.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll dive into how &lt;strong&gt;metadata&lt;/strong&gt; is used in Parquet files to further optimize data retrieval and improve query efficiency.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 7: &lt;strong&gt;Metadata in Parquet: Improving Data Efficiency&lt;/strong&gt;.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;author&gt;Alex Merced&lt;/author&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;All About Parquet Part 07 - Metadata in Parquet | Improving Data Efficiency&lt;/title&gt;&lt;link&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-07/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-07/&lt;/guid&gt;&lt;description&gt;
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the previous posts, we’ve covered how Parquet optimizes storage through columnar storage, compression, and encoding. Now, let’s explore another essential feature that sets Parquet apart: &lt;strong&gt;metadata&lt;/strong&gt;. Metadata in Parquet plays a crucial role in improving data efficiency, enabling faster queries and optimized storage. In this post, we’ll dive into the different types of metadata stored in Parquet files, how metadata improves query performance, and best practices for leveraging metadata in your data pipelines.&lt;/p&gt;
&lt;h2&gt;What is Metadata in Parquet?&lt;/h2&gt;
&lt;p&gt;In Parquet, &lt;strong&gt;metadata&lt;/strong&gt; refers to information about the data stored within the file. This information includes things like the structure of the file (schema), statistics about the data, compression details, and more. Metadata is stored at various levels in a Parquet file: file-level, row group-level, and column-level.&lt;/p&gt;
&lt;p&gt;By storing rich metadata alongside the actual data, Parquet allows query engines to make decisions about which data to read, which rows to skip, and how to optimize query execution without scanning the entire dataset.&lt;/p&gt;
&lt;h2&gt;Types of Metadata in Parquet&lt;/h2&gt;
&lt;p&gt;Parquet files store metadata at three levels:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;File-level metadata&lt;/strong&gt;: Information about the overall file, such as schema and version information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Row group-level metadata&lt;/strong&gt;: Statistics about subsets of rows (row groups), such as row count, column sizes, and compression.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column-level metadata&lt;/strong&gt;: Detailed statistics about individual columns, such as minimum and maximum values, null counts, and data types.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let’s take a closer look at each type of metadata and how it improves performance.&lt;/p&gt;
&lt;h3&gt;1. File-Level Metadata&lt;/h3&gt;
&lt;p&gt;File-level metadata describes the structure of the entire Parquet file. This includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema&lt;/strong&gt;: The schema defines the structure of the data, including column names, data types, and the hierarchical structure of nested fields.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Number of row groups&lt;/strong&gt;: This specifies how many row groups are stored in the file. Each row group contains data for a specific range of rows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version information&lt;/strong&gt;: This indicates which version of the Parquet format was used to write the file, ensuring compatibility with different readers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;File-level metadata is stored in the &lt;strong&gt;footer&lt;/strong&gt; of the Parquet file, which means it is read first when opening the file. Query engines can use this information to understand the overall structure of the data and determine how to process it efficiently.&lt;/p&gt;
&lt;h3&gt;2. Row Group-Level Metadata&lt;/h3&gt;
&lt;p&gt;Parquet files are divided into &lt;strong&gt;row groups&lt;/strong&gt;, and each row group contains a horizontal partition of the data (i.e., a subset of rows). Row group-level metadata provides summary information about the rows contained in each row group, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Row count&lt;/strong&gt;: The number of rows stored in each row group.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Column chunk sizes&lt;/strong&gt;: The size of each column chunk within the row group, which is useful for estimating the cost of reading specific columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compression and encoding details&lt;/strong&gt;: Information about the compression algorithm and encoding technique used for each column in the row group.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This metadata allows query engines to skip entire row groups if they’re irrelevant to the query. For example, if a query is filtering for rows where a specific column’s value falls within a certain range, the engine can skip row groups where the column’s values do not meet the filter criteria.&lt;/p&gt;
&lt;h3&gt;3. Column-Level Metadata (Statistics)&lt;/h3&gt;
&lt;p&gt;Perhaps the most powerful type of metadata in Parquet is &lt;strong&gt;column-level statistics&lt;/strong&gt;. These statistics provide detailed information about the values stored in each column and include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Minimum and Maximum Values&lt;/strong&gt;: The minimum and maximum values for each column. This allows query engines to quickly eliminate irrelevant data by skipping over row groups or pages that do not match query conditions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Null Counts&lt;/strong&gt;: The number of null values in each column, which helps optimize queries that filter based on null values.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distinct Count&lt;/strong&gt;: Some implementations may include distinct count metadata for columns, which can help in estimating cardinality for query optimization.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These statistics are stored both at the &lt;strong&gt;row group&lt;/strong&gt; level and at the &lt;strong&gt;page&lt;/strong&gt; level, giving query engines fine-grained control over which data to read and which data to skip.&lt;/p&gt;
&lt;h3&gt;4. File Footer&lt;/h3&gt;
&lt;p&gt;The file footer in a Parquet file is where all the metadata is stored. When a query engine accesses a Parquet file, it first reads the footer to understand the file structure, row group layout, and column statistics. This enables query optimization before even touching the actual data.&lt;/p&gt;
&lt;h2&gt;How Metadata Improves Query Performance&lt;/h2&gt;
&lt;p&gt;One of the biggest advantages of Parquet’s rich metadata is its ability to enable &lt;strong&gt;predicate pushdown&lt;/strong&gt;. Predicate pushdown is the process of applying filter conditions (predicates) as early as possible in the query execution process to minimize the amount of data that needs to be read.&lt;/p&gt;
&lt;p&gt;For example, consider a query that filters for rows where the value in the &lt;code&gt;Age&lt;/code&gt; column is greater than 30. With Parquet’s column-level metadata, the query engine can use the &lt;strong&gt;min/max&lt;/strong&gt; statistics to skip entire row groups or pages where the &lt;code&gt;Age&lt;/code&gt; column’s maximum value is less than or equal to 30. This significantly reduces the amount of data read from disk, resulting in faster query execution.&lt;/p&gt;
&lt;h3&gt;Other Ways Metadata Optimizes Queries&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Column Pruning&lt;/strong&gt;: Since Parquet is a columnar format, queries that only require specific columns can skip over irrelevant columns. The metadata helps identify which columns are needed for the query and ensures that only those columns are read.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Row Group Skipping&lt;/strong&gt;: If a query involves filtering based on a column’s value, Parquet’s row group-level metadata allows the query engine to skip entire row groups that do not match the filter condition. This reduces the number of rows that need to be scanned.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Page Skipping&lt;/strong&gt;: In addition to skipping row groups, metadata stored at the page level (within row groups) allows fine-grained control, letting the query engine skip pages that do not match query conditions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Best Practices for Leveraging Metadata in Parquet&lt;/h2&gt;
&lt;p&gt;To maximize the benefits of Parquet’s metadata, consider these best practices:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Choose Appropriate Row Group Size&lt;/strong&gt;: Row groups are the primary unit of parallelism and skipping in Parquet. For large datasets, selecting the right row group size is important to balance between performance and memory usage. Smaller row groups allow more precise skipping, but can increase the overhead of metadata management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enable Statistics Collection&lt;/strong&gt;: Ensure that Parquet writers are configured to collect column statistics, as this enables features like predicate pushdown and page skipping. Most modern processing frameworks (like Apache Spark, Dremio, and Hive) enable statistics collection by default.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimize for Query Patterns&lt;/strong&gt;: If your workload involves frequent filtering on specific columns, consider sorting the data based on those columns. This can make min/max statistics more effective for skipping irrelevant data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Keep Metadata Manageable&lt;/strong&gt;: Parquet metadata is typically small, but for files with very many row groups or columns the footer itself can grow large. Avoiding unnecessarily small row groups keeps metadata overhead low and read planning fast.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Utilize Data Catalogs&lt;/strong&gt;: When managing a large number of Parquet files, tools like Apache Hive Metastore or Nessie Catalog can help track and manage schema evolution and metadata across multiple datasets, making query optimization more effective.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Metadata is one of the most powerful features of Parquet, enabling efficient storage and fast queries through predicate pushdown, column pruning, and row group skipping. By leveraging the rich metadata stored in Parquet files, you can drastically improve query performance and reduce the amount of data that needs to be read from disk.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore &lt;strong&gt;reading and writing Parquet files in Python&lt;/strong&gt;, using libraries like PyArrow and FastParquet to demonstrate how to work with Parquet files programmatically.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 8: &lt;strong&gt;Reading and Writing Parquet Files in Python&lt;/strong&gt;.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;author&gt;Alex Merced&lt;/author&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;All About Parquet Part 08 - Reading and Writing Parquet Files in Python&lt;/title&gt;&lt;link&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-08/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-08/&lt;/guid&gt;&lt;description&gt;
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In previous posts, we explored the internal workings of the Parquet format and how it optimizes storage and performance. Now, it&apos;s time to dive into the practical side: how to &lt;strong&gt;read and write Parquet files in Python&lt;/strong&gt;. With libraries like &lt;strong&gt;PyArrow&lt;/strong&gt; and &lt;strong&gt;FastParquet&lt;/strong&gt;, Python makes working with Parquet easy and efficient. In this post, we’ll walk through how to use these tools to handle Parquet files, covering both reading from and writing to Parquet.&lt;/p&gt;
&lt;h2&gt;Why Use Parquet in Python?&lt;/h2&gt;
&lt;p&gt;Parquet has become a go-to format for handling large datasets, especially in the data engineering and analytics world. Here’s why you might choose Parquet over other formats (like CSV or JSON) when working in Python:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Efficient Data Storage&lt;/strong&gt;: Parquet&apos;s columnar format reduces file sizes and speeds up queries, making it ideal for large datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interoperability&lt;/strong&gt;: Parquet works seamlessly with distributed data processing tools like Apache Spark, Dremio, and Hadoop, as well as cloud storage services like AWS S3.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Schema Enforcement&lt;/strong&gt;: Parquet supports structured data and schema enforcement, ensuring data consistency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compatibility&lt;/strong&gt;: Python libraries like &lt;strong&gt;PyArrow&lt;/strong&gt; and &lt;strong&gt;FastParquet&lt;/strong&gt; make it easy to integrate Parquet with popular Python data science tools like Pandas.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let’s start by looking at two of the most popular libraries for working with Parquet in Python: &lt;strong&gt;PyArrow&lt;/strong&gt; and &lt;strong&gt;FastParquet&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;PyArrow: A Complete Parquet Solution&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;PyArrow&lt;/strong&gt; is part of the Apache Arrow project and provides full support for Parquet. It’s widely used for reading and writing Parquet files and works seamlessly with other Arrow libraries. Let’s walk through how to use PyArrow to read and write Parquet files.&lt;/p&gt;
&lt;h3&gt;Installing PyArrow&lt;/h3&gt;
&lt;p&gt;First, you’ll need to install PyArrow. You can do this with pip:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install pyarrow
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Writing Parquet Files with PyArrow&lt;/h3&gt;
&lt;p&gt;Writing data to a Parquet file using PyArrow is straightforward. Below is an example of how to write a Pandas DataFrame to Parquet:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a sample DataFrame
df = pd.DataFrame({
    &apos;Name&apos;: [&apos;Alice&apos;, &apos;Bob&apos;, &apos;Charlie&apos;],
    &apos;Age&apos;: [25, 30, 35],
    &apos;Salary&apos;: [50000, 60000, 70000]
})

# Convert the DataFrame to an Arrow Table
table = pa.Table.from_pandas(df)

# Write the Arrow Table to a Parquet file
pq.write_table(table, &apos;sample.parquet&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we first create a Pandas DataFrame, convert it to an Arrow Table (using &lt;code&gt;pa.Table.from_pandas&lt;/code&gt;), and then write it to a Parquet file using &lt;code&gt;pq.write_table&lt;/code&gt;. The resulting file will be a compressed, efficient Parquet file that can be easily queried and processed.&lt;/p&gt;
&lt;h3&gt;Reading Parquet Files with PyArrow&lt;/h3&gt;
&lt;p&gt;Reading Parquet files with PyArrow is just as simple. You can load the data from a Parquet file into a Pandas DataFrame as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Read the Parquet file into an Arrow Table
table = pq.read_table(&apos;sample.parquet&apos;)

# Convert the Arrow Table to a Pandas DataFrame
df = table.to_pandas()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code reads the Parquet file into an Arrow Table using &lt;code&gt;pq.read_table&lt;/code&gt;, then converts it into a Pandas DataFrame using &lt;code&gt;to_pandas&lt;/code&gt;. You can then manipulate the data in Pandas as usual.&lt;/p&gt;
&lt;h3&gt;PyArrow and Partitioned Datasets&lt;/h3&gt;
&lt;p&gt;In many real-world use cases, especially in data lakes, Parquet files are partitioned by columns to improve query performance. PyArrow makes it easy to work with partitioned datasets:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Write partitioned Parquet files
pq.write_to_dataset(table, root_path=&apos;dataset/&apos;, partition_cols=[&apos;Age&apos;])

# Read a partitioned dataset
table = pq.ParquetDataset(&apos;dataset/&apos;).read()
df = table.to_pandas()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we write the dataset into multiple Parquet files partitioned by the Age column. You can later read the entire partitioned dataset as a single table and convert it back to a Pandas DataFrame.&lt;/p&gt;
&lt;h2&gt;FastParquet: A Lightweight Alternative&lt;/h2&gt;
&lt;p&gt;FastParquet is another popular library for working with Parquet files in Python. It’s optimized for speed and integrates well with Pandas. While PyArrow provides a more comprehensive set of features, FastParquet offers a faster and more lightweight solution for common tasks.&lt;/p&gt;
&lt;h3&gt;Installing FastParquet&lt;/h3&gt;
&lt;p&gt;You can install FastParquet using pip:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install fastparquet
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Writing Parquet Files with FastParquet&lt;/h3&gt;
&lt;p&gt;Writing a Parquet file using FastParquet is very similar to PyArrow:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pandas as pd
import fastparquet as fp

# Create a sample DataFrame
df = pd.DataFrame({
    &apos;Name&apos;: [&apos;Alice&apos;, &apos;Bob&apos;, &apos;Charlie&apos;],
    &apos;Age&apos;: [25, 30, 35],
    &apos;Salary&apos;: [50000, 60000, 70000]
})

# Write the DataFrame to a Parquet file
fp.write(&apos;sample_fp.parquet&apos;, df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we directly write the Pandas DataFrame to a Parquet file using FastParquet’s write function.&lt;/p&gt;
&lt;h3&gt;Reading Parquet Files with FastParquet&lt;/h3&gt;
&lt;p&gt;Reading Parquet files with FastParquet is just as easy:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Read the Parquet file into a Pandas DataFrame
df = fp.ParquetFile(&apos;sample_fp.parquet&apos;).to_pandas()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;FastParquet allows you to quickly load Parquet files into Pandas DataFrames, making it ideal for use in data science and analytics workflows.&lt;/p&gt;
&lt;h3&gt;FastParquet and Partitioned Datasets&lt;/h3&gt;
&lt;p&gt;FastParquet also supports reading and writing partitioned datasets:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Write partitioned Parquet files
fp.write(&apos;dataset_fp/&apos;, df, file_scheme=&apos;hive&apos;, partition_on=[&apos;Age&apos;])

# Read a partitioned dataset
df = fp.ParquetFile(&apos;dataset_fp/&apos;).to_pandas()

print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we partition the dataset by the Age column and later read it back into a Pandas DataFrame.&lt;/p&gt;
&lt;h2&gt;PyArrow vs. FastParquet: Which to Choose?&lt;/h2&gt;
&lt;p&gt;Both PyArrow and FastParquet are excellent options for working with Parquet files in Python, but they have different strengths:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PyArrow:&lt;/strong&gt; Offers full support for the Parquet format and works seamlessly with the broader Apache Arrow ecosystem. It’s the better choice for complex use cases, such as working with partitioned datasets or using advanced compression and encoding options.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;FastParquet:&lt;/strong&gt; Faster and lighter, making it a great option for simple tasks such as reading and writing Parquet files in data science workflows. It’s often more performant on small-to-medium datasets.&lt;/p&gt;
&lt;p&gt;Ultimately, the choice between the two depends on your specific use case. If you’re working with large-scale data in distributed systems or need advanced features like schema evolution or deep integration with Arrow, go with PyArrow. If you need a fast, lightweight solution for reading and writing Parquet files in day-to-day data analysis, FastParquet is a great option.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Python provides excellent libraries for reading and writing Parquet files, with PyArrow and FastParquet being two of the most popular options. Whether you need advanced features like partitioning and schema handling (PyArrow) or a lightweight, fast solution for simple file manipulation (FastParquet), both libraries offer robust support for the Parquet format.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore how Parquet fits into modern data lake architectures and how it powers data lakehouses with technologies like Apache Iceberg and Delta Lake.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 9: &lt;strong&gt;Parquet in Data Lake Architectures&lt;/strong&gt;.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;author&gt;Alex Merced&lt;/author&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;All About Parquet Part 09 - Parquet in Data Lake Architectures&lt;/title&gt;&lt;link&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-09/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-09/&lt;/guid&gt;&lt;description&gt;
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data volumes grow and the need for scalable analytics increases, &lt;strong&gt;data lakes&lt;/strong&gt; have emerged as a critical solution for organizations looking to store large datasets in their raw format. At the heart of these data lakes, &lt;strong&gt;Parquet&lt;/strong&gt; has become a go-to file format due to its efficiency, flexibility, and ability to scale with modern big data systems. In this post, we’ll explore the role of Parquet in &lt;strong&gt;data lake architectures&lt;/strong&gt;, how it powers modern &lt;strong&gt;data lakehouses&lt;/strong&gt;, and why it is so well-suited for cloud-based, distributed environments.&lt;/p&gt;
&lt;h2&gt;What is a Data Lake?&lt;/h2&gt;
&lt;p&gt;A &lt;strong&gt;data lake&lt;/strong&gt; is a centralized repository that allows organizations to store structured, semi-structured, and unstructured data at any scale. Unlike traditional databases or data warehouses, data lakes don’t enforce a strict schema on incoming data. Instead, data is ingested in its raw form, allowing for flexibility in how the data is stored and processed.&lt;/p&gt;
&lt;p&gt;The key benefits of data lakes include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Data lakes can scale to petabytes of data, making them ideal for organizations with large datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: Storing data in its raw form in cheaper storage (such as cloud object stores) is more cost-effective than using expensive relational databases or data warehouses.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Data lakes can handle a wide variety of data types, from raw logs and JSON files to structured CSV and Parquet files.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Parquet plays a crucial role in optimizing the performance of these data lakes by providing a &lt;strong&gt;highly efficient, columnar file format&lt;/strong&gt; that improves both storage and query performance.&lt;/p&gt;
&lt;h2&gt;Why Parquet is Ideal for Data Lakes&lt;/h2&gt;
&lt;p&gt;The reasons Parquet is a preferred file format for data lakes boil down to several key features:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Columnar Storage&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Parquet’s &lt;strong&gt;columnar storage model&lt;/strong&gt; allows it to store data by column rather than by row, which is particularly useful in analytics workloads. In most analytical queries, only a subset of columns is needed. Parquet’s columnar format means that only the relevant columns are read, reducing I/O and speeding up queries.&lt;/p&gt;
&lt;p&gt;For example, if your dataset contains 100 columns but you only need to run a query on 5 columns, Parquet allows you to access just those 5 columns, making queries more efficient.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Efficient Compression&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Parquet supports multiple &lt;strong&gt;compression algorithms&lt;/strong&gt; (e.g., Snappy, Gzip, Brotli), allowing it to reduce the size of large datasets stored in data lakes. Given that data lakes often store petabytes of data, reducing storage costs is a top priority. Parquet’s efficient compression helps organizations minimize storage usage without sacrificing performance.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Schema Evolution&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;As datasets in data lakes evolve, the ability to handle &lt;strong&gt;schema evolution&lt;/strong&gt; is critical. Parquet supports schema evolution, allowing new fields to be added or existing fields to be removed without requiring a complete rewrite of the data. This flexibility is essential for maintaining backward and forward compatibility as data structures change over time.&lt;/p&gt;
&lt;h3&gt;4. &lt;strong&gt;Distributed Processing Compatibility&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Data lakes are often built on top of distributed processing frameworks like &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Presto&lt;/strong&gt;, &lt;strong&gt;Dremio&lt;/strong&gt;, and &lt;strong&gt;Apache Flink&lt;/strong&gt;. Parquet is natively supported by these systems, enabling efficient processing of data stored in Parquet files. Its columnar format works well with distributed systems, allowing parallel processing of different columns and row groups across multiple nodes.&lt;/p&gt;
&lt;h3&gt;5. &lt;strong&gt;Partitioning and Predicate Pushdown&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Parquet supports &lt;strong&gt;partitioning&lt;/strong&gt;—a key feature in data lakes. Partitioning means that datasets are divided into smaller, more manageable chunks based on the values of certain columns (e.g., partitioning data by date). When queries are run on partitioned Parquet data, query engines can skip over entire partitions that do not match the query, drastically improving performance.&lt;/p&gt;
&lt;p&gt;In addition to partitioning, Parquet’s &lt;strong&gt;predicate pushdown&lt;/strong&gt; capability allows query engines to apply filters (predicates) directly at the file or row group level, avoiding the need to read unnecessary data. This is particularly useful in large-scale environments where minimizing data read is crucial to maintaining performance.&lt;/p&gt;
&lt;h2&gt;Parquet and the Data Lakehouse&lt;/h2&gt;
&lt;p&gt;In recent years, a new architecture has emerged that builds on the strengths of data lakes while addressing some of their limitations: the &lt;strong&gt;data lakehouse&lt;/strong&gt;. A data lakehouse combines the flexibility of data lakes with the performance and data management features of traditional data warehouses. Parquet plays a central role in enabling data lakehouses by serving as the foundational file format.&lt;/p&gt;
&lt;h3&gt;How Parquet Fits Into Data Lakehouses&lt;/h3&gt;
&lt;p&gt;Data lakehouses leverage Parquet to provide the following benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Transactional Capabilities&lt;/strong&gt;: Data lakehouses often use transactional layers like &lt;strong&gt;Apache Iceberg&lt;/strong&gt;, &lt;strong&gt;Delta Lake&lt;/strong&gt;, or &lt;strong&gt;Apache Hudi&lt;/strong&gt; to provide ACID (Atomicity, Consistency, Isolation, Durability) guarantees on top of the Parquet format. This allows for &lt;strong&gt;time-travel queries&lt;/strong&gt;, versioning, and consistent reads, features that are crucial for enterprise-grade data management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficient Query Performance&lt;/strong&gt;: Lakehouses use &lt;strong&gt;Parquet&lt;/strong&gt; as their default storage format due to its columnar design and compression capabilities. Combined with features like &lt;strong&gt;data reflections&lt;/strong&gt; (in Dremio) and &lt;strong&gt;materialized views&lt;/strong&gt;, Parquet files in a data lakehouse are optimized for high-performance queries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Governance&lt;/strong&gt;: Data lakehouses provide better data governance compared to traditional data lakes. Parquet, along with these additional transactional layers, allows for improved schema enforcement, auditing, and access controls, ensuring that data remains consistent and compliant with organizational policies.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Parquet with Apache Iceberg, Delta Lake, and Hudi&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;, &lt;strong&gt;Delta Lake&lt;/strong&gt;, and &lt;strong&gt;Apache Hudi&lt;/strong&gt; are all technologies that extend data lakes by adding ACID transactions, schema enforcement, and time-travel capabilities. Each of these technologies uses Parquet as a foundational file format for storing data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;: Iceberg provides table formats for managing Parquet files at scale, supporting large datasets with features like partitioning, versioned data, and fast scans.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Delta Lake&lt;/strong&gt;: Delta Lake adds ACID transactions and time-travel features to data lakes, making it easier to manage large-scale Parquet datasets with consistent reads and writes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Hudi&lt;/strong&gt;: Hudi provides transactional write operations and version management for Parquet data stored in data lakes, ensuring that data remains queryable while handling schema changes and streaming ingestion.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Best Practices for Using Parquet in Data Lakes&lt;/h2&gt;
&lt;p&gt;To get the most out of Parquet in data lake architectures, here are a few best practices:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Use Partitioning for Large Datasets&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Partitioning is essential when working with large datasets in data lakes. By partitioning data based on frequently queried columns (e.g., date, region), you can minimize the amount of data read during queries and improve overall performance.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Leverage Compression&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Compression is crucial for reducing storage costs in data lakes. Use compression algorithms like &lt;strong&gt;Snappy&lt;/strong&gt; for fast compression and decompression, or &lt;strong&gt;Gzip/Brotli&lt;/strong&gt; if you need to prioritize smaller file sizes. The choice of compression algorithm depends on the specific workload and storage requirements.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Enable Predicate Pushdown&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Predicate pushdown ensures that only relevant data is read from Parquet files during queries. Ensure that your data processing frameworks (e.g., Spark, Presto, Dremio) are configured to take advantage of predicate pushdown to skip over irrelevant data and improve query speeds.&lt;/p&gt;
&lt;h3&gt;4. &lt;strong&gt;Combine with Transactional Layers&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;If you’re building a data lakehouse, consider using &lt;strong&gt;Apache Iceberg&lt;/strong&gt;, &lt;strong&gt;Delta Lake&lt;/strong&gt;, or &lt;strong&gt;Apache Hudi&lt;/strong&gt; to add transactional capabilities on top of your Parquet data. This enables ACID compliance, versioning, and time-travel queries, which are crucial for enterprise-level data management.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Parquet has become the backbone of modern data lake architectures due to its efficiency, flexibility, and compatibility with distributed processing systems. Whether you’re building a traditional data lake or a more advanced data lakehouse, Parquet provides the foundation for scalable, high-performance data storage.&lt;/p&gt;
&lt;p&gt;In the next post, we’ll explore &lt;strong&gt;performance tuning and best practices for optimizing Parquet&lt;/strong&gt; to ensure that your data pipelines are running at their best.&lt;/p&gt;
&lt;p&gt;Stay tuned for part 10: &lt;strong&gt;Performance Tuning and Best Practices with Parquet&lt;/strong&gt;.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>All About  Parquet Part 10 - Performance Tuning and Best Practices with Parquet</title><link>https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-10/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-all-about-parquet-part-10/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Mon, 21 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=allaboutparquet&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Throughout this series, we’ve explored the many features that make &lt;strong&gt;Apache Parquet&lt;/strong&gt; a powerful and efficient file format for big data processing. In this final post, we’ll focus on &lt;strong&gt;performance tuning&lt;/strong&gt; and &lt;strong&gt;best practices&lt;/strong&gt; to help you optimize your Parquet workflows. Whether you’re working in a data lake, a data warehouse, or a data lakehouse, following these guidelines will help you get the most out of your Parquet data.&lt;/p&gt;
&lt;h2&gt;Why Performance Tuning Matters&lt;/h2&gt;
&lt;p&gt;When dealing with large datasets, even small inefficiencies can lead to significant slowdowns and increased costs. Properly tuning Parquet files can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Improve query performance&lt;/strong&gt;: By reducing the amount of data read from disk and optimizing how data is processed, you can drastically speed up analytical queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduce storage costs&lt;/strong&gt;: Compression and partitioning techniques reduce storage usage, lowering the costs associated with cloud object storage or on-premise data infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enhance scalability&lt;/strong&gt;: By optimizing how data is structured and accessed, Parquet can scale efficiently as your data grows, supporting high-performance analytics on petabytes of data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Best Practices for Optimizing Parquet&lt;/h2&gt;
&lt;p&gt;Let’s dive into some key strategies to optimize the performance of Parquet files in your data pipelines.&lt;/p&gt;
&lt;h3&gt;1. Choose the Right Row Group Size&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Row groups&lt;/strong&gt; are the primary unit of storage and processing in Parquet files. Each row group contains data for a subset of rows, stored in column chunks. Choosing the right row group size is critical for performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Larger row groups&lt;/strong&gt; reduce metadata overhead and improve read performance by reducing the number of I/O operations required to scan the data. However, larger row groups can lead to higher memory consumption during query execution.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Smaller row groups&lt;/strong&gt; allow for better parallelism and more granular data skipping but may increase the amount of metadata and result in slower queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Aim for a row group size of &lt;strong&gt;128 MB to 512 MB&lt;/strong&gt;, depending on your memory and processing resources. This range strikes a good balance between I/O efficiency and query parallelism in distributed systems like Apache Spark or Dremio.&lt;/p&gt;
&lt;h3&gt;2. Partition Your Data&lt;/h3&gt;
&lt;p&gt;Partitioning your Parquet data can significantly improve query performance by allowing query engines to &lt;strong&gt;skip over irrelevant partitions&lt;/strong&gt;. Partitioning divides a dataset into smaller files or folders based on the values of one or more columns, typically ones frequently used in queries (e.g., date, region, or product category).&lt;/p&gt;
&lt;p&gt;For example, if your dataset contains a &lt;code&gt;date&lt;/code&gt; column, partitioning by date will create folders for each date, allowing query engines to ignore entire date ranges that are not relevant to the query.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Partition data by columns that are frequently used in filters and where the cardinality (the number of distinct values) is relatively low. Over-partitioning (too many small partitions) can lead to excessive file fragmentation, while under-partitioning can result in reading too much unnecessary data.&lt;/p&gt;
&lt;h3&gt;3. Leverage Compression Wisely&lt;/h3&gt;
&lt;p&gt;Parquet supports several compression algorithms, each with different trade-offs between &lt;strong&gt;compression ratio&lt;/strong&gt;, &lt;strong&gt;speed&lt;/strong&gt;, and &lt;strong&gt;CPU usage&lt;/strong&gt;. Choosing the right compression algorithm depends on your priorities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snappy&lt;/strong&gt;: Fast compression and decompression with a moderate compression ratio. Ideal for real-time analytics and interactive queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gzip&lt;/strong&gt;: Higher compression ratio but slower, making it suitable for datasets where storage savings are prioritized over query speed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Brotli&lt;/strong&gt;: Similar to Gzip but offers better decompression performance, useful when both storage and read performance are important.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zstandard (ZSTD)&lt;/strong&gt;: Highly configurable with a good balance of speed and compression ratio, making it a strong option for both storage efficiency and performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: For most workloads, &lt;strong&gt;Snappy&lt;/strong&gt; strikes the right balance between speed and compression. Use &lt;strong&gt;Gzip&lt;/strong&gt; or &lt;strong&gt;Brotli&lt;/strong&gt; when storage costs are a major concern, and use &lt;strong&gt;ZSTD&lt;/strong&gt; if you need tunable performance to meet both storage and read requirements.&lt;/p&gt;
&lt;h3&gt;4. Use Predicate Pushdown&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Predicate pushdown&lt;/strong&gt; allows query engines to filter data at the file or row group level, reducing the amount of data that needs to be scanned. Parquet supports &lt;strong&gt;min/max statistics&lt;/strong&gt; at the column and row group level, which allows query engines to skip entire row groups or pages that do not match the query filter.&lt;/p&gt;
&lt;p&gt;For example, if your query filters for rows where the &lt;code&gt;Age&lt;/code&gt; column is greater than 30, Parquet can skip row groups where the maximum value of &lt;code&gt;Age&lt;/code&gt; is less than or equal to 30.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Ensure that your data processing frameworks (e.g., Apache Spark, Presto, Dremio) are configured to use predicate pushdown. Also, keep row group sizes large enough to ensure effective use of Parquet’s built-in statistics.&lt;/p&gt;
&lt;h3&gt;5. Optimize Encoding Strategies&lt;/h3&gt;
&lt;p&gt;Parquet supports a variety of encoding techniques that optimize how data is stored within each column, including &lt;strong&gt;dictionary encoding&lt;/strong&gt;, &lt;strong&gt;run-length encoding (RLE)&lt;/strong&gt;, and &lt;strong&gt;delta encoding&lt;/strong&gt;. The right encoding can significantly reduce file size and improve read performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dictionary encoding&lt;/strong&gt;: Great for columns with repeated values, like categorical or ID columns. Reduces storage by replacing repeated values with references to a dictionary.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run-length encoding (RLE)&lt;/strong&gt;: Ideal for columns with long runs of the same value, such as binary flags or sorted columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Delta encoding&lt;/strong&gt;: Works well for columns with values that are close together or increase in a predictable pattern, such as timestamps or IDs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Use &lt;strong&gt;dictionary encoding&lt;/strong&gt; for columns with a small number of distinct values, and &lt;strong&gt;RLE or delta encoding&lt;/strong&gt; for columns with sorted or sequential data. These optimizations can significantly reduce storage and improve query efficiency.&lt;/p&gt;
&lt;h3&gt;6. Avoid Small Files&lt;/h3&gt;
&lt;p&gt;In distributed data systems, small files can become a performance bottleneck. Each file carries metadata overhead and incurs an I/O cost to open and read, so working with too many small files can slow down query execution. This is a common issue in data lakes and lakehouses where data is ingested in small batches.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Consolidate small files into larger Parquet files whenever possible. Aim for file sizes in the range of &lt;strong&gt;128 MB to 1 GB&lt;/strong&gt;, depending on your system’s memory and processing capacity. Tools like &lt;strong&gt;Apache Spark&lt;/strong&gt; or &lt;strong&gt;Apache Hudi&lt;/strong&gt; offer mechanisms for compaction to combine small files into larger ones.&lt;/p&gt;
&lt;h3&gt;7. Monitor and Optimize Data Layout&lt;/h3&gt;
&lt;p&gt;Data layout plays a crucial role in query performance. Sorting your data by frequently queried columns can improve the effectiveness of &lt;strong&gt;min/max statistics&lt;/strong&gt; and &lt;strong&gt;predicate pushdown&lt;/strong&gt;, allowing query engines to skip irrelevant data.&lt;/p&gt;
&lt;p&gt;For example, sorting a dataset by &lt;code&gt;timestamp&lt;/code&gt; can improve the performance of time-range queries, as Parquet can quickly skip over rows outside the specified time window.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Sort your data by columns frequently used in filters or range queries. This improves the efficiency of Parquet’s statistics and query pruning mechanisms.&lt;/p&gt;
&lt;h3&gt;8. Use Transactional Layers for Consistency&lt;/h3&gt;
&lt;p&gt;In data lakehouse environments, you can use &lt;strong&gt;transactional table formats&lt;/strong&gt; like &lt;strong&gt;Apache Iceberg&lt;/strong&gt;, &lt;strong&gt;Delta Lake&lt;/strong&gt;, or &lt;strong&gt;Apache Hudi&lt;/strong&gt; on top of Parquet to enforce &lt;strong&gt;ACID (Atomicity, Consistency, Isolation, Durability)&lt;/strong&gt; transactions. These layers ensure data consistency during concurrent reads and writes, allow for schema evolution, and enable advanced features like &lt;strong&gt;time-travel queries&lt;/strong&gt; and &lt;strong&gt;snapshot isolation&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Implement a transactional table format if you need ACID guarantees, versioning, or schema management in your data lake. These layers provide additional optimization for managing large-scale Parquet data.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Parquet’s powerful combination of columnar storage, compression, and rich metadata makes it an ideal file format for large-scale data storage and analytics. By following best practices around row group sizing, partitioning, compression, and encoding, you can further optimize your Parquet workflows for both performance and cost efficiency.&lt;/p&gt;
&lt;p&gt;Whether you’re working in a cloud-based data lake, a data warehouse, or a modern data lakehouse, tuning your Parquet files ensures that your queries run faster, your storage footprint is minimized, and your data infrastructure scales effectively.&lt;/p&gt;
&lt;p&gt;This concludes our 10-part series on the Parquet file format. We hope this deep dive has given you a solid understanding of Parquet’s capabilities and how to harness them in your data engineering projects.&lt;/p&gt;
&lt;p&gt;Thank you for following along, and feel free to revisit any part of the series as you continue optimizing your Parquet workflows!&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Orchestrating Airflow DAGs with GitHub Actions - A Lightweight Approach to Data Curation Across Spark, Dremio, and Snowflake</title><link>https://iceberglakehouse.com/posts/2024-10-github-actions-dbt-airflow-data/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-github-actions-dbt-airflow-data/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Sat, 19 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=githubactionsairflow&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=githubactionsairflow&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Maintaining a persistent Airflow deployment can often add significant overhead to data engineering teams, especially when orchestrating tasks across diverse systems. While Airflow is a powerful orchestration tool, the infrastructure required to keep it running 24/7 may not always be necessary, particularly for workflows that can be triggered on demand.&lt;/p&gt;
&lt;p&gt;In this blog, we&apos;ll explore how to use &lt;strong&gt;GitHub Actions&lt;/strong&gt; as a lightweight alternative to trigger &lt;strong&gt;Airflow DAGs&lt;/strong&gt;. By leveraging GitHub Actions, we avoid the need for a persistent Airflow deployment while still orchestrating complex data pipelines across external systems like &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Dremio&lt;/strong&gt;, and &lt;strong&gt;Snowflake&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The example we&apos;ll walk through involves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ingesting raw data into a data lake with Spark,&lt;/li&gt;
&lt;li&gt;Using &lt;strong&gt;Dremio&lt;/strong&gt; and &lt;strong&gt;dbt&lt;/strong&gt; to create bronze, silver, and gold layers for your data without data replication,&lt;/li&gt;
&lt;li&gt;Accelerating access to the gold layer using &lt;strong&gt;Dremio Reflections&lt;/strong&gt;, and&lt;/li&gt;
&lt;li&gt;Ingesting the final gold-layer data into &lt;strong&gt;Snowflake&lt;/strong&gt; for further analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This approach allows you to curate data efficiently while reducing operational complexity, making it ideal for teams looking to streamline their data orchestration without sacrificing flexibility or performance.&lt;/p&gt;
&lt;h2&gt;Why Trigger Airflow with GitHub Actions?&lt;/h2&gt;
&lt;p&gt;Orchestrating workflows with Airflow traditionally requires a persistent deployment, which often involves setting up infrastructure, managing resources, and ensuring uptime. While this makes sense for teams running continuous or complex pipelines, it can become an unnecessary overhead for more straightforward, on-demand workflows.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GitHub Actions&lt;/strong&gt; offers an elegant alternative, providing a lightweight and flexible way to trigger workflows directly from your version-controlled repository. Instead of maintaining an Airflow instance 24/7, you can set up GitHub Actions to trigger the execution of your Airflow DAGs only when necessary, leveraging cloud resources efficiently.&lt;/p&gt;
&lt;p&gt;Here’s why this approach is beneficial:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduced Infrastructure Overhead&lt;/strong&gt;: By running Airflow in an ephemeral environment triggered by GitHub Actions, you eliminate the need to manage a persistent Airflow deployment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Version Control Integration&lt;/strong&gt;: Since GitHub Actions is tightly coupled with your repository, any code changes to your DAGs, dbt models, or other workflows can seamlessly trigger orchestration tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: You only spin up resources when necessary, optimizing cloud costs and avoiding expenses tied to idle infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: GitHub Actions can integrate with a variety of external systems such as Apache Spark, Dremio, and Snowflake, allowing you to trigger specific data tasks from ingestion to transformation and loading.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With GitHub Actions, your Airflow DAGs become an extension of your repository, enabling streamlined and automated data orchestration with minimal setup.&lt;/p&gt;
&lt;h2&gt;Setting up GitHub Actions to Trigger Airflow DAGs&lt;/h2&gt;
&lt;p&gt;To trigger Airflow DAGs using GitHub Actions, we need to create a workflow that runs an Airflow instance inside a Docker container, executes the DAG, and then cleans up after the job is complete. This approach avoids maintaining a persistent Airflow deployment while still enabling orchestration across different systems.&lt;/p&gt;
&lt;h3&gt;Step 1: Define Your Airflow DAG&lt;/h3&gt;
&lt;p&gt;First, ensure your Airflow DAG is defined within your repository. The DAG will contain tasks that handle each stage of your pipeline, from data ingestion with Spark to data transformation with Dremio and loading into Snowflake. Here&apos;s an example of a simple Airflow DAG definition:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator
from datetime import datetime

def run_spark_job():
    # Logic for submitting the Spark ingestion job
    pass

def run_dbt_task():
    # Logic for running dbt transformations in Dremio
    pass

def load_into_snowflake():
    # Logic for loading the gold layer into Snowflake
    pass

dag = DAG(&apos;example_dag&apos;, description=&apos;A sample DAG&apos;,
          schedule=&apos;@once&apos;, start_date=datetime(2024, 10, 1), catchup=False)

start = EmptyOperator(task_id=&apos;start&apos;, dag=dag)
spark_task = PythonOperator(task_id=&apos;run_spark_job&apos;, python_callable=run_spark_job, dag=dag)
dbt_task = PythonOperator(task_id=&apos;run_dbt_task&apos;, python_callable=run_dbt_task, dag=dag)
snowflake_task = PythonOperator(task_id=&apos;load_into_snowflake&apos;, python_callable=load_into_snowflake, dag=dag)

start &amp;gt;&amp;gt; spark_task &amp;gt;&amp;gt; dbt_task &amp;gt;&amp;gt; snowflake_task
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Create a GitHub Actions Workflow&lt;/h3&gt;
&lt;p&gt;Next, you&apos;ll define a GitHub Actions workflow that will be triggered when certain conditions are met (e.g., a pull request or a scheduled run). This workflow will launch an ephemeral Airflow environment, execute the DAG, and then shut down the environment.&lt;/p&gt;
&lt;p&gt;Here’s an example of a GitHub Actions workflow file (&lt;code&gt;.github/workflows/trigger-airflow.yml&lt;/code&gt;):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Trigger Airflow DAG

on:
  push:
    branches:
      - main

jobs:
  trigger-airflow:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Set up Docker
        uses: docker/setup-buildx-action@v2

      - name: Start Airflow with Docker
        run: |
          docker-compose up -d  # Start Airflow containers defined in a docker-compose.yml file
          sleep 30              # Give the webserver and scheduler time to initialize

      - name: Trigger Airflow DAG
        run: |
          # GitHub Actions runners have no TTY, so omit -ti when using docker exec
          docker exec &amp;lt;airflow-webserver-container&amp;gt; airflow dags trigger example_dag

      - name: Clean up
        run: |
          docker-compose down  # Stop and remove Airflow containers
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Docker Compose for Airflow&lt;/h3&gt;
&lt;p&gt;You’ll also need a &lt;code&gt;docker-compose.yml&lt;/code&gt; file in your repository to define how to launch Airflow in a containerized environment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &apos;3&apos;
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
  webserver:
    image: apache/airflow:2.5.1
    environment:
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
    ports:
      - &amp;quot;8080:8080&amp;quot;
    depends_on:
      - postgres
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this setup, the Airflow instance will run only when triggered by GitHub Actions, allowing you to execute your DAG tasks without maintaining a permanent deployment.&lt;/p&gt;
&lt;h3&gt;Step 4: Automating and Monitoring&lt;/h3&gt;
&lt;p&gt;After your workflow is triggered, GitHub Actions will orchestrate the process of starting Airflow, executing the DAG, and cleaning up. You can monitor the status of your DAGs and tasks directly within the GitHub Actions interface, making it easy to track your pipeline’s progress.&lt;/p&gt;
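&lt;p&gt;One gap worth closing in the workflow above: &lt;code&gt;airflow dags trigger&lt;/code&gt; returns immediately, so the clean-up step could tear Airflow down before the run finishes. Below is a minimal polling sketch the workflow could run before &lt;code&gt;docker-compose down&lt;/code&gt;; the container name, DAG id, and timeout are assumptions to adapt to your setup:&lt;/p&gt;

```python
import subprocess
import time


def parse_run_state(cli_output: str) -> str:
    """Reduce `airflow dags list-runs` output to a single state string."""
    for line in cli_output.splitlines():
        if "success" in line:
            return "success"
        if "failed" in line:
            return "failed"
    return "running"


def wait_for_dag(container: str, dag_id: str, timeout_s: int = 1800) -> str:
    """Poll the Airflow CLI inside the container until the run completes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = subprocess.run(
            ["docker", "exec", container, "airflow", "dags", "list-runs",
             "-d", dag_id, "-o", "plain"],
            capture_output=True, text=True,
        )
        state = parse_run_state(result.stdout)
        if state != "running":
            return state
        time.sleep(30)
    return "timeout"


# In the workflow, between the trigger step and `docker-compose down`:
#   state = wait_for_dag("airflow-webserver", "example_dag")
```

&lt;p&gt;Exiting non-zero when the final state is not &lt;code&gt;success&lt;/code&gt; also surfaces DAG failures directly in the GitHub Actions run.&lt;/p&gt;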
&lt;h2&gt;Ingesting Data into Your Lake with Apache Spark&lt;/h2&gt;
&lt;p&gt;The first step in our pipeline is ingesting raw data into a data lake using &lt;strong&gt;Apache Spark&lt;/strong&gt;. Spark, as a distributed computing engine, excels at handling large-scale data ingestion tasks, making it a popular choice for data engineering workflows. In this setup, rather than running Spark locally or within Docker containers, we’ll configure our Airflow DAG to submit PySpark jobs to a &lt;strong&gt;remote Spark cluster&lt;/strong&gt;. This cluster can be independently deployed or managed by services like &lt;strong&gt;AWS EMR&lt;/strong&gt; or &lt;strong&gt;Databricks&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Once triggered, the PySpark job will read raw data from an external source (such as a cloud storage service or database) and write the processed data into the data lake.&lt;/p&gt;
&lt;h3&gt;Step 1: Configuring the Airflow Task for Spark Ingestion&lt;/h3&gt;
&lt;p&gt;In the Airflow DAG, we define a task that submits a PySpark job to the remote Spark cluster. The code will use the PySpark library to establish a connection to the remote Spark master, perform data ingestion, and write the results into the data lake.&lt;/p&gt;
&lt;p&gt;Here’s an example of the &lt;code&gt;run_spark_job&lt;/code&gt; function that sends the Spark job to a remote cluster:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def run_spark_job():
    # Example of a Spark job for data ingestion
    from pyspark.sql import SparkSession

    # Connect to a remote Spark cluster (e.g., AWS EMR, Databricks)
    spark = SparkSession.builder \
        .master(&amp;quot;spark://&amp;lt;remote-spark-master&amp;gt;:7077&amp;quot;) \
        .appName(&amp;quot;DataIngestion&amp;quot;) \
        .getOrCreate()

    # Read data from an external source (e.g., S3 bucket)
    raw_data = spark.read.format(&amp;quot;csv&amp;quot;).option(&amp;quot;header&amp;quot;, &amp;quot;true&amp;quot;).load(&amp;quot;s3://my-bucket/raw-data&amp;quot;)

    # Write the data to the &amp;quot;bronze&amp;quot; layer of the data lake in Parquet format
    raw_data.write.format(&amp;quot;parquet&amp;quot;).save(&amp;quot;s3://my-data-lake/bronze/raw_data&amp;quot;)

    # Stop the Spark session
    spark.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connection to the Remote Cluster:&lt;/strong&gt; The master argument specifies the Spark master URL for your remote Spark cluster. This could be the endpoint of an AWS EMR cluster, Databricks, or a standalone Spark cluster.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Ingestion:&lt;/strong&gt; The task reads raw data from an external source (e.g., an S3 bucket) using Spark&apos;s read API and writes it in a columnar format (Parquet) to the bronze layer of the data lake.&lt;/li&gt;
&lt;/ul&gt;
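&lt;p&gt;One detail to watch in the ingestion job: a plain &lt;code&gt;save()&lt;/code&gt; fails if the target path already exists, so repeated runs need an explicit save mode. A small sketch of an idempotent bronze write; the &lt;code&gt;ingest_date&lt;/code&gt; partition column is an assumption:&lt;/p&gt;

```python
def write_bronze(df, path: str):
    # Overwrite mode makes re-runs idempotent; partitioning by an assumed
    # ingest_date column keeps the bronze layer organized by load day
    (df.write
       .mode("overwrite")
       .partitionBy("ingest_date")
       .format("parquet")
       .save(path))
```

&lt;p&gt;For append-style loads, &lt;code&gt;mode(&amp;quot;append&amp;quot;)&lt;/code&gt; with a date partition is a common alternative.&lt;/p&gt;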
&lt;h3&gt;Step 2: Configuring Remote Spark Cluster Access&lt;/h3&gt;
&lt;p&gt;Since we are submitting Spark jobs to a remote cluster, it&apos;s crucial that your Airflow tasks have the correct information about the Spark cluster. This configuration includes specifying the Spark master URL, cluster authentication details (if required), and any additional Spark configuration needed for interacting with the cluster (e.g., access keys for cloud storage).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;For AWS EMR:&lt;/strong&gt; You would configure Airflow to submit Spark jobs via the EMR cluster&apos;s master node. Make sure to use the appropriate security settings and AWS credentials for accessing S3 and interacting with the EMR cluster.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;For Databricks:&lt;/strong&gt; Use the Databricks REST API to submit Spark jobs, or configure the Spark session to interact with Databricks directly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here’s how you can adjust the Spark connection in Airflow for AWS EMR:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark = SparkSession.builder \
    .master(&amp;quot;yarn&amp;quot;) \
    .config(&amp;quot;spark.yarn.access.hadoopFileSystems&amp;quot;, &amp;quot;s3://&amp;lt;my-s3-bucket&amp;gt;&amp;quot;) \
    .appName(&amp;quot;DataIngestion&amp;quot;) \
    .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For Databricks, there is no &amp;quot;databricks&amp;quot; master URL; connections typically go through &lt;strong&gt;Databricks Connect&lt;/strong&gt; instead:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Databricks Connect (the databricks-connect package) builds the remote session
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.remote(
    host=&amp;quot;&amp;lt;your-workspace-url&amp;gt;&amp;quot;,
    token=&amp;quot;&amp;lt;your-databricks-token&amp;gt;&amp;quot;,
    cluster_id=&amp;quot;&amp;lt;your-cluster-id&amp;gt;&amp;quot;,
).getOrCreate()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Automating Spark Job Submission via Airflow&lt;/h3&gt;
&lt;p&gt;Once your Airflow DAG is configured to submit PySpark jobs to the remote cluster, the workflow can be triggered automatically based on events such as code changes or data availability. Airflow will take care of scheduling and running the task, ensuring that your PySpark job is executed at the right time.&lt;/p&gt;
&lt;h3&gt;Key Benefits of Using a Remote Spark Cluster:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; By offloading the job to a remote Spark cluster, you take advantage of its distributed nature, enabling large-scale data ingestion and processing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Flexibility:&lt;/strong&gt; Spark can handle a wide range of data formats (CSV, JSON, Parquet) and data sources (databases, cloud storage).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Managed Infrastructure:&lt;/strong&gt; When using managed services like AWS EMR or Databricks, you don&apos;t need to manage and maintain the Spark cluster yourself, reducing operational overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the data successfully ingested into the bronze layer of your data lake, the next step in your pipeline is to transform and curate this data using other tools such as Dremio and dbt.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: For smaller-scale data where you may not want to manage a Spark cluster, consider using the alexmerced/spark35nb image to run Spark as a container within your GitHub Actions environment.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Creating Bronze/Silver/Gold Views in Dremio with dbt&lt;/h2&gt;
&lt;p&gt;After ingesting data into the &lt;strong&gt;bronze&lt;/strong&gt; layer of your data lake using Spark, the next step is to curate and organize the data into &lt;strong&gt;silver&lt;/strong&gt; and &lt;strong&gt;gold&lt;/strong&gt; layers. This involves transforming raw data into cleaned, enriched datasets and ultimately into the most refined, ready-for-analysis forms. To avoid data duplication, we’ll leverage &lt;strong&gt;Dremio&lt;/strong&gt;&apos;s virtual datasets and &lt;strong&gt;dbt&lt;/strong&gt; for managing these transformations, while &lt;strong&gt;Dremio Reflections&lt;/strong&gt; will accelerate queries on the gold layer.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/value-of-dbt-with-dremio/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=githubactionsairflow&quot;&gt;Resources on using Dremio with dbt&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Step 1: Defining Bronze, Silver, and Gold Layers&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bronze&lt;/strong&gt;: Raw, unprocessed data that’s ingested into the lake (as achieved in the previous Spark ingestion step).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silver&lt;/strong&gt;: Cleaned and partially processed data, prepared for business analysis.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gold&lt;/strong&gt;: Fully transformed, aggregated data ready for reporting and advanced analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using &lt;strong&gt;Dremio&lt;/strong&gt;’s virtual datasets allows you to define each of these layers without physically copying data. Dremio provides a powerful semantic layer on top of your data lake, which, combined with dbt’s SQL-based transformations, enables easy curation of these layers.&lt;/p&gt;
&lt;h3&gt;Step 2: Configuring dbt to Transform Data in Dremio&lt;/h3&gt;
&lt;p&gt;We’ll use dbt (data build tool) to define the transformations that move data from the bronze layer to the silver and gold layers. This is done using SQL models in dbt, and Dremio acts as the engine that executes these transformations.&lt;/p&gt;
&lt;p&gt;Example dbt model for transforming bronze to silver:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- models/silver_layer.sql
WITH bronze_data AS (
  SELECT *
  FROM my_lake.bronze.raw_data
)
SELECT 
  customer_id,
  order_date,
  total_amount,
  -- Additional transformations
  CASE 
    WHEN total_amount &amp;gt; 100 THEN &apos;high_value&apos;
    ELSE &apos;regular&apos;
  END AS customer_value_category
FROM bronze_data
WHERE order_status = &apos;completed&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reads data from the bronze layer.&lt;/li&gt;
&lt;li&gt;Applies basic filtering and transformations.&lt;/li&gt;
&lt;li&gt;Outputs a cleaned dataset for the silver layer.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 3: Creating Virtual Views for Bronze, Silver, and Gold&lt;/h3&gt;
&lt;p&gt;In Dremio, dbt creates virtual datasets (views), meaning the data is not physically replicated at each stage of the pipeline. Instead, you define logical views that can be queried as needed. This reduces the need for storage while still allowing for efficient querying of each layer.&lt;/p&gt;
&lt;p&gt;In your Airflow DAG, you can add a task to trigger dbt transformations:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def run_dbt_task():
    import subprocess
    # Run the dbt transformation from the dbt project directory so dbt
    # can find dbt_project.yml and profiles.yml
    subprocess.run([&amp;quot;dbt&amp;quot;, &amp;quot;run&amp;quot;], cwd=&amp;quot;/usr/local/airflow/dbt_project&amp;quot;, check=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This Airflow task will run the dbt transformation, applying changes to your Dremio virtual datasets, curating the data from bronze to silver, and then from silver to gold. &lt;em&gt;(Note: Make sure your dbt project is copied to your Airflow environment, and that the dbt command is run in the directory where your dbt project is located.)&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Step 4: Accelerating Queries with Dremio Reflections&lt;/h3&gt;
&lt;p&gt;To ensure fast access to the gold layer, you can enable Dremio Reflections. Reflections are Dremio’s optimization mechanism that pre-computes and caches the results of expensive queries, significantly improving query performance on large datasets.&lt;/p&gt;
&lt;p&gt;In your pipeline, after creating the gold layer with dbt, configure Dremio to create a Raw Reflection on the gold layer dataset:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE my_lake.gold_data
  CREATE RAW REFLECTION gold_accelerator USING DISPLAY (id,lastName,firstName,address,country)
    PARTITION BY (country)
    LOCALSORT BY (lastName);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that queries on the gold layer are accelerated, reducing response times and improving the performance of downstream analytics tasks.&lt;/p&gt;
&lt;h3&gt;Step 5: Automating the Transformation Process&lt;/h3&gt;
&lt;p&gt;Assuming the dbt job is part of your Airflow DAG, you can automate the transformation process by having the dbt task run whenever the Airflow DAG is triggered by your GitHub Actions workflow. This ensures that your data is always up-to-date and ready for analysis.&lt;/p&gt;
&lt;h3&gt;Why Use dbt and Dremio for Data Transformation?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Virtualization:&lt;/strong&gt; Dremio’s virtual datasets allow you to define and query data layers without physically copying data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Transformation Management:&lt;/strong&gt; dbt’s SQL-based transformation framework simplifies defining and maintaining transformations between data layers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance Boost:&lt;/strong&gt; Dremio Reflections enable fast querying of curated data, making the process efficient for reporting and analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The gold layer is now refined and optimized for fast querying. While you can run AI/ML and BI workloads directly from Dremio, you may still want some of your gold datasets in your data warehouse, so the final step is to load this data into Snowflake for further business analysis.&lt;/p&gt;
&lt;h2&gt;Ingesting the Gold Layer into Snowflake via Dremio and Apache Arrow Flight&lt;/h2&gt;
&lt;p&gt;In this section, we’ll demonstrate how to ingest the &lt;strong&gt;gold layer&lt;/strong&gt; from &lt;strong&gt;Dremio&lt;/strong&gt; into &lt;strong&gt;Snowflake&lt;/strong&gt; using &lt;strong&gt;Apache Arrow Flight&lt;/strong&gt; and the &lt;strong&gt;dremio-simple-query&lt;/strong&gt; library. This method allows you to efficiently fetch large datasets from Dremio using Arrow Flight, convert them into a Pandas DataFrame, and load the results into Snowflake. This approach lets us leverage Dremio&apos;s fast data retrieval to feed pre-curated data into Snowflake for further analysis.&lt;/p&gt;
&lt;h3&gt;Step 1: Configuring the Airflow Task for Data Retrieval via Apache Arrow Flight&lt;/h3&gt;
&lt;p&gt;The Airflow task will connect to Dremio, retrieve the gold-layer dataset using &lt;strong&gt;Apache Arrow Flight&lt;/strong&gt;, convert the result into a Pandas DataFrame, and then ingest that data into Snowflake. Here&apos;s how you can define the Airflow task:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def load_into_snowflake():
    import snowflake.connector
    from dremio_simple_query.connect import DremioConnection
    import pandas as pd

    # Dremio connection details
    token = &amp;quot;&amp;lt;your_dremio_token&amp;gt;&amp;quot;
    arrow_endpoint = &amp;quot;grpc://&amp;lt;dremio_instance&amp;gt;:32010&amp;quot;

    # Establish connection with Dremio via Apache Arrow Flight
    dremio = DremioConnection(token, arrow_endpoint)

    # Query to fetch the gold layer dataset from Dremio
    df = dremio.toPandas(&amp;quot;SELECT * FROM my_lake.gold.final_data;&amp;quot;)

    # Connect to Snowflake
    conn = snowflake.connector.connect(
        user=&apos;&amp;lt;your_user&amp;gt;&apos;,
        password=&apos;&amp;lt;your_password&amp;gt;&apos;,
        account=&apos;&amp;lt;your_account&amp;gt;&apos;,
        warehouse=&apos;&amp;lt;your_warehouse&amp;gt;&apos;,
        database=&apos;&amp;lt;your_database&amp;gt;&apos;,
        schema=&apos;&amp;lt;your_schema&amp;gt;&apos;
    )

    # Write the gold layer data to Snowflake
    cursor = conn.cursor()

    # Ingest the data using Snowflake&apos;s PUT and COPY INTO
    # Convert the DataFrame to a CSV for ingestion (or another format supported by Snowflake)
    df.to_csv(&amp;quot;/tmp/gold_data.csv&amp;quot;, index=False)

    cursor.execute(&amp;quot;PUT file:///tmp/gold_data.csv @my_stage&amp;quot;)
    cursor.execute(&amp;quot;&amp;quot;&amp;quot;
        COPY INTO snowflake_table
        FROM @my_stage/gold_data.csv
        FILE_FORMAT = (TYPE = &apos;CSV&apos;, SKIP_HEADER = 1, FIELD_OPTIONALLY_ENCLOSED_BY = &apos;&amp;quot;&apos;);
    &amp;quot;&amp;quot;&amp;quot;)

    conn.commit()
    cursor.close()
    conn.close()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Fetching Data from Dremio with Apache Arrow Flight&lt;/h3&gt;
&lt;p&gt;The key to this approach is using Apache Arrow Flight to pull data from Dremio efficiently. The dremio-simple-query library allows you to run SQL queries against Dremio and fetch results in formats that are easy to manipulate in Python, such as Arrow Tables or Pandas DataFrames.&lt;/p&gt;
&lt;p&gt;In the Airflow task, we use the &lt;code&gt;.toPandas()&lt;/code&gt; method to retrieve the gold layer dataset as a Pandas DataFrame:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Fetch data from Dremio using Arrow Flight
df = dremio.toPandas(&amp;quot;SELECT * FROM my_lake.gold.final_data;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This method ensures fast retrieval of large datasets, which can then be directly ingested into Snowflake. Keep in mind that for extra-large datasets, you can use the &lt;code&gt;toArrow&lt;/code&gt; method to get a RecordBatchReader and process the data in batches (iterate through each batch and process it).&lt;/p&gt;
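&lt;p&gt;A sketch of that batched pattern, assuming the object returned by &lt;code&gt;toArrow()&lt;/code&gt; can be iterated batch by batch (check the dremio-simple-query documentation for the exact return type in your version):&lt;/p&gt;

```python
def count_rows(batch_sizes) -> int:
    # Pure helper: accumulate work across batches instead of holding
    # the whole dataset in memory at once
    total = 0
    for n in batch_sizes:
        total += n
    return total


def process_gold_in_batches(dremio, query: str) -> int:
    # Assumption: the Arrow stream returned by toArrow() yields
    # RecordBatches, each exposing num_rows
    reader = dremio.toArrow(query)
    sizes = []
    for batch in reader:
        # ...process each RecordBatch here (e.g., stage it to Snowflake)...
        sizes.append(batch.num_rows)
    return count_rows(sizes)
```

&lt;p&gt;Processing batch by batch keeps memory bounded even when the gold dataset is far larger than the Airflow worker&apos;s RAM.&lt;/p&gt;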
&lt;h3&gt;Step 3: Ingesting the Data into Snowflake&lt;/h3&gt;
&lt;p&gt;Once the data is retrieved from Dremio, it’s converted into a format Snowflake can ingest. In this case, we’re using CSV for simplicity, but you can use other formats (such as Parquet) supported by both Snowflake and Dremio.&lt;/p&gt;
&lt;p&gt;The ingestion process involves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Uploading the data to a Snowflake staging area using the PUT command.&lt;/li&gt;
&lt;li&gt;Copying the data into the target Snowflake table using the COPY INTO command.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;cursor.execute(&amp;quot;PUT file:///tmp/gold_data.csv @my_stage&amp;quot;)
cursor.execute(&amp;quot;&amp;quot;&amp;quot;
    COPY INTO snowflake_table
    FROM @my_stage/gold_data.csv
    FILE_FORMAT = (TYPE = &apos;CSV&apos;, SKIP_HEADER = 1,
    FIELD_OPTIONALLY_ENCLOSED_BY = &apos;&amp;quot;&apos;);
&amp;quot;&amp;quot;&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
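&lt;p&gt;As an alternative to the CSV round-trip, snowflake-connector-python ships a &lt;code&gt;write_pandas&lt;/code&gt; helper (in &lt;code&gt;snowflake.connector.pandas_tools&lt;/code&gt;) that stages a DataFrame and runs &lt;code&gt;COPY INTO&lt;/code&gt; in one call. A sketch, assuming the connector&apos;s pandas extra is installed:&lt;/p&gt;

```python
def load_df_with_write_pandas(conn, df, table_name: str):
    # write_pandas stages the DataFrame and runs COPY INTO for you;
    # auto_create_table creates the target table if it does not exist
    from snowflake.connector.pandas_tools import write_pandas

    success, _num_chunks, num_rows, _output = write_pandas(
        conn, df, table_name, auto_create_table=True
    )
    return success, num_rows
```

&lt;p&gt;This avoids the temporary CSV file and the manual PUT/COPY pair, at the cost of an extra dependency.&lt;/p&gt;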
&lt;h3&gt;Step 4: Automating the Process with GitHub Actions&lt;/h3&gt;
&lt;p&gt;As with previous steps, the ingestion process can be fully automated by making it part of your Airflow DAG. When the GitHub Actions workflow is triggered, it runs the Airflow DAG that retrieves the gold layer data from Dremio and loads it into Snowflake.&lt;/p&gt;
&lt;h3&gt;Why Use Apache Arrow Flight and Dremio for Data Ingestion?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High-performance data retrieval:&lt;/strong&gt; Apache Arrow Flight allows for fast, efficient transfer of large datasets between Dremio and external systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simplified data handling:&lt;/strong&gt; With the ability to retrieve data as Pandas DataFrames, it’s easy to manipulate and process the data before loading it into Snowflake.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seamless integration:&lt;/strong&gt; Using Arrow Flight ensures that the data transfer between Dremio and Snowflake is both high-performing and streamlined, reducing data replication and improving the overall efficiency of your pipeline.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This completes the final step of your pipeline, with the gold layer data now available in Snowflake for reporting and analytics.&lt;/p&gt;
&lt;h2&gt;Ensuring Python Libraries, Environment Variables, DAGs, and dbt Projects are Accessible in the Airflow Container&lt;/h2&gt;
&lt;p&gt;When setting up an Airflow DAG to interact with external systems like &lt;strong&gt;Dremio&lt;/strong&gt;, &lt;strong&gt;Snowflake&lt;/strong&gt;, and &lt;strong&gt;Apache Spark&lt;/strong&gt;, it&apos;s essential to properly configure the environment to ensure all Python libraries, environment variables, DAG files, and &lt;strong&gt;dbt&lt;/strong&gt; projects are accessible inside the Dockerized Airflow environment. This section will guide you through configuring your environment to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install required Python libraries using &lt;strong&gt;Docker Compose&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Ensure &lt;strong&gt;environment variables&lt;/strong&gt; (e.g., API tokens, credentials) are securely passed from &lt;strong&gt;GitHub Secrets&lt;/strong&gt; to Docker containers.&lt;/li&gt;
&lt;li&gt;Make &lt;strong&gt;DAGs&lt;/strong&gt; from your repository available to the Airflow container.&lt;/li&gt;
&lt;li&gt;Copy and configure your &lt;strong&gt;dbt project&lt;/strong&gt; inside the Airflow environment.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 1: Setting up Docker Compose to Include Python Libraries&lt;/h3&gt;
&lt;p&gt;To ensure that all necessary Python dependencies (e.g., &lt;code&gt;dremio-simple-query&lt;/code&gt;, &lt;code&gt;snowflake-connector-python&lt;/code&gt;, &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;dbt-core&lt;/code&gt;, &lt;code&gt;pyspark&lt;/code&gt;) are installed in the Airflow container, you can extend the default Airflow image using a custom &lt;code&gt;Dockerfile&lt;/code&gt; and adjust your &lt;code&gt;docker-compose.yml&lt;/code&gt; configuration to use this image.&lt;/p&gt;
&lt;p&gt;Here’s an example &lt;code&gt;Dockerfile&lt;/code&gt; that installs the required libraries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-Dockerfile&quot;&gt;# Use the official Airflow image
FROM apache/airflow:2.5.1

# Install required Python libraries
RUN pip install dremio-simple-query==&amp;lt;version&amp;gt; snowflake-connector-python==&amp;lt;version&amp;gt; pandas==&amp;lt;version&amp;gt; dbt-core==&amp;lt;version&amp;gt; pyspark==&amp;lt;version&amp;gt;

# Copy the dbt project into the container
COPY ./dbt_project /usr/local/airflow/dbt_project
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your &lt;code&gt;docker-compose.yml file&lt;/code&gt;, reference this custom image to ensure the Airflow container includes all the necessary dependencies:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &apos;3&apos;
services:
  webserver:
    build: 
      context: .
      dockerfile: Dockerfile  # Use the custom Dockerfile for the Airflow container
    environment:
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
    ports:
      - &amp;quot;8080:8080&amp;quot;
    volumes:
      - ./dags:/opt/airflow/dags  # Mount the DAGs from the local folder
    depends_on:
      - postgres
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that all required Python libraries and the dbt project are installed in the Airflow container, allowing it to execute tasks involving Dremio, Spark, Snowflake, and dbt.&lt;/p&gt;
&lt;h3&gt;Step 2: Configuring Environment Variables from GitHub Secrets&lt;/h3&gt;
&lt;p&gt;To securely pass environment variables, such as API tokens or credentials, from your GitHub repository’s secrets to the Docker container, you can use GitHub Actions. These variables may be required to connect to external services such as Dremio, Snowflake, and your Spark cluster.&lt;/p&gt;
&lt;h4&gt;Steps:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Store the necessary secrets in GitHub Secrets (e.g., DREMIO_TOKEN, SNOWFLAKE_USER, SPARK_MASTER).&lt;/li&gt;
&lt;li&gt;Pass these secrets to the Airflow container by configuring your GitHub Actions workflow.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here’s an example of how to pass environment variables using GitHub Actions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  trigger-airflow:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Set up Docker
        uses: docker/setup-buildx-action@v2

      - name: Run Airflow containers
        run: |
          docker-compose up -d
        env:
          DREMIO_TOKEN: ${{ secrets.DREMIO_TOKEN }}
          SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
          SPARK_MASTER: ${{ secrets.SPARK_MASTER }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your &lt;code&gt;docker-compose.yml&lt;/code&gt;, ensure the container receives these environment variables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  webserver:
    environment:
      - DREMIO_TOKEN=${DREMIO_TOKEN}
      - SNOWFLAKE_USER=${SNOWFLAKE_USER}
      - SNOWFLAKE_PASSWORD=${SNOWFLAKE_PASSWORD}
      - SPARK_MASTER=${SPARK_MASTER}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that all required credentials and tokens are securely passed to the container environment and accessible during Airflow task execution.&lt;/p&gt;
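&lt;p&gt;Inside your DAG code, those injected variables can then be read with &lt;code&gt;os.environ&lt;/code&gt; instead of hardcoding credentials. A small sketch using the variable names from the compose file above:&lt;/p&gt;

```python
import os


def snowflake_credentials() -> dict:
    # Read the values docker-compose passes through from GitHub Secrets;
    # raises KeyError early if a secret was not configured
    return {
        "user": os.environ["SNOWFLAKE_USER"],
        "password": os.environ["SNOWFLAKE_PASSWORD"],
    }


def dremio_token() -> str:
    return os.environ["DREMIO_TOKEN"]
```

&lt;p&gt;Failing fast on a missing variable is usually preferable to connecting with an empty credential later in the task.&lt;/p&gt;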
&lt;h3&gt;Step 3: Mounting DAGs from the Repository into the Airflow Container&lt;/h3&gt;
&lt;p&gt;To ensure that your DAGs from the repository are available in the Airflow container, you need to mount the DAGs folder from the local environment into the container. This is achieved through volume mapping in your docker-compose.yml file.&lt;/p&gt;
&lt;p&gt;Example configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  webserver:
    volumes:
      - ./dags:/opt/airflow/dags  # Mounts the local DAGs folder into the Airflow container
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that any DAGs in your GitHub repository are made available to the Airflow container and can be automatically picked up for execution. In GitHub Actions, ensure that the repository is checked out before running the docker-compose up command to make sure the latest DAGs are present.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;- name: Checkout repository
  uses: actions/checkout@v3

- name: Run Airflow containers
  run: |
    docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Copying the dbt Project and Creating dbt Profiles Using Environment Variables&lt;/h3&gt;
&lt;p&gt;To enable Airflow to run dbt models, you need to copy the dbt project into the Airflow container and configure the dbt profiles.yml file with the necessary environment variables (such as credentials for Dremio or Snowflake).&lt;/p&gt;
&lt;h4&gt;Copying the dbt Project:&lt;/h4&gt;
&lt;p&gt;In your Dockerfile, ensure that the COPY command includes your dbt project directory, as shown below:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-Dockerfile&quot;&gt;# Copy the dbt project into the container
COPY ./dbt_project /usr/local/airflow/dbt_project
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;note: You could also use a volume mapping in your docker-compose.yml file to achieve the same result.&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;Creating the dbt Profile:&lt;/h4&gt;
&lt;p&gt;The dbt &lt;code&gt;profiles.yml&lt;/code&gt; file typically contains the connection details for the database or data warehouse you&apos;re working with. You can generate this file dynamically inside the Airflow container, using environment variables passed from GitHub Secrets.&lt;/p&gt;
&lt;p&gt;Example of creating a dbt &lt;code&gt;profiles.yml&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os

def create_dbt_profile():
    # Build profiles.yml from the environment variables injected via GitHub Secrets;
    # the YAML is written unindented so it parses as a top-level mapping
    profiles_content = f&amp;quot;&amp;quot;&amp;quot;dremio:
  target: dev
  outputs:
    dev:
      type: odbc
      driver: Dremio ODBC Driver
      host: {os.getenv(&apos;DREMIO_HOST&apos;)}
      port: 31010
      user: {os.getenv(&apos;DREMIO_USER&apos;)}
      password: {os.getenv(&apos;DREMIO_PASSWORD&apos;)}
      database: {os.getenv(&apos;DREMIO_DB&apos;)}
&amp;quot;&amp;quot;&amp;quot;
    with open(&apos;/usr/local/airflow/dbt_project/profiles.yml&apos;, &apos;w&apos;) as file:
        file.write(profiles_content)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ensure this function is called before running any dbt models in your Airflow DAG to dynamically create the profiles.yml with the correct environment variables for each environment.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;By following these steps, you ensure that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All required Python libraries (for Dremio, Snowflake, dbt, etc.) are installed in the Airflow container via a custom Docker image.&lt;/li&gt;
&lt;li&gt;Environment variables (such as API tokens and credentials) are securely passed from GitHub Secrets to the container environment.&lt;/li&gt;
&lt;li&gt;DAG files and dbt projects from your repository are mounted and configured in the Airflow container.&lt;/li&gt;
&lt;li&gt;The dbt profiles are dynamically created using environment variables for flexibility across environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With this setup, your Airflow container is fully configured to run your data pipeline, handling tasks from Spark data ingestion to dbt transformations and Snowflake loading, all triggered by GitHub Actions.&lt;/p&gt;
&lt;h2&gt;Optimizing Performance of the Workflow&lt;/h2&gt;
&lt;p&gt;When building a data pipeline using &lt;strong&gt;GitHub Actions&lt;/strong&gt; to trigger &lt;strong&gt;Airflow DAGs&lt;/strong&gt;, it’s important to ensure that your workflow is not only functional but also optimized for performance. This becomes especially critical when dealing with large datasets or complex workflows involving multiple external systems like &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Dremio&lt;/strong&gt;, and &lt;strong&gt;Snowflake&lt;/strong&gt;. In this section, we will explore several strategies to optimize the performance of your GitHub Actions workflow, from reducing unnecessary triggers to improving the efficiency of your data processing tasks.&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Use Caching to Avoid Rebuilding Images&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;One of the biggest performance bottlenecks in Docker-based workflows is the repeated building of Docker images each time the workflow is triggered. To optimize performance, you can use GitHub Actions’ built-in &lt;strong&gt;caching&lt;/strong&gt; feature to cache dependencies and intermediate stages of your Docker builds. This avoids having to rebuild your container and re-download libraries every time the workflow runs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;How to implement caching:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cache Docker layers&lt;/strong&gt;: Cache Docker build layers to speed up image builds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache Python dependencies&lt;/strong&gt;: If your workflow installs Python libraries (e.g., &lt;code&gt;dremio-simple-query&lt;/code&gt;, &lt;code&gt;snowflake-connector-python&lt;/code&gt;), cache the &lt;code&gt;pip&lt;/code&gt; packages between runs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example using Docker layer caching in GitHub Actions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2

      - name: Cache Docker layers
        uses: actions/cache@v3
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-

      - name: Build Docker image with cache
        run: |
          # buildx understands local cache import/export; plain docker-compose build does not
          docker buildx build \
            --cache-from type=local,src=/tmp/.buildx-cache \
            --cache-to type=local,dest=/tmp/.buildx-cache \
            --load -t airflow-custom .
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Benefits:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduces build times:&lt;/strong&gt; Speeds up the workflow by avoiding the need to rebuild the entire Docker image or re-install Python dependencies each time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improves CI efficiency:&lt;/strong&gt; Caching is particularly useful for speeding up continuous integration processes where workflows are frequently triggered.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Trigger Workflows Only When Necessary&lt;/h3&gt;
&lt;p&gt;To avoid unnecessary executions of your workflow, ensure that the pipeline only runs when changes relevant to the DAG or pipeline configuration are made. This can be achieved by using conditional triggers or path filters in your GitHub Actions workflow file. By narrowing down the workflow to run only when critical files (e.g., DAG files, configuration files) are changed, you reduce unnecessary execution and improve overall performance.&lt;/p&gt;
&lt;p&gt;Example of filtering based on path:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;on:
  push:
    branches:
      - main
    paths:
      - &apos;dags/**&apos;
      - &apos;dbt/**&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Benefits:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Avoids unnecessary runs:&lt;/strong&gt; Prevents the workflow from running when unrelated files are changed, conserving computational resources.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimizes execution frequency:&lt;/strong&gt; Ensures that the pipeline only runs when relevant changes occur, reducing the load on your infrastructure.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Parallelize Tasks in the DAG&lt;/h3&gt;
&lt;p&gt;If your DAG contains multiple independent tasks (e.g., ingesting data with Spark, transforming data with Dremio, and loading data into Snowflake), you can improve performance by running these tasks in parallel. Airflow natively supports task parallelization, and you can configure it to run tasks concurrently to reduce the overall runtime of your workflow.&lt;/p&gt;
&lt;p&gt;Example of parallel tasks in an Airflow DAG:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def task_1():
    # Task 1 logic here
    pass

def task_2():
    # Task 2 logic here
    pass

dag = DAG(&apos;my_dag&apos;, start_date=datetime(2024, 1, 1), schedule_interval=&apos;@daily&apos;)

# Define parallel tasks
t1 = PythonOperator(task_id=&apos;task_1&apos;, python_callable=task_1, dag=dag)
t2 = PythonOperator(task_id=&apos;task_2&apos;, python_callable=task_2, dag=dag)

# With no dependencies declared between them, Airflow schedules t1 and t2 in parallel
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Benefits:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Reduces workflow runtime:&lt;/strong&gt; Parallelizing tasks that don’t depend on each other cuts down the total time required to complete the workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Allows your workflow to scale as you add more tasks, without increasing the runtime proportionally.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Optimize Data Processing with Arrow and Dremio Reflections&lt;/h3&gt;
&lt;p&gt;For workflows involving large datasets, it’s important to optimize how data is processed and moved between systems. In this pipeline, you’re leveraging Apache Arrow Flight and Dremio Reflections to efficiently retrieve and accelerate access to large datasets. To further optimize performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use Arrow Flight to transport large datasets between Dremio and external systems, minimizing serialization/deserialization overhead.&lt;/li&gt;
&lt;li&gt;Enable Dremio Reflections on your datasets, particularly for gold-layer data, to accelerate queries and transformations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example of creating a Dremio Reflection:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE my_lake.gold.final_dataset
  CREATE RAW REFLECTION gold_accelerator USING DISPLAY (id, lastName, firstName, address, country)
    PARTITION BY (country)
    LOCALSORT BY (lastName);
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Benefits:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Minimizes data transfer latency:&lt;/strong&gt; Apache Arrow Flight reduces the overhead of transferring large datasets, improving overall workflow performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Speeds up queries:&lt;/strong&gt; Dremio Reflections cache results, reducing the time required for frequently executed queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Limit Resource Usage in GitHub Actions&lt;/h3&gt;
&lt;p&gt;Finally, you can optimize the performance of the workflow by managing resource usage in GitHub Actions. This includes specifying appropriate runner types (e.g., &lt;code&gt;ubuntu-latest&lt;/code&gt; or custom runners) and limiting the number of jobs that run concurrently to avoid hitting resource limits.&lt;/p&gt;
&lt;p&gt;Example of limiting concurrency in GitHub Actions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;concurrency:
  group: my-workflow-${{ github.ref }}
  cancel-in-progress: true
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Benefits:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Avoids resource contention:&lt;/strong&gt; By controlling concurrency, you prevent multiple runs of the same workflow from competing for resources, improving stability and performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient use of runners:&lt;/strong&gt; Ensures that the workflow uses only the necessary resources, reducing cost and improving execution efficiency.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Optimizing the performance of your GitHub Actions-triggered Airflow workflow involves a combination of techniques such as caching, selective triggering, task parallelization, data transport optimization, and concurrency controls. By implementing these strategies, you can ensure that your pipeline runs efficiently, scales with increasing data or complexity, and delivers fast results across data systems like Apache Spark, Dremio, and Snowflake.&lt;/p&gt;
&lt;h2&gt;Troubleshooting Considerations&lt;/h2&gt;
&lt;p&gt;When building a complex data pipeline using GitHub Actions to trigger Airflow DAGs and interact with external systems such as Dremio, Snowflake, and Apache Spark, you may encounter issues related to dependencies, environment variables, and connectivity. This section outlines key troubleshooting considerations to ensure your workflow runs smoothly and is properly configured.&lt;/p&gt;
&lt;h3&gt;1. Ensuring All Python Libraries Are Installed&lt;/h3&gt;
&lt;p&gt;One of the most common issues in Dockerized Airflow environments is missing Python libraries. If the necessary dependencies (e.g., &lt;code&gt;dremio-simple-query&lt;/code&gt;, &lt;code&gt;snowflake-connector-python&lt;/code&gt;, &lt;code&gt;pandas&lt;/code&gt;, etc.) are not installed, your DAG tasks will fail during execution.&lt;/p&gt;
&lt;h4&gt;Steps to Troubleshoot:&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Check Dockerfile:&lt;/strong&gt; Ensure all required Python libraries are listed in your &lt;code&gt;Dockerfile&lt;/code&gt;. If a package is missing, add it and rebuild the Docker image.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-Dockerfile&quot;&gt;RUN pip install dremio-simple-query==&amp;lt;version&amp;gt; snowflake-connector-python==&amp;lt;version&amp;gt; pandas==&amp;lt;version&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Common Gotchas:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dependency Conflicts:&lt;/strong&gt; Ensure that library versions are compatible with each other to avoid conflicts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rebuild Docker Image:&lt;/strong&gt; After making changes to the Dockerfile, don’t forget to rebuild the Docker image and restart the containers.&lt;/li&gt;
&lt;/ul&gt;
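&lt;p&gt;To confirm the rebuilt image actually contains everything your DAGs import, you can run a quick check inside the container. This is a generic sketch; the module names listed are examples and should be replaced with the modules your tasks actually use:&lt;/p&gt;

```python
import importlib

def find_missing(modules):
    """Return the subset of modules that fail to import."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Substitute the modules your DAG tasks actually import.
unavailable = find_missing(["pandas", "snowflake.connector", "pyarrow"])
if unavailable:
    print("Missing modules: " + ", ".join(unavailable))
```

Running this as a one-off task or at container startup surfaces missing dependencies immediately, instead of at DAG runtime.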
&lt;h3&gt;2. Ensuring Environment Variables Are Passed Correctly&lt;/h3&gt;
&lt;p&gt;Incorrect or missing environment variables can lead to issues connecting to external systems like Dremio, Snowflake, or Apache Spark. These variables often include API tokens, usernames, passwords, and endpoints.&lt;/p&gt;
&lt;h4&gt;Steps to Troubleshoot:&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Check docker-compose.yml:&lt;/strong&gt; Make sure all necessary environment variables are listed in the environment section of the &lt;code&gt;docker-compose.yml&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;environment:
  - DREMIO_TOKEN=${DREMIO_TOKEN}
  - SNOWFLAKE_USER=${SNOWFLAKE_USER}
  - SNOWFLAKE_PASSWORD=${SNOWFLAKE_PASSWORD}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Check GitHub Secrets:&lt;/strong&gt; Verify that GitHub Secrets are correctly passed to the workflow. If a secret is missing, update the GitHub repository&apos;s Secrets settings and ensure they are referenced properly in the GitHub Actions workflow.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;env:
  DREMIO_TOKEN: ${{ secrets.DREMIO_TOKEN }}
  SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Log Environment Variables:&lt;/strong&gt; Temporarily log environment variables in the Airflow task to verify that they are being passed correctly:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os
# Print only whether the variable is set, so the token itself never lands in the logs
print(&apos;DREMIO_TOKEN set:&apos;, bool(os.getenv(&apos;DREMIO_TOKEN&apos;)))
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Common Gotchas:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Variable Case Sensitivity:&lt;/strong&gt; Ensure the environment variables in the &lt;code&gt;docker-compose.yml&lt;/code&gt; file match exactly with those in GitHub Secrets, as they are case-sensitive.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Missing Secrets:&lt;/strong&gt; If a required secret is not present, the workflow will fail. Double-check that all secrets are configured in GitHub.&lt;/li&gt;
&lt;/ul&gt;
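&lt;p&gt;A small startup check can fail fast with a clear message instead of an opaque connection error later in the pipeline. A minimal sketch; the variable names below come from the examples above and should be adjusted to your setup:&lt;/p&gt;

```python
import os

def missing_env_vars(names):
    """Return any environment variables that are unset or empty."""
    return [n for n in names if not os.getenv(n)]

# Variable names taken from the docker-compose example above.
required = ["DREMIO_TOKEN", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```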
&lt;h3&gt;3. Ensuring DAGs Are Visible to the Airflow Container&lt;/h3&gt;
&lt;p&gt;If Airflow doesn’t detect your DAGs, it could be due to incorrect volume mounting or file path issues in the Docker setup.&lt;/p&gt;
&lt;h4&gt;Steps to Troubleshoot:&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Check Volume Mounting:&lt;/strong&gt; Ensure that the &lt;code&gt;dags&lt;/code&gt; directory is correctly mounted in the &lt;code&gt;docker-compose.yml&lt;/code&gt; file.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  webserver:
    volumes:
      - ./dags:/opt/airflow/dags
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Check DAG Folder Structure:&lt;/strong&gt; Ensure that your DAGs are in the correct folder structure inside the repository. The &lt;code&gt;dags&lt;/code&gt; directory should be at the root level of your project and contain &lt;code&gt;.py&lt;/code&gt; files defining the DAGs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Restart the Airflow Webserver:&lt;/strong&gt; Sometimes, new DAGs are not detected until the Airflow webserver is restarted (e.g., &lt;code&gt;docker-compose restart webserver&lt;/code&gt;, using the service name from your &lt;code&gt;docker-compose.yml&lt;/code&gt;).&lt;/p&gt;
&lt;h4&gt;Common Gotchas:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Incorrect File Paths:&lt;/strong&gt; If DAGs are not mounted properly, Airflow won’t be able to find them. Double-check the volume path in &lt;code&gt;docker-compose.yml&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DAG Parsing Errors:&lt;/strong&gt; Check for syntax errors in your DAGs that may prevent Airflow from loading them.&lt;/li&gt;
&lt;/ul&gt;
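&lt;p&gt;You can catch pure syntax errors before Airflow ever scans the folder by parsing each DAG file yourself. A minimal sketch, assuming your DAGs live under a &lt;code&gt;dags&lt;/code&gt; directory; note that it will not catch import-time failures, which require actually executing the file:&lt;/p&gt;

```python
import ast
from pathlib import Path

def dag_syntax_errors(dag_dir="dags"):
    """Parse every .py file under dag_dir and collect any syntax errors."""
    errors = {}
    for path in Path(dag_dir).rglob("*.py"):
        try:
            ast.parse(path.read_text(), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = "line {}: {}".format(exc.lineno, exc.msg)
    return errors

for path, message in dag_syntax_errors().items():
    print(path + ": " + message)
```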
&lt;h3&gt;4. Ensuring PySpark Scripts Connect to the Remote Spark Cluster&lt;/h3&gt;
&lt;p&gt;If your Airflow DAG includes tasks that run PySpark scripts, you need to ensure that the scripts have the correct information to connect to a remote Spark cluster.&lt;/p&gt;
&lt;h4&gt;Steps to Troubleshoot:&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Check Spark Configuration:&lt;/strong&gt; Verify that the Spark configuration in your PySpark script points to the correct Spark master node (either in standalone mode or on a distributed cluster like YARN or Kubernetes):&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark = SparkSession.builder \
  .master(&amp;quot;spark://&amp;lt;remote-spark-master&amp;gt;:7077&amp;quot;) \
  .appName(&amp;quot;MyApp&amp;quot;) \
  .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Pass Spark Configuration via Environment Variables:&lt;/strong&gt; If you’re dynamically assigning the Spark master or other settings, ensure that these values are passed as environment variables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark_master = os.getenv(&apos;SPARK_MASTER&apos;, &apos;spark://&amp;lt;default-spark-master&amp;gt;:7077&apos;)
spark = SparkSession.builder.master(spark_master).getOrCreate()
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Common Gotchas:&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Incorrect Spark Master URL:&lt;/strong&gt; If the Spark master URL is incorrect or inaccessible, the PySpark script will fail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Firewall Issues:&lt;/strong&gt; Make sure there are no firewall rules blocking communication between the Airflow container and the Spark cluster.&lt;/p&gt;
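&lt;p&gt;A quick TCP reachability test run from inside the Airflow container helps distinguish a wrong master URL from a blocked port. A minimal sketch; the host and port below are placeholders for your own Spark master:&lt;/p&gt;

```python
import socket

def port_reachable(host, port, timeout=5):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder address; substitute your Spark master host and port.
print(port_reachable("spark-master", 7077))
```

If this returns False while the URL is correct, look at firewall rules or Docker network configuration between the containers.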
&lt;h3&gt;5. Other Possible Gotchas&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;File Permissions:&lt;/strong&gt; If you’re accessing local files (e.g., configuration files or data), ensure that the file permissions allow access from within the container. You can fix this by adjusting the file permissions in your local environment or specifying correct permissions when mounting volumes.&lt;/p&gt;
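&lt;p&gt;A simple readability probe from inside the container confirms whether a mounted file is actually accessible. The path below is a hypothetical example; substitute a file your tasks read:&lt;/p&gt;

```python
import os

# Hypothetical mounted path; substitute the file your task reads.
config_path = "/opt/airflow/config/settings.yaml"

if os.access(config_path, os.R_OK):
    print(config_path + " is readable")
else:
    print(config_path + " is missing or unreadable: check volume mounts and host permissions")
```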
&lt;p&gt;&lt;strong&gt;Container Resource Limits:&lt;/strong&gt; If the Airflow container or any associated services (e.g., Spark, Dremio, Snowflake) are consuming too much memory or CPU, they might hit resource limits and cause the workflow to fail. Check your Docker resource allocation settings and ensure you’ve allocated sufficient resources to each service.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Airflow Scheduler Issues:&lt;/strong&gt; If your DAG is not running even though it’s visible in the UI, the issue could be with the Airflow scheduler. Ensure the scheduler container is running and inspect its logs for errors (e.g., &lt;code&gt;docker-compose logs scheduler&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Transfer Bottlenecks:&lt;/strong&gt; If your pipeline involves moving large datasets (e.g., between Dremio and Snowflake), ensure that you’re using efficient formats like Parquet and leveraging high-performance data transport protocols like Apache Arrow Flight.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;By following these troubleshooting guidelines, you can identify and resolve common issues related to Python dependencies, environment variables, DAG visibility, PySpark configuration, and other potential gotchas. Ensuring that your environment is properly set up and configured will help you run your GitHub Actions-triggered Airflow workflows smoothly and efficiently.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;The example provided in this blog serves as an &lt;strong&gt;illustrative guide&lt;/strong&gt; to show how you can trigger Airflow DAGs using GitHub Actions to orchestrate data pipelines that integrate external systems like &lt;strong&gt;Apache Spark&lt;/strong&gt;, &lt;strong&gt;Dremio&lt;/strong&gt;, and &lt;strong&gt;Snowflake&lt;/strong&gt;. While the steps outlined offer a practical starting point, it&apos;s important to recognize that this pattern can be &lt;strong&gt;customized and expanded&lt;/strong&gt; to meet your specific data workflow requirements.&lt;/p&gt;
&lt;p&gt;Every data pipeline has unique characteristics depending on the nature of the data, the scale of processing, and the systems involved. Whether you&apos;re dealing with more complex DAGs, additional external systems, or specialized configurations, this guide can serve as the foundation for implementing your own tailored solution.&lt;/p&gt;
&lt;p&gt;Key optimizations, such as efficient data transport with &lt;strong&gt;Apache Arrow Flight&lt;/strong&gt;, distributed task execution with &lt;strong&gt;Airflow Executors&lt;/strong&gt;, and performance improvements through &lt;strong&gt;Dremio Reflections&lt;/strong&gt;, are flexible tools that can be adjusted to meet the scale and performance needs of your project. Additionally, the GitHub Actions workflow can be adapted to trigger the pipeline on various events, such as code changes or scheduled jobs, giving you full control over your pipeline orchestration.&lt;/p&gt;
&lt;p&gt;By starting with this example and iterating based on your organization&apos;s specific needs, you can build a scalable, cost-effective, and performant data orchestration pipeline. Whether you’re managing data lake transformations, synchronizing with cloud warehouses, or running periodic ingestion jobs, this workflow pattern can be a valuable framework for your data operations.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Deep Dive Into GitHub Actions From Software Development to Data Engineering</title><link>https://iceberglakehouse.com/posts/2024-10-intro-to-github-actions/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-intro-to-github-actions/</guid><description>
- [Free Copy of Apache Iceberg the Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;u...</description><pubDate>Sat, 19 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=githubactionsintro&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=githubactionsintro&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;GitHub Actions is widely recognized as a powerful tool for automating tasks in software development. It&apos;s commonly used for tasks like running tests, building applications, and deploying to production environments. However, the true potential of GitHub Actions extends far beyond software development. Whether you&apos;re orchestrating complex data pipelines, automating ETL jobs, or even generating reports, GitHub Actions offers a flexible and scalable solution.&lt;/p&gt;
&lt;p&gt;In this blog, we&apos;ll dive deep into how GitHub Actions can be used not just in traditional CI/CD pipelines but also across various data engineering workflows. By the end, you&apos;ll understand how to leverage GitHub Actions to automate processes from software development to data engineering, unlocking new efficiencies and streamlining tasks you may not have realized could be automated. Let&apos;s explore the possibilities!&lt;/p&gt;
&lt;h2&gt;What is GitHub Actions?&lt;/h2&gt;
&lt;p&gt;GitHub Actions is a platform that allows developers to automate workflows directly within their GitHub repositories. It integrates seamlessly with GitHub, allowing you to trigger workflows based on events like pushes, pull requests, or even on a schedule. It’s essentially a CI/CD tool built directly into GitHub, but its use cases go far beyond just continuous integration and continuous deployment.&lt;/p&gt;
&lt;h3&gt;Key Components of GitHub Actions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Workflows&lt;/strong&gt;: A collection of jobs, defined in YAML files, that automate tasks in your repository. Each workflow can be triggered by different events (like code pushes) and is stored in the &lt;code&gt;.github/workflows&lt;/code&gt; directory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Jobs&lt;/strong&gt;: A job is a set of steps that execute on the same runner. Jobs are executed in parallel by default, though they can be configured to run sequentially if needed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Steps&lt;/strong&gt;: These are individual tasks that make up a job. Each step can run commands, scripts, or use actions to complete a specific part of the job.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Actions&lt;/strong&gt;: Actions are reusable components that allow you to automate specific tasks within your workflow. These can be pre-built (available in the GitHub Marketplace) or custom actions that you define yourself.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Events&lt;/strong&gt;: These are triggers that start a workflow. They can be based on GitHub events (e.g., a push to a branch, opening a pull request) or scheduled to run at specific intervals.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Why Use GitHub Actions?&lt;/h3&gt;
&lt;p&gt;GitHub Actions offers a simple yet powerful way to automate tasks without the need for external tools. It reduces friction by eliminating context switching between different CI/CD platforms, and since it’s tightly integrated with GitHub, it allows for streamlined automation across the development lifecycle.&lt;/p&gt;
&lt;p&gt;When compared to other CI/CD tools like Jenkins or CircleCI, GitHub Actions stands out for its ease of use, flexibility, and the ability to run workflows directly within your GitHub repository. Whether you&apos;re working on a small open-source project or managing large enterprise-scale pipelines, GitHub Actions provides a scalable solution for automation.&lt;/p&gt;
&lt;h2&gt;Core Use Cases in Software Development&lt;/h2&gt;
&lt;p&gt;GitHub Actions is often leveraged for automating common software development tasks, making it an essential tool for streamlining CI/CD workflows. Let&apos;s explore some of the core use cases where GitHub Actions excels in software development.&lt;/p&gt;
&lt;h3&gt;Automating Testing and Building Code&lt;/h3&gt;
&lt;p&gt;One of the most popular uses of GitHub Actions is to automate the testing process. Every time a developer pushes new code or creates a pull request, GitHub Actions can automatically trigger tests to ensure that the new changes don’t break any existing functionality. This not only increases confidence in the code but also speeds up the feedback loop for developers.&lt;/p&gt;
&lt;p&gt;For example, you can set up workflows to run:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Unit tests for verifying individual functions or components.&lt;/li&gt;
&lt;li&gt;Integration tests to ensure different parts of your application work together.&lt;/li&gt;
&lt;li&gt;End-to-end tests to simulate real user scenarios.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Additionally, GitHub Actions can automate the building of code. This includes compiling source code, generating binaries, or packaging applications, making sure your application is always ready for deployment.&lt;/p&gt;
&lt;h3&gt;Continuous Integration (CI)&lt;/h3&gt;
&lt;p&gt;GitHub Actions enables continuous integration by automating the testing and merging of code changes. When developers push new changes, GitHub Actions can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automatically pull the latest code.&lt;/li&gt;
&lt;li&gt;Run predefined tests to verify the changes.&lt;/li&gt;
&lt;li&gt;Merge the code into the main branch if all tests pass.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This helps maintain a clean, stable codebase and reduces the risk of integration issues, especially in larger teams with frequent code commits.&lt;/p&gt;
&lt;h3&gt;Continuous Deployment (CD)&lt;/h3&gt;
&lt;p&gt;After your code has been tested and merged, GitHub Actions can take care of continuous deployment. With CD, you can automatically deploy your application to staging or production environments, ensuring that the latest version is always available.&lt;/p&gt;
&lt;p&gt;For example, you can set up workflows to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deploy a web app to cloud platforms like AWS, Azure, or Google Cloud.&lt;/li&gt;
&lt;li&gt;Push a Docker container to a registry like Docker Hub or Amazon ECR.&lt;/li&gt;
&lt;li&gt;Update Kubernetes clusters or serverless functions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This level of automation simplifies the release process, reduces manual intervention, and minimizes the risk of human errors during deployment.&lt;/p&gt;
&lt;h2&gt;Advanced GitHub Actions for Software Development&lt;/h2&gt;
&lt;p&gt;Beyond basic CI/CD workflows, GitHub Actions offers powerful capabilities for automating advanced tasks in software development. These advanced use cases can improve code quality, enhance security, and optimize your workflow&apos;s performance. Let&apos;s explore some of the ways you can leverage GitHub Actions for more complex scenarios.&lt;/p&gt;
&lt;h3&gt;Security Checks and Vulnerability Scanning&lt;/h3&gt;
&lt;p&gt;Maintaining security throughout the development lifecycle is crucial, and GitHub Actions makes it easy to integrate security checks into your workflows. You can automatically scan your dependencies and codebase for vulnerabilities, ensuring that potential risks are caught early in the development process.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dependabot&lt;/strong&gt; can be configured to automatically check for outdated dependencies and open pull requests to update them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CodeQL&lt;/strong&gt; can be used to run static analysis to identify security vulnerabilities in your code.&lt;/li&gt;
&lt;li&gt;Third-party security tools (like &lt;strong&gt;Snyk&lt;/strong&gt; or &lt;strong&gt;Bandit&lt;/strong&gt;) can be integrated to perform additional vulnerability scans.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By incorporating these tools into your workflow, you can ensure that your code remains secure throughout the entire development process.&lt;/p&gt;
&lt;h3&gt;Code Quality and Linting Automation&lt;/h3&gt;
&lt;p&gt;Maintaining consistent code quality is essential for long-term maintainability, and GitHub Actions allows you to enforce coding standards automatically. By integrating code linters and formatters into your workflows, you can ensure that code adheres to team guidelines before it&apos;s merged into the main branch.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ESLint&lt;/strong&gt; for JavaScript projects can be used to enforce coding style and catch common issues.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pylint&lt;/strong&gt; or &lt;strong&gt;Black&lt;/strong&gt; for Python projects can check for code style consistency and potential errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prettier&lt;/strong&gt; can automatically format code to ensure it&apos;s consistently styled across the project.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tools can be configured to run on every pull request, catching issues early and helping to maintain high code quality standards.&lt;/p&gt;
&lt;h3&gt;Managing Multiple Environments&lt;/h3&gt;
&lt;p&gt;In modern development workflows, applications often need to be tested and deployed in multiple environments, such as development, staging, and production. GitHub Actions can simplify the management of these environments by automating the deployment process across them.&lt;/p&gt;
&lt;p&gt;You can set up workflows that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run different sets of tests or build configurations depending on the environment.&lt;/li&gt;
&lt;li&gt;Deploy code to specific environments based on branch or tag (e.g., deploying to staging from a &lt;code&gt;staging&lt;/code&gt; branch and to production from a &lt;code&gt;main&lt;/code&gt; branch).&lt;/li&gt;
&lt;li&gt;Manage environment-specific secrets and credentials securely using GitHub Secrets.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By automating environment management, GitHub Actions ensures that your deployments are consistent and reduces the risk of configuration drift between environments.&lt;/p&gt;
&lt;h2&gt;Expanding GitHub Actions to Data Engineering&lt;/h2&gt;
&lt;p&gt;While GitHub Actions is a staple in software development workflows, it also offers tremendous potential for automating data engineering tasks. From managing ETL pipelines to automating data quality checks, GitHub Actions can help streamline data workflows in ways similar to traditional CI/CD processes.&lt;/p&gt;
&lt;h3&gt;Automating ETL Pipelines&lt;/h3&gt;
&lt;p&gt;One of the most common tasks in data engineering is managing ETL (Extract, Transform, Load) pipelines. GitHub Actions can automate the scheduling and execution of these pipelines, ensuring that data is extracted from various sources, transformed according to business rules, and loaded into target systems at regular intervals.&lt;/p&gt;
&lt;p&gt;Example workflows could include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extracting data from APIs or databases on a set schedule.&lt;/li&gt;
&lt;li&gt;Running Python or SQL scripts to transform the data.&lt;/li&gt;
&lt;li&gt;Loading the data into cloud storage or a data warehouse such as Snowflake, Redshift, or BigQuery.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging GitHub Actions’ built-in scheduling and triggers, you can set up data workflows that run without manual intervention.&lt;/p&gt;
&lt;h3&gt;Orchestrating Data Workflows&lt;/h3&gt;
&lt;p&gt;In more complex data engineering projects, tools like Apache Airflow (for orchestration) and dbt (for transformations) are used to manage dependencies between tasks. GitHub Actions can trigger and manage these runs, making it easier to maintain and monitor them directly from GitHub.&lt;/p&gt;
&lt;p&gt;For instance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GitHub Actions can trigger the execution of dbt models, transforming raw data into analytics-ready datasets.&lt;/li&gt;
&lt;li&gt;Actions can trigger Airflow DAGs (Directed Acyclic Graphs) to orchestrate data pipelines across multiple stages.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This integration allows you to maintain and deploy your data models and orchestrations seamlessly through GitHub, using a single platform for both development and data workflows.&lt;/p&gt;
&lt;h3&gt;Data Quality and Validation&lt;/h3&gt;
&lt;p&gt;Ensuring that your data is accurate and reliable is critical for any data pipeline. GitHub Actions can automate data validation checks, ensuring that the data meets specified quality standards before being used in downstream processes.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You can set up GitHub Actions to run data validation scripts after ingestion (e.g., using &lt;strong&gt;Great Expectations&lt;/strong&gt; to ensure that data conforms to your expectations).&lt;/li&gt;
&lt;li&gt;Validate schema changes automatically when new datasets are ingested.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This level of automation not only improves data quality but also reduces manual checks and ensures that only validated data is passed to other systems.&lt;/p&gt;
&lt;h3&gt;Automating Analytics and Reporting&lt;/h3&gt;
&lt;p&gt;Data engineers and analysts often generate reports or dashboards that summarize insights from large datasets. GitHub Actions can automate the creation of these reports and ensure that they are regularly updated as new data is ingested.&lt;/p&gt;
&lt;p&gt;Use cases include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automating Jupyter Notebooks to generate reports and committing the outputs to the repository.&lt;/li&gt;
&lt;li&gt;Triggering analytics tools like Apache Superset to refresh dashboards based on new data availability.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By integrating reporting tools into GitHub Actions, you can ensure that your reports are always up-to-date and accessible to stakeholders without manual intervention.&lt;/p&gt;
&lt;h2&gt;Using GitHub Actions in Hybrid Environments&lt;/h2&gt;
&lt;p&gt;Data engineering workflows often span across both on-premise and cloud environments, requiring coordination between different systems and infrastructure. GitHub Actions can bridge the gap between these environments, enabling smooth automation across hybrid architectures. Whether you&apos;re working with cloud storage or local databases, GitHub Actions can help you manage and synchronize data across these systems seamlessly.&lt;/p&gt;
&lt;h3&gt;Managing On-Prem and Cloud Workflows&lt;/h3&gt;
&lt;p&gt;In hybrid environments, data engineers may need to orchestrate workflows that involve both cloud-based services and on-premise infrastructure. GitHub Actions can be set up to automate tasks across these diverse environments by integrating with cloud providers like AWS, Azure, or GCP while also interacting with local systems.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;GitHub Actions can pull data from an on-premise SQL server, process it, and then upload it to cloud storage like AWS S3 or Azure Blob Storage.&lt;/li&gt;
&lt;li&gt;Workflows can trigger compute jobs in a cloud environment (e.g., running Spark or Dremio jobs) and then download the results to a local file system for further processing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By centralizing control in GitHub Actions, you can manage and execute workflows across multiple environments without needing to juggle different automation tools for cloud and on-prem systems.&lt;/p&gt;
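&lt;p&gt;Reaching on-premise systems usually means registering a self-hosted runner inside your network. A minimal sketch of the first pattern, with an assumed export script and bucket name:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: On-Prem to Cloud

on:
  schedule:
    - cron: &apos;0 2 * * *&apos;   # nightly at 02:00 UTC

jobs:
  extract-and-upload:
    # A self-hosted runner inside your network can reach the on-prem database
    runs-on: self-hosted
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      # export_table.py is an illustrative script that queries the SQL server
      - name: Export from on-prem SQL Server
        run: python scripts/export_table.py --out /tmp/export.parquet

      - name: Upload to S3
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: aws s3 cp /tmp/export.parquet s3://my-data-lake/exports/
&lt;/code&gt;&lt;/pre&gt;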
&lt;h3&gt;Data Transfers and Syncing Across Systems&lt;/h3&gt;
&lt;p&gt;A common challenge in hybrid environments is keeping data synchronized between cloud and on-prem systems. GitHub Actions can be used to automate data transfers between different storage locations and ensure that the latest data is always available where it&apos;s needed.&lt;/p&gt;
&lt;p&gt;Some common use cases include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Automating the synchronization of data between a cloud data lake (e.g., AWS S3 or Google Cloud Storage) and on-premise Hadoop clusters or local databases.&lt;/li&gt;
&lt;li&gt;Using GitHub Actions to monitor for new data in a cloud storage bucket and trigger a transfer job to move it into an on-prem data warehouse.&lt;/li&gt;
&lt;li&gt;Automating backups from on-prem databases to cloud storage for disaster recovery purposes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With GitHub Actions, you can schedule regular sync jobs or trigger data transfers based on specific events, ensuring that your hybrid environment remains in sync.&lt;/p&gt;
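&lt;p&gt;A scheduled sync job can be as small as the sketch below; the bucket name and local destination path are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Sync Cloud Bucket to On-Prem

on:
  schedule:
    - cron: &apos;*/30 * * * *&apos;   # every 30 minutes

jobs:
  sync:
    runs-on: self-hosted   # a runner with access to the on-prem file system
    steps:
      - name: Sync bucket to local storage
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        # copies only new or changed objects since the last run
        run: aws s3 sync s3://my-data-lake/landing/ /data/landing/
&lt;/code&gt;&lt;/pre&gt;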
&lt;h3&gt;Orchestrating Multi-Cloud Data Pipelines&lt;/h3&gt;
&lt;p&gt;For organizations utilizing multiple cloud providers, GitHub Actions can serve as a central orchestrator for multi-cloud data pipelines. By connecting to APIs and services across AWS, Azure, and GCP, GitHub Actions enables you to build workflows that span across different cloud platforms.&lt;/p&gt;
&lt;p&gt;Example workflows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extract data from an AWS RDS database, process it in Azure Data Factory, and store the results in Google BigQuery.&lt;/li&gt;
&lt;li&gt;Trigger machine learning models in different cloud environments, collecting results and merging them into a unified data lake.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Using GitHub Actions to orchestrate multi-cloud data workflows allows for efficient management of distributed systems, while maintaining flexibility across cloud vendors.&lt;/p&gt;
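&lt;p&gt;A multi-cloud pipeline can be expressed as dependent jobs, each authenticating to a different provider. The script names and secret names below are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Multi-Cloud Pipeline

on:
  workflow_dispatch:

jobs:
  extract-aws:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Extract from RDS
        env:
          RDS_CONN: ${{ secrets.RDS_CONN }}
        run: python scripts/extract_rds.py

  load-bigquery:
    needs: extract-aws   # runs only after the AWS extraction succeeds
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Load into BigQuery
        env:
          GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}
        run: python scripts/load_bq.py
&lt;/code&gt;&lt;/pre&gt;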
&lt;h2&gt;Practical Examples of GitHub Actions for Data Engineers&lt;/h2&gt;
&lt;p&gt;Let’s take a look at some real-world examples of how GitHub Actions can be leveraged by data engineers to automate and optimize their workflows. These examples demonstrate the versatility of GitHub Actions in handling a variety of data tasks, from orchestration to deployment.&lt;/p&gt;
&lt;h3&gt;Automating Apache Airflow DAG Deployments&lt;/h3&gt;
&lt;p&gt;Apache Airflow is a popular tool for managing data pipelines, but deploying Airflow DAGs (Directed Acyclic Graphs) can involve a lot of manual work. GitHub Actions can automate this process, ensuring that new or updated DAGs are deployed consistently and reliably.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use a GitHub Action that triggers whenever a new DAG is pushed to the repository.&lt;/li&gt;
&lt;li&gt;The workflow copies the DAG files to your Airflow environment&apos;s DAGs folder, where the scheduler picks them up automatically, and verifies that the DAG is available and ready to run.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By automating this deployment process, you can save time and reduce the risk of errors when introducing new DAGs to your workflow.&lt;/p&gt;
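&lt;p&gt;For a managed Airflow deployment that reads DAGs from object storage (e.g., Amazon MWAA), the deployment can be a single sync step; the bucket name below is an assumption:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Deploy Airflow DAGs

on:
  push:
    branches: [main]
    paths:
      - &apos;dags/**&apos;   # run only when DAG files change

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      # For self-managed Airflow you would instead copy files to the
      # scheduler&apos;s DAGs folder (e.g., via rsync over SSH)
      - name: Upload DAGs to the Airflow DAGs bucket
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: aws s3 sync dags/ s3://my-airflow-bucket/dags/ --delete
&lt;/code&gt;&lt;/pre&gt;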
&lt;h3&gt;Automating Dremio Queries with GitHub Actions&lt;/h3&gt;
&lt;p&gt;Dremio is a powerful data lakehouse platform that enables fast SQL queries over cloud and on-premise data. GitHub Actions can be used to automate querying and even data transformations in Dremio, allowing for seamless integration with your data pipelines.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set up a GitHub Action that triggers a query in Dremio to refresh a dataset or generate a new view.&lt;/li&gt;
&lt;li&gt;Automate the retrieval of query results, which can then be stored in a data warehouse or used to generate reports.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This allows for efficient, automated querying without needing to manually run queries in the Dremio UI.&lt;/p&gt;
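&lt;p&gt;One way to sketch this is to call Dremio&apos;s SQL REST endpoint from a workflow step. The hostname, query, and authentication scheme below are illustrative; consult your Dremio deployment&apos;s REST API documentation for the exact details:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Refresh Dremio Dataset

on:
  workflow_dispatch:

jobs:
  query:
    runs-on: ubuntu-latest
    steps:
      - name: Submit SQL to Dremio
        run: |
          curl -X POST &amp;quot;https://dremio.example.com/api/v3/sql&amp;quot; \
            -H &amp;quot;Authorization: Bearer ${{ secrets.DREMIO_TOKEN }}&amp;quot; \
            -H &amp;quot;Content-Type: application/json&amp;quot; \
            -d &apos;{&amp;quot;sql&amp;quot;: &amp;quot;SELECT COUNT(*) FROM my_space.my_view&amp;quot;}&apos;
&lt;/code&gt;&lt;/pre&gt;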
&lt;h3&gt;CI/CD for dbt Models&lt;/h3&gt;
&lt;p&gt;dbt (data build tool) is widely used for transforming data in analytics workflows. GitHub Actions can handle the CI/CD process for dbt models, ensuring that changes to your models are tested and deployed automatically.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A GitHub Action triggers on every pull request or push to the repository.&lt;/li&gt;
&lt;li&gt;The workflow runs &lt;code&gt;dbt test&lt;/code&gt; to validate the integrity of your dbt models.&lt;/li&gt;
&lt;li&gt;After testing, the models are deployed to the production environment.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This automated workflow ensures that your dbt transformations are always up to date and error-free, saving time and reducing the risk of manual errors.&lt;/p&gt;
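&lt;p&gt;A minimal dbt CI workflow following those steps might look like this; the adapter package (&lt;code&gt;dbt-snowflake&lt;/code&gt;) and profiles directory are assumptions you would adapt to your warehouse:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: dbt CI

on:
  pull_request:
  push:
    branches: [main]

jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: &apos;3.8&apos;

      - name: Install dbt
        run: pip install dbt-snowflake

      - name: Run dbt tests
        run: dbt test --profiles-dir ./ci

      # deploy only when changes land on main, after tests pass
      - name: Deploy models
        if: github.ref == &apos;refs/heads/main&apos;
        run: dbt run --profiles-dir ./ci
&lt;/code&gt;&lt;/pre&gt;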
&lt;h3&gt;Automating Data Ingestion from APIs&lt;/h3&gt;
&lt;p&gt;Data engineers often need to pull data from external APIs into their data pipelines. GitHub Actions can be used to automate the ingestion of this data, ensuring that it&apos;s available on a regular schedule or in response to specific triggers.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A GitHub Action triggers a Python script that pulls data from an external API.&lt;/li&gt;
&lt;li&gt;The data is processed and stored in a data warehouse (e.g., Snowflake or BigQuery).&lt;/li&gt;
&lt;li&gt;The workflow runs on a schedule or can be triggered manually to ensure that the data is always up to date.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By automating data ingestion, GitHub Actions simplifies the process of keeping external data sources synchronized with your internal systems.&lt;/p&gt;
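&lt;p&gt;The ingestion pattern above can be sketched as a scheduled workflow; the script path and secret names are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Ingest API Data

on:
  schedule:
    - cron: &apos;0 * * * *&apos;   # hourly
  workflow_dispatch:        # also allow manual runs

jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: &apos;3.8&apos;

      - name: Install dependencies
        run: pip install -r requirements.txt

      # ingest.py pulls from the external API and loads the warehouse
      - name: Pull data and load warehouse
        env:
          SOURCE_API_KEY: ${{ secrets.SOURCE_API_KEY }}
          WAREHOUSE_DSN: ${{ secrets.WAREHOUSE_DSN }}
        run: python scripts/ingest.py
&lt;/code&gt;&lt;/pre&gt;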
&lt;h2&gt;Basics of Implementing a GitHub Actions Workflow&lt;/h2&gt;
&lt;p&gt;Setting up a GitHub Actions workflow is straightforward and follows a defined structure using YAML files. This section will walk you through the basic components and how to create your first workflow, which can be extended to more complex use cases later on.&lt;/p&gt;
&lt;h3&gt;Step 1: Creating a Workflow File&lt;/h3&gt;
&lt;p&gt;Workflows in GitHub Actions are defined in YAML files located in the &lt;code&gt;.github/workflows/&lt;/code&gt; directory of your repository. Each workflow is represented as a separate YAML file, and you can create multiple workflows for different purposes (e.g., one for testing, one for deployment, etc.).&lt;/p&gt;
&lt;p&gt;To create a new workflow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to your repository.&lt;/li&gt;
&lt;li&gt;Create a new directory called &lt;code&gt;.github/workflows/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Inside this directory, create a new YAML file (e.g., &lt;code&gt;my-workflow.yml&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Defining the Workflow Structure&lt;/h3&gt;
&lt;p&gt;Each workflow file needs the following basic structure:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: My Workflow # Give your workflow a name
on:               # Define the trigger for the workflow
  push:           # Example: trigger on push events
    branches:
      - main      # Run the workflow only when pushing to the &apos;main&apos; branch

jobs:             # Define the jobs the workflow will run
  build:          # Example job name
    runs-on: ubuntu-latest   # Specify the environment for the job
    steps:                    # Define the steps within the job
      - name: Checkout code   # A step to checkout the repository
        uses: actions/checkout@v2

      - name: Run a script    # A step to run a custom script
        run: echo &amp;quot;Hello, world!&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Triggers for Workflows&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;on&lt;/code&gt; field defines when the workflow should be triggered. GitHub Actions provides several triggers based on repository events, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;push:&lt;/strong&gt; Trigger the workflow when changes are pushed to a specified branch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pull_request:&lt;/strong&gt; Run the workflow when a pull request is opened or updated.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;schedule:&lt;/strong&gt; Set up a cron-like schedule to run the workflow at regular intervals.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;workflow_dispatch:&lt;/strong&gt; Manually trigger a workflow from the GitHub Actions tab.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example using multiple triggers:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  schedule:
    - cron: &apos;0 0 * * *&apos;  # Run daily at midnight (UTC)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Defining Jobs and Steps&lt;/h3&gt;
&lt;p&gt;Within the jobs section, you define one or more jobs that will be run in parallel (by default). Each job contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;runs-on:&lt;/strong&gt; Specifies the type of runner (virtual machine) to run the job on. Common values include ubuntu-latest, windows-latest, and macos-latest.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;steps:&lt;/strong&gt; Lists the individual tasks that make up the job. Steps can include running commands, checking out the code, or using pre-built actions from the GitHub Marketplace.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the example below, we define a test job that runs on ubuntu-latest and includes steps to checkout the code and run tests:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Run tests
        run: npm test
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 5: Using Pre-built Actions&lt;/h3&gt;
&lt;p&gt;GitHub Actions has a large marketplace of pre-built actions that can be reused in workflows. For instance, the &lt;code&gt;actions/checkout@v2&lt;/code&gt; action is commonly used to check out your repository’s code before running further steps.&lt;/p&gt;
&lt;p&gt;Example of using a pre-built action to set up a Python environment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  setup-python:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: 3.8

      - name: Install dependencies
        run: pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 6: Running and Monitoring Workflows&lt;/h3&gt;
&lt;p&gt;Once your workflow YAML file is defined and committed to your repository, GitHub Actions will automatically trigger the workflow based on the events you&apos;ve specified. You can monitor the progress of your workflows and view logs directly from the GitHub Actions tab in your repository.&lt;/p&gt;
&lt;h3&gt;Step 7: Best Practices for Workflow Implementation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Modularize your steps:&lt;/strong&gt; Use separate jobs for different stages like testing, building, and deployment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reuse actions:&lt;/strong&gt; Instead of writing custom scripts for common tasks, use community actions from the GitHub Marketplace to save time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parallelism and caching:&lt;/strong&gt; Take advantage of parallel jobs and caching to reduce build times and improve efficiency.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secrets management:&lt;/strong&gt; Use GitHub Secrets to store sensitive information like API keys, database credentials, or tokens.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By following these steps, you can set up a robust GitHub Actions workflow that automates repetitive tasks and enhances productivity.&lt;/p&gt;
&lt;h2&gt;Using GitHub Secrets&lt;/h2&gt;
&lt;p&gt;When automating workflows, you often need to interact with sensitive data like API keys, database credentials, or tokens. Storing these secrets in plaintext within your workflow files is a security risk, but GitHub Secrets provides a secure way to manage sensitive information.&lt;/p&gt;
&lt;p&gt;GitHub Secrets allows you to securely store and access sensitive data in your workflows without exposing them in your version control system. This section will explain how to set up and use GitHub Secrets in your workflows.&lt;/p&gt;
&lt;h3&gt;Step 1: Adding Secrets to Your Repository&lt;/h3&gt;
&lt;p&gt;You can add secrets to your repository or organization, and they are encrypted to ensure their safety. To add a secret to your repository:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to the repository on GitHub.&lt;/li&gt;
&lt;li&gt;Click on the &lt;strong&gt;Settings&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;In the sidebar, click &lt;strong&gt;Secrets and variables&lt;/strong&gt; and then select &lt;strong&gt;Actions&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;New repository secret&lt;/strong&gt; button.&lt;/li&gt;
&lt;li&gt;Add the name of the secret (e.g., &lt;code&gt;API_KEY&lt;/code&gt;) and paste the value in the provided field.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add secret&lt;/strong&gt; to save it.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Your secret is now securely stored and can be accessed in your workflows.&lt;/p&gt;
&lt;h3&gt;Step 2: Accessing Secrets in a Workflow&lt;/h3&gt;
&lt;p&gt;Once you&apos;ve added a secret to your repository, you can reference it in your GitHub Actions workflows using the &lt;code&gt;secrets&lt;/code&gt; context. This ensures that the secret&apos;s value remains hidden even in the workflow logs.&lt;/p&gt;
&lt;p&gt;Here’s an example where an API key stored as a secret is used in a workflow step:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Example Workflow

on: [push]

jobs:
  example-job:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Use API Key
        run: curl -H &amp;quot;Authorization: Bearer ${{ secrets.API_KEY }}&amp;quot; https://api.example.com
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The secret &lt;code&gt;API_KEY&lt;/code&gt; is securely referenced in the &lt;code&gt;curl&lt;/code&gt; command using &lt;code&gt;${{ secrets.API_KEY }}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The actual value of the secret will not appear in the logs, ensuring it is not exposed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 3: Environment-Specific Secrets&lt;/h3&gt;
&lt;p&gt;GitHub Secrets can also be scoped to environments. For instance, you might have different credentials for your development and production environments. GitHub allows you to set up secrets specific to these environments.&lt;/p&gt;
&lt;p&gt;To add secrets for an environment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the Settings tab of your repository, click Environments.&lt;/li&gt;
&lt;li&gt;Select an environment (or create one) and configure secrets specific to that environment.&lt;/li&gt;
&lt;li&gt;In your workflow, ensure that the correct environment is being referenced when accessing secrets.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example using environment-specific secrets:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Deploy to Production

on: [push]

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Deploy using Production API Key
        run: curl -H &amp;quot;Authorization: Bearer ${{ secrets.PROD_API_KEY }}&amp;quot; https://api.production.com
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Managing Organization-Level Secrets&lt;/h3&gt;
&lt;p&gt;If you are working in a multi-repository project, it may be useful to store secrets at the organization level, which allows them to be shared across multiple repositories. Organization-level secrets work the same way as repository-level secrets but are accessible to all repositories within the organization that are authorized to use them.&lt;/p&gt;
&lt;p&gt;To add an organization secret:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Go to your GitHub organization&apos;s main page.&lt;/li&gt;
&lt;li&gt;Click on &lt;strong&gt;Settings&lt;/strong&gt; in the top navigation bar.&lt;/li&gt;
&lt;li&gt;In the sidebar, click &lt;strong&gt;Secrets and variables&lt;/strong&gt; and then select &lt;strong&gt;Actions&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;New organization secret&lt;/strong&gt;, provide a name and value, and save it.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You can then use this secret in any workflow across your authorized repositories, using the same &lt;code&gt;${{ secrets.SECRET_NAME }}&lt;/code&gt; syntax.&lt;/p&gt;
&lt;h3&gt;Step 5: Best Practices for Managing Secrets&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use descriptive names:&lt;/strong&gt; Name your secrets clearly to differentiate between similar ones (e.g., &lt;code&gt;DB_PASSWORD&lt;/code&gt;, &lt;code&gt;PROD_API_KEY&lt;/code&gt;, &lt;code&gt;DEV_API_KEY&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limit access:&lt;/strong&gt; Ensure that secrets are scoped appropriately (e.g., use environment-specific secrets to restrict production credentials to production workflows).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rotate secrets regularly:&lt;/strong&gt; Regularly update and rotate your secrets to ensure they remain secure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor secret usage:&lt;/strong&gt; Use logging and monitoring tools to track when and how secrets are used, but ensure secrets themselves are not exposed in logs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Do not hard-code secrets:&lt;/strong&gt; Never hard-code sensitive information directly in workflows or code. Always store it in GitHub Secrets for security and ease of management.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By following these steps, you can securely manage sensitive information in your GitHub workflows, ensuring that secrets are protected while enabling automated processes.&lt;/p&gt;
&lt;h2&gt;Parallelism and Matrix Builds in GitHub Actions&lt;/h2&gt;
&lt;p&gt;One of the most powerful features of GitHub Actions is the ability to run jobs in parallel or use matrix builds to test your application across different configurations. This can significantly reduce the time it takes to complete your CI/CD workflows by allowing multiple tasks to run simultaneously. Matrix builds, in particular, enable you to test your application across various environments, operating systems, and versions in a single workflow.&lt;/p&gt;
&lt;h3&gt;Step 1: Running Jobs in Parallel&lt;/h3&gt;
&lt;p&gt;By default, GitHub Actions runs jobs in parallel, meaning you don’t need to do anything extra to enable this. Multiple jobs will start as soon as there are available runners. However, you can explicitly set dependencies between jobs if you need certain jobs to complete before others start.&lt;/p&gt;
&lt;p&gt;Here’s an example of two jobs (&lt;code&gt;test&lt;/code&gt; and &lt;code&gt;build&lt;/code&gt;) running in parallel:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Parallel Jobs Example

on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Run tests
        run: npm test

  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Build project
        run: npm run build
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, both the test and build jobs will run simultaneously. GitHub Actions automatically schedules the jobs to run in parallel, optimizing the workflow&apos;s overall execution time.&lt;/p&gt;
&lt;h3&gt;Step 2: Defining Job Dependencies&lt;/h3&gt;
&lt;p&gt;If one job depends on another (e.g., you want to build your project only after the tests pass), you can define job dependencies using the &lt;code&gt;needs&lt;/code&gt; keyword. This ensures that jobs are executed in a specific order, despite GitHub Actions&apos; default parallel scheduling.&lt;/p&gt;
&lt;p&gt;Example of a workflow where the build job depends on the test job:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Run tests
        run: npm test

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Build project
        run: npm run build
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, the build job will only start once the test job has completed successfully.&lt;/p&gt;
&lt;h3&gt;Step 3: Matrix Builds&lt;/h3&gt;
&lt;p&gt;Matrix builds allow you to run multiple versions of your workflow with different parameters, such as different versions of a programming language, operating systems, or environments. This is particularly useful for ensuring your code works across various configurations.&lt;/p&gt;
&lt;p&gt;To set up a matrix build, define a matrix strategy under your job. For example, if you want to test a Node.js application on multiple versions of Node.js and different operating systems, you can use a matrix build like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;name: Matrix Build Example

on: [push]

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        node: [12, 14, 16]
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Set up Node.js
        uses: actions/setup-node@v2
        with:
          node-version: ${{ matrix.node }}
      - name: Install dependencies
        run: npm install
      - name: Run tests
        run: npm test
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;In this example:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;The workflow will run on three operating systems (ubuntu-latest, windows-latest, macos-latest).&lt;/li&gt;
&lt;li&gt;For each OS, the tests will run on three versions of Node.js (12, 14, 16).&lt;/li&gt;
&lt;li&gt;This results in a total of 9 combinations (3 OS versions x 3 Node.js versions), all running in parallel.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 4: Excluding Specific Combinations&lt;/h3&gt;
&lt;p&gt;You may not need to test every combination of matrix parameters. GitHub Actions allows you to exclude specific combinations using the &lt;code&gt;exclude&lt;/code&gt; keyword within the matrix strategy.&lt;/p&gt;
&lt;p&gt;For example, if you want to skip testing Node.js 12 on macOS, you can modify the matrix like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;strategy:
  matrix:
    os: [ubuntu-latest, windows-latest, macos-latest]
    node: [12, 14, 16]
    exclude:
      - os: macos-latest
        node: 12
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will run all combinations except Node.js 12 on macOS, reducing unnecessary testing and saving resources.&lt;/p&gt;
&lt;h3&gt;Step 5: Using Fail-Fast in Matrix Builds&lt;/h3&gt;
&lt;p&gt;By default, GitHub Actions cancels all in-progress and queued jobs in a matrix as soon as any one of them fails, because the &lt;code&gt;fail-fast&lt;/code&gt; option defaults to &lt;code&gt;true&lt;/code&gt;. If you would rather let the remaining combinations finish after a failure (for example, to see every failing configuration in a single run), disable it.&lt;/p&gt;
&lt;p&gt;To disable fail-fast:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;strategy:
  fail-fast: false   # fail-fast is a strategy-level key, not part of the matrix
  matrix:
    os: [ubuntu-latest, windows-latest, macos-latest]
    node: [12, 14, 16]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 6: Best Practices for Parallelism and Matrix Builds&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Optimize for common use cases:&lt;/strong&gt; Only test combinations that are critical for your project. Exclude unnecessary combinations to reduce build time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use caching:&lt;/strong&gt; Caching dependencies (e.g., Node modules, Python packages) across jobs can significantly reduce execution time in parallel jobs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor build performance:&lt;/strong&gt; Keep an eye on execution time, especially for larger matrices, to identify slow combinations or bottlenecks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By using parallel jobs and matrix builds effectively, you can reduce the time it takes to validate your code across multiple environments and configurations, ensuring robust test coverage with minimal overhead.&lt;/p&gt;
&lt;h2&gt;Caching Dependencies in GitHub Actions&lt;/h2&gt;
&lt;p&gt;Caching is a powerful feature in GitHub Actions that helps speed up your workflows by reusing dependencies or other resources from previous workflow runs. By caching dependencies like package managers, build artifacts, or compiled code, you can significantly reduce the time it takes to run your jobs, particularly when working with large projects or multiple environments.&lt;/p&gt;
&lt;h3&gt;Step 1: Understanding Caching in GitHub Actions&lt;/h3&gt;
&lt;p&gt;When a workflow runs, certain tasks (like installing dependencies) can be time-consuming, especially if they need to be performed repeatedly for each job or every push to the repository. Caching allows you to store these resources and reuse them in subsequent runs, reducing execution time.&lt;/p&gt;
&lt;p&gt;Common use cases for caching include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Package manager dependencies (e.g., npm, pip, Maven, Gradle).&lt;/li&gt;
&lt;li&gt;Build artifacts (e.g., compiled binaries or generated files).&lt;/li&gt;
&lt;li&gt;Docker layers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 2: Using the &lt;code&gt;actions/cache&lt;/code&gt; Action&lt;/h3&gt;
&lt;p&gt;GitHub provides a built-in action called &lt;code&gt;actions/cache&lt;/code&gt;, which allows you to easily cache directories, files, or other dependencies across workflow runs. You specify a key to uniquely identify the cache and paths to the directories or files you want to cache.&lt;/p&gt;
&lt;p&gt;Here’s an example of caching npm dependencies for a Node.js project:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      
      - name: Set up Node.js
        uses: actions/setup-node@v2
        with:
          node-version: &apos;14&apos;
      
      - name: Cache npm dependencies
        uses: actions/cache@v3
        with:
          path: ~/.npm
          key: ${{ runner.os }}-npm-cache-${{ hashFiles(&apos;package-lock.json&apos;) }}
          restore-keys: |
            ${{ runner.os }}-npm-cache-
      
      - name: Install dependencies
        run: npm install
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;actions/cache&lt;/code&gt; action is used to cache npm dependencies, stored in &lt;code&gt;~/.npm&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;key&lt;/code&gt; is based on the operating system and a hash of the &lt;code&gt;package-lock.json&lt;/code&gt; file, ensuring that the cache is invalidated if dependencies change.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;restore-keys&lt;/code&gt; are used as fallback keys to look for older caches if an exact match is not found.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 3: Key Strategies for Caching&lt;/h3&gt;
&lt;p&gt;The cache key is crucial because it determines whether a cache hit occurs. If the key matches a previously stored cache, it will be restored. If not, the cache will be rebuilt and stored with the new key. Here are common strategies for cache keys:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hashing dependency files:&lt;/strong&gt; Use hash values of files like package-lock.json or requirements.txt to ensure that the cache is updated when dependencies change.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;key: ${{ runner.os }}-pip-${{ hashFiles(&apos;requirements.txt&apos;) }}
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Timestamps or version numbers:&lt;/strong&gt; For certain types of caches (e.g., build artifacts), you can include version numbers or timestamps to manage cache invalidation.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;key: build-artifacts-${{ runner.os }}-v1
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fallback restore keys:&lt;/strong&gt; If an exact key match is not found, you can use restore-keys to specify broader matches. This helps reuse caches even if the exact key changes, like reusing an older npm cache if package-lock.json changes slightly.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;restore-keys: |
  ${{ runner.os }}-npm-cache-
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Caching for Different Languages&lt;/h3&gt;
&lt;p&gt;GitHub Actions caching can be used across a variety of languages and frameworks. Here are examples for some common setups:&lt;/p&gt;
&lt;h4&gt;Python (pip) Cache Example&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: &apos;3.8&apos;

      - name: Cache pip dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles(&apos;requirements.txt&apos;) }}
          restore-keys: |
            ${{ runner.os }}-pip-

      - name: Install dependencies
        run: pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Maven (Java) Cache Example&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Cache Maven dependencies
        uses: actions/cache@v3
        with:
          path: ~/.m2/repository
          key: ${{ runner.os }}-maven-${{ hashFiles(&apos;pom.xml&apos;) }}
          restore-keys: |
            ${{ runner.os }}-maven-

      - name: Build with Maven
        run: mvn clean install
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 5: Best Practices for Caching&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cache specific directories:&lt;/strong&gt; Only cache the files or directories that significantly impact performance (e.g., dependency folders, build artifacts). Avoid caching large, unnecessary directories.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use dependency file hashes:&lt;/strong&gt; Ensure cache keys are tied to dependency file hashes (e.g., package-lock.json or requirements.txt) to automatically invalidate the cache when dependencies change.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor cache usage:&lt;/strong&gt; Caching saves time, but it also uses storage and bandwidth. Monitor cache hits and misses to ensure that caching is being used effectively.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Restore with fallbacks:&lt;/strong&gt; Always provide restore-keys as a fallback mechanism to use previous caches when exact matches are unavailable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use cache in long workflows:&lt;/strong&gt; In workflows with multiple jobs, you can reuse the cache across different jobs to avoid reinstalling dependencies in each job.&lt;/li&gt;
&lt;/ul&gt;
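&lt;p&gt;As a sketch of that last point, two jobs can share a cache simply by declaring the same &lt;code&gt;path&lt;/code&gt; and &lt;code&gt;key&lt;/code&gt; (the job and step names here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  install:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Cache pip dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles(&apos;requirements.txt&apos;) }}
      - run: pip install -r requirements.txt

  test:
    runs-on: ubuntu-latest
    needs: install
    steps:
      - uses: actions/checkout@v2
      - name: Restore the same pip cache
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles(&apos;requirements.txt&apos;) }}
      - run: pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because both jobs compute the same key from the same dependency file, the second job gets a cache hit and skips the slow download step.&lt;/p&gt;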
&lt;h3&gt;Step 6: Cache Limitations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cache size limits:&lt;/strong&gt; GitHub limits total cache storage to 10GB per repository. Once that limit is exceeded, the oldest caches are evicted to make room for new ones.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eviction:&lt;/strong&gt; GitHub automatically evicts caches that have not been accessed in over 7 days. Make sure your workflows are regularly run to keep caches active.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Permissions:&lt;/strong&gt; Caches are scoped to the repository where they are created and cannot be shared across repositories or users.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging caching in your workflows, you can dramatically speed up your builds, reducing redundant tasks like re-installing dependencies and re-compiling code in each workflow run.&lt;/p&gt;
&lt;h2&gt;Monitoring and Debugging GitHub Actions Workflows&lt;/h2&gt;
&lt;p&gt;Monitoring and debugging GitHub Actions workflows is critical to ensuring that your automation processes run smoothly. GitHub Actions provides built-in tools and features to help you track workflow progress, troubleshoot failures, and optimize performance. In this section, we&apos;ll explore the best practices for monitoring and debugging your workflows effectively.&lt;/p&gt;
&lt;h3&gt;Step 1: Monitoring Workflow Runs&lt;/h3&gt;
&lt;p&gt;GitHub Actions provides a detailed interface for monitoring the status of your workflows. You can view logs, check the status of jobs, and inspect the steps of each job directly from the GitHub repository.&lt;/p&gt;
&lt;p&gt;To access workflow runs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Navigate to the &lt;strong&gt;Actions&lt;/strong&gt; tab in your repository.&lt;/li&gt;
&lt;li&gt;Select the workflow you want to monitor.&lt;/li&gt;
&lt;li&gt;Click on a specific workflow run to view its details, including logs, job status, and timestamps.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each workflow run is color-coded:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Green&lt;/strong&gt;: Successful run.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Red&lt;/strong&gt;: Failed run.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Yellow&lt;/strong&gt;: Job is currently running or there is a warning.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 2: Reviewing Job Logs&lt;/h3&gt;
&lt;p&gt;Each step in a job generates a log that can be reviewed to understand what happened during the workflow run. Logs show the output from each step, including any commands run, environment variables, and error messages. These logs are crucial for identifying issues when a job fails.&lt;/p&gt;
&lt;p&gt;To view the logs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Expand each step in the job to see detailed log output.&lt;/li&gt;
&lt;li&gt;If a job failed, the error message or output causing the failure will be highlighted in the logs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example of reviewing logs for a failed step:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;Run npm install
npm ERR! code E404
npm ERR! 404 Not Found: some-package@1.0.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This error indicates that a dependency was not found, allowing you to pinpoint the problem quickly.&lt;/p&gt;
&lt;h3&gt;Step 3: Debugging Failed Workflows&lt;/h3&gt;
&lt;p&gt;When a workflow fails, GitHub Actions provides detailed logs and context to help you troubleshoot. Here are a few tips for debugging failed workflows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Review the logs:&lt;/strong&gt; Start by reviewing the logs for the step that failed. The error message and stack trace will often indicate the cause of the problem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Check environment variables:&lt;/strong&gt; Ensure that the correct environment variables are being used. You can print them in the logs using echo commands for debugging.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-run failed workflows:&lt;/strong&gt; GitHub Actions allows you to re-run failed workflows after addressing the issue. This helps verify if your fix resolves the problem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use the fail-fast option:&lt;/strong&gt; In matrix builds, enabling fail-fast can help isolate issues faster by canceling other jobs once a failure occurs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example of adding a step to print environment variables for debugging:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;steps:
  - name: Print environment variables
    run: env
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Using Debugging Mode&lt;/h3&gt;
&lt;p&gt;GitHub Actions provides a debug mode that can give you more detailed output when you encounter complex issues. To enable debugging, you need to set the following secrets in your repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ACTIONS_RUNNER_DEBUG:&lt;/strong&gt; Set this to true to get more verbose logging about the runner&apos;s behavior.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ACTIONS_STEP_DEBUG:&lt;/strong&gt; Set this to true to get debug logs from each step in the workflow.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once these are enabled, GitHub Actions will provide more detailed logs, helping you identify the exact cause of the failure.&lt;/p&gt;
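&lt;p&gt;You can also emit your own debug messages with the &lt;code&gt;::debug::&lt;/code&gt; workflow command; these lines stay hidden in normal runs and only appear when &lt;code&gt;ACTIONS_STEP_DEBUG&lt;/code&gt; is enabled. A minimal sketch:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;steps:
  - name: Emit a debug-only message
    run: echo &amp;quot;::debug::resolved ref is ${{ github.ref }}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;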
&lt;h3&gt;Step 5: Setting Up Notifications for Workflow Events&lt;/h3&gt;
&lt;p&gt;To stay informed about your workflow runs, you can set up notifications for specific events such as workflow failures, successes, or completion. GitHub integrates with various communication tools like Slack and email, so you can get real-time notifications about workflow status.&lt;/p&gt;
&lt;p&gt;Example using the &lt;code&gt;slackapi/slack-github-action&lt;/code&gt; to send a Slack notification when a workflow fails:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  notify:
    runs-on: ubuntu-latest
    needs: build   # replace &apos;build&apos; with the job(s) this notification should monitor
    if: failure()
    steps:
      - name: Send Slack notification on failure
        uses: slackapi/slack-github-action@v1.23.0
        with:
          channel-id: &apos;YOUR_CHANNEL_ID&apos;
          slack-message: &amp;quot;Workflow failed: ${{ github.workflow }} - ${{ github.run_id }}&amp;quot;
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example sends a message to your Slack channel whenever a workflow fails, allowing you to react quickly to issues.&lt;/p&gt;
&lt;h3&gt;Step 6: Monitoring Workflow Performance&lt;/h3&gt;
&lt;p&gt;Over time, you may want to optimize your workflows to improve their performance. GitHub Actions provides timestamps for each job and step, allowing you to monitor how long specific tasks take to execute.&lt;/p&gt;
&lt;p&gt;To monitor performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Check the duration of each job and step in the Actions logs.&lt;/li&gt;
&lt;li&gt;Look for bottlenecks or long-running tasks that could be optimized (e.g., caching dependencies, running jobs in parallel).&lt;/li&gt;
&lt;li&gt;Use metrics from GitHub&apos;s built-in insights or third-party tools like Datadog or Prometheus to monitor workflow execution over time.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example of identifying a bottleneck:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;Step 1: Install dependencies (2m 34s)
Step 2: Run tests (1m 05s)
Step 3: Build project (3m 42s)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the &amp;quot;Build project&amp;quot; step is consistently slow, you might explore ways to cache build artifacts or split the build process across parallel jobs.&lt;/p&gt;
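&lt;p&gt;One way to attack a slow build step, sketched below, is to cache the build output keyed on a hash of the sources, so unchanged code skips the rebuild entirely (the &lt;code&gt;dist/&lt;/code&gt; path and &lt;code&gt;src/**&lt;/code&gt; glob are placeholders for your project&apos;s layout):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;      - name: Cache build output
        uses: actions/cache@v3
        with:
          path: dist/
          key: ${{ runner.os }}-build-${{ hashFiles(&apos;src/**&apos;) }}
          restore-keys: |
            ${{ runner.os }}-build-
&lt;/code&gt;&lt;/pre&gt;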
&lt;h3&gt;Step 7: Best Practices for Debugging and Monitoring&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use continue-on-error:&lt;/strong&gt; For non-critical steps, use &lt;code&gt;continue-on-error: true&lt;/code&gt; to allow the workflow to continue even if a step fails. This can help isolate issues without interrupting the entire workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use job dependencies:&lt;/strong&gt; Ensure that jobs with dependencies on other jobs are properly defined using the needs keyword, to avoid unnecessary failures.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Leverage if conditions:&lt;/strong&gt; Use conditional expressions like &lt;code&gt;if: success()&lt;/code&gt; or &lt;code&gt;if: failure()&lt;/code&gt; to control which steps run, based on the outcome of previous steps.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test workflows locally:&lt;/strong&gt; Tools like act allow you to run GitHub Actions locally for faster iteration and debugging.&lt;/li&gt;
&lt;/ul&gt;
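&lt;p&gt;Several of the practices above can be combined in one workflow; a minimal sketch (the job and step names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Lint (non-critical)
        continue-on-error: true
        run: npm run lint
      - name: Run tests
        run: npm test

  report:
    runs-on: ubuntu-latest
    needs: build
    if: failure()
    steps:
      - name: Flag the failure
        run: echo &amp;quot;Build job failed for ${{ github.sha }}&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here the lint step can fail without stopping the workflow, while the &lt;code&gt;report&lt;/code&gt; job runs only when the job it depends on fails.&lt;/p&gt;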
&lt;p&gt;By applying these techniques and monitoring tools, you can ensure that your GitHub Actions workflows run reliably, debug issues more effectively, and optimize their performance over time.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;GitHub Actions is a versatile and powerful automation tool that goes far beyond its initial use case of CI/CD in software development. By understanding its core features—such as workflows, parallelism, matrix builds, and caching—you can automate a wide range of tasks, from code deployment to data engineering workflows. Additionally, with its built-in secrets management, monitoring, and debugging tools, GitHub Actions enables you to create secure, efficient, and resilient automation pipelines.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re building, testing, and deploying applications, orchestrating complex data pipelines, or even generating reports and syncing data across hybrid environments, GitHub Actions provides a flexible framework to streamline your workflows. By implementing best practices such as caching dependencies, utilizing matrix builds for comprehensive testing, and monitoring performance through actionable insights, you can optimize your automation strategies and deliver results faster and more reliably.&lt;/p&gt;
&lt;p&gt;As you continue to explore the possibilities with GitHub Actions, remember that its true power lies in its ability to automate virtually any task you can define in a workflow. Take advantage of its rich ecosystem of pre-built actions and its seamless integration with other platforms, and let GitHub Actions handle the repetitive tasks so you can focus on innovation and problem-solving.&lt;/p&gt;
&lt;p&gt;Now that you&apos;ve seen what GitHub Actions can do across both software development and data engineering, it&apos;s time to get started and unlock the full potential of automation in your workflows.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Guide to dbt Macros - Purpose, Benefits, and Usage</title><link>https://iceberglakehouse.com/posts/2024-10-a-guide-to-dbt-macros/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-a-guide-to-dbt-macros/</guid><description>
- [Apache Iceberg 101](https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;utm_medium=influencer&amp;utm_campaign...</description><pubDate>Fri, 18 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dbtmacros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dbtmacros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dbtmacros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=dbtmacros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When working with dbt, one of the most powerful features available to you is &lt;strong&gt;macros&lt;/strong&gt;. Macros allow you to write reusable code that can be used throughout your dbt project, helping you optimize development, reduce redundancy, and standardize common patterns. In this post, we will explore the purpose of dbt macros, how they can help you streamline your data transformation workflows, and how to use them effectively.&lt;/p&gt;
&lt;h2&gt;What Are dbt Macros?&lt;/h2&gt;
&lt;p&gt;At a high level, &lt;strong&gt;dbt macros&lt;/strong&gt; are snippets of reusable code written in Jinja, a templating language integrated into dbt. Macros act like functions that you can call in various places within your dbt project (such as models, tests, and even other macros). They allow you to simplify repetitive tasks and add logic to your SQL transformations.&lt;/p&gt;
&lt;p&gt;You can think of macros as a way to &lt;strong&gt;DRY&lt;/strong&gt; (Don’t Repeat Yourself) your dbt code, which is particularly useful in larger projects where similar SQL patterns are repeated across many models.&lt;/p&gt;
&lt;h2&gt;How dbt Macros Help You&lt;/h2&gt;
&lt;p&gt;Here are some of the main benefits of using dbt macros in your project:&lt;/p&gt;
&lt;h3&gt;1. &lt;strong&gt;Reduce Redundancy&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;In many data transformation workflows, you might find yourself writing the same SQL logic across multiple models. For example, filtering out invalid records or applying specific transformations. With macros, you can abstract this logic into reusable functions and call them whenever needed, reducing code duplication.&lt;/p&gt;
&lt;h3&gt;2. &lt;strong&gt;Standardize SQL Logic&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Macros help ensure that common logic (such as data validation or custom joins) is applied consistently throughout your project. This standardization reduces the likelihood of errors and ensures that your transformations follow the same rules across different models.&lt;/p&gt;
&lt;h3&gt;3. &lt;strong&gt;Simplify Complex Logic&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;By using macros, you can break down complex logic into manageable, reusable components. This simplifies your SQL models, making them easier to read, maintain, and debug.&lt;/p&gt;
&lt;h3&gt;4. &lt;strong&gt;Dynamically Generate SQL&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Macros allow you to write SQL that adapts to different use cases based on variables, configuration settings, or inputs. This dynamic generation of SQL can help you handle a variety of edge cases and environments without manually altering the code.&lt;/p&gt;
&lt;h3&gt;5. &lt;strong&gt;Reuse Across Models&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Once a macro is defined, it can be used in multiple models, ensuring that any updates to the macro are reflected across the project. This promotes easier maintenance and faster updates.&lt;/p&gt;
&lt;h2&gt;How to Write and Use dbt Macros&lt;/h2&gt;
&lt;h3&gt;Defining a Macro&lt;/h3&gt;
&lt;p&gt;Macros are typically defined in a &lt;code&gt;.sql&lt;/code&gt; file within the &lt;code&gt;macros/&lt;/code&gt; directory of your dbt project. Here&apos;s an example of a simple macro that calculates the average of a column:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- macros/calculate_average.sql

{% macro calculate_average(column_name) %}
    AVG({{ column_name }})
{% endmacro %}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the macro calculate_average accepts a column name as a parameter and returns the &lt;code&gt;AVG()&lt;/code&gt; SQL function applied to that column.&lt;/p&gt;
&lt;h3&gt;Using a Macro in a Model&lt;/h3&gt;
&lt;p&gt;Once you&apos;ve defined the macro, you can call it within any model by using the following syntax:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- models/my_model.sql

SELECT
    {{ calculate_average(&apos;price&apos;) }} AS avg_price,
    category
FROM
    {{ ref(&apos;products&apos;) }}
GROUP BY
    category
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we’re using the calculate_average macro in the SELECT statement to calculate the average price in the products table, without needing to manually repeat the logic.&lt;/p&gt;
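&lt;p&gt;When dbt compiles this model, the macro call is expanded inline; the generated SQL looks roughly like this (the &lt;code&gt;analytics&lt;/code&gt; schema name is hypothetical and depends on your profile configuration):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
    AVG(price) AS avg_price,
    category
FROM
    analytics.products
GROUP BY
    category
&lt;/code&gt;&lt;/pre&gt;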
&lt;h3&gt;Using Macros with Variables&lt;/h3&gt;
&lt;p&gt;Macros can also be combined with variables to add more flexibility. For example, let’s define a macro that dynamically builds a WHERE clause based on a variable:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- macros/filter_by_status.sql

{% macro filter_by_status(status) %}
    WHERE status = &apos;{{ status }}&apos;
{% endmacro %}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can now use this macro to filter data based on a variable like so:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- models/orders.sql

SELECT *
FROM {{ ref(&apos;orders&apos;) }}
{{ filter_by_status(var(&apos;order_status&apos;, &apos;completed&apos;)) }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, &lt;code&gt;filter_by_status&lt;/code&gt; dynamically adds a &lt;code&gt;WHERE&lt;/code&gt; clause that filters the results by &lt;code&gt;order_status&lt;/code&gt;, which defaults to completed if not provided.&lt;/p&gt;
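&lt;p&gt;To override the default at run time, you can pass the variable on the command line with &lt;code&gt;--vars&lt;/code&gt;, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-plaintext&quot;&gt;dbt run --select orders --vars &apos;{&amp;quot;order_status&amp;quot;: &amp;quot;pending&amp;quot;}&apos;
&lt;/code&gt;&lt;/pre&gt;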
&lt;h3&gt;Complex Macros: Dynamic Table Joins&lt;/h3&gt;
&lt;p&gt;Here’s an example of a more advanced macro that creates a dynamic join based on parameters passed to it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- macros/join_tables.sql

{% macro join_tables(left_table, right_table, join_key) %}
    -- alias the tables as lt/rt: &apos;left&apos; and &apos;right&apos; are reserved words in most SQL dialects
    SELECT
        lt.*,
        rt.*
    FROM
        {{ ref(left_table) }} AS lt
    INNER JOIN
        {{ ref(right_table) }} AS rt
    ON
        lt.{{ join_key }} = rt.{{ join_key }}
{% endmacro %}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This macro takes two table names and a join key, then dynamically creates an INNER JOIN between the tables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- models/joined_data.sql

{{ join_tables(&apos;customers&apos;, &apos;orders&apos;, &apos;customer_id&apos;) }}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you call this macro, it generates the full SQL for joining the customers and orders tables on the customer_id key.&lt;/p&gt;
&lt;h2&gt;Best Practices for Using dbt Macros&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Keep Macros Focused:&lt;/strong&gt; Each macro should perform a single, well-defined task. Avoid cramming too much logic into a single macro; instead, break it down into smaller, reusable components.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Clear Naming Conventions:&lt;/strong&gt; Make sure macro names are descriptive so that their purpose is clear when used in models. This makes the code easier to understand and maintain.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Handle Edge Cases:&lt;/strong&gt; Always account for possible edge cases (e.g., null values or unexpected inputs) within your macros to ensure they perform reliably across different scenarios.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Leverage Macros in Tests:&lt;/strong&gt; You can also use macros in your dbt tests to create reusable testing logic, ensuring consistency across your project’s validation steps.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Document Your Macros:&lt;/strong&gt; Add comments and documentation to your macros to explain their purpose, parameters, and usage. This is especially helpful when multiple team members are contributing to the same project.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
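&lt;p&gt;As a sketch of reusing macro logic in tests, dbt lets you define a generic test with the same Jinja syntax; the test passes when the query returns zero rows (the &lt;code&gt;not_negative&lt;/code&gt; name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- tests/generic/not_negative.sql

{% test not_negative(model, column_name) %}
    SELECT *
    FROM {{ model }}
    WHERE {{ column_name }} &amp;lt; 0
{% endtest %}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You could then apply it to any column by listing &lt;code&gt;not_negative&lt;/code&gt; under that column&apos;s &lt;code&gt;tests&lt;/code&gt; in your model&apos;s YAML file.&lt;/p&gt;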
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;dbt macros are a powerful tool that can help you write cleaner, more maintainable, and reusable code in your data transformation projects. By abstracting complex logic, standardizing repetitive patterns, and dynamically generating SQL, macros significantly reduce the complexity and improve the reliability of your dbt workflows.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re new to dbt or an experienced user, learning to write and use macros effectively can take your data engineering capabilities to the next level. Start small with simple reusable snippets, and over time, incorporate more advanced logic to fully unlock the potential of macros in your dbt projects.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Lakehouse Roundup 1 - News and Insights on the Lakehouse</title><link>https://iceberglakehouse.com/posts/2024-10-data-lakehouse-roundup-1/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-data-lakehouse-roundup-1/</guid><description>
I’m excited to kick off a new series called &quot;Data Lakehouse Roundup,&quot; where I’ll cover the latest developments in the data lakehouse space, approxima...</description><pubDate>Wed, 16 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I’m excited to kick off a new series called &amp;quot;Data Lakehouse Roundup,&amp;quot; where I’ll cover the latest developments in the data lakehouse space, approximately every quarter. These articles are designed to quickly bring you up to speed on new releases and features related to data lakehouses. Each edition will start with a brief overview of key trends, followed by a roundup of major news from the past few months. Let’s dive in!&lt;/p&gt;
&lt;h2&gt;Trends&lt;/h2&gt;
&lt;p&gt;Data lakehouses are an emerging trend that help organizations achieve the best of both worlds—offering the structured, queryable data of a data warehouse while decoupling storage from compute. This shift allows organizations to model their data and define business-critical assets as structured tables, but instead of being tied to a compute system, the data is stored independently on a distributed storage system like Object Storage or HDFS. This modular and composable architecture reduces the need for excessive data movement, cutting down on time and resource costs.&lt;/p&gt;
&lt;p&gt;Two key abstractions make this possible. First, table formats enable datasets saved in groups of Parquet files to be recognized as singular tables, while maintaining the same transactional guarantees as an integrated data warehouse or database system. Second, catalogs serve as directories for lakehouse assets like tables, namespaces, and views. These catalogs allow any tool to connect and view assets on the lakehouse in a way similar to how one would interact with data warehouse assets. The result is a composable system that behaves much like traditional, tightly coupled systems.&lt;/p&gt;
&lt;p&gt;When it comes to table formats, there are four primary options: Apache Iceberg, Apache Hudi, Apache Paimon, and Delta Lake. The first three are Apache projects with diverse development communities, while Delta Lake’s primary repository and roadmap are largely driven by Databricks, with a broader community helping to replicate the API in other languages like Rust and Python.&lt;/p&gt;
&lt;h3&gt;Table Formats&lt;/h3&gt;
&lt;p&gt;In terms of analytics and data science workloads, the two dominant table formats are Apache Iceberg and Delta Lake. Apache Iceberg is favored for analytics because of its SQL-centric design and ease of use, while Delta Lake is often popular for AI/ML workloads, due to its mature Python support and the powerful enhancements provided by the Databricks platform, which is highly regarded for AI/ML. Although Iceberg and Delta lead the race in terms of consumption, Apache Iceberg has gained significant momentum, with major announcements from a wide range of companies, including Dremio, Snowflake, Upsolver, Estuary, AWS, Azure, Google, and Cloudera, all offering support for Iceberg lakehouses. Meanwhile, Apache Hudi and Apache Paimon have been embraced for their low-latency streaming ingestion capabilities. Often, these formats are converted into Iceberg or Delta for consumption later using tools like Apache Xtable.&lt;/p&gt;
&lt;p&gt;Given this level of community adoption, Apache Iceberg is quickly becoming the industry standard for data lakehouse tables. However, there are still many questions as organizations architect their lakehouses.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;What is a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/exploring-the-architecture-of-apache-iceberg-delta-lake-and-apache-hudi/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;Architecture of Iceberg, Hudi and Delta&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;Free Copy of O&apos;Reilly&apos;s Apache Iceberg Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Streaming&lt;/h3&gt;
&lt;p&gt;While using Hudi or Paimon and converting to Delta or Iceberg is an option, this conversion can cause a loss of benefits provided by formats like Iceberg—such as partition evolution and hidden partitioning, which optimize how data is organized and written. Native Iceberg streaming pipelines can be built with open-source tools like Kafka Connect, Flink, and Spark Streaming. Additionally, managed streaming services from companies like Upsolver and Estuary are emerging, making it easier to stream data into Iceberg tables with high performance.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/cdc-with-apache-iceberg/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;A Guide to Apache Iceberg CDC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/streaming-and-batch-data-lakehouses-with-apache-iceberg-dremio-and-upsolver/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;Streaming to Apache Iceberg with Upsolver&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/ingesting-data-into-nessie-apache-iceberg-with-kafka-connect-and-querying-it-with-dremio/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;Streaming into Apache Iceberg Tables with Kafka Connect&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Catalogs&lt;/h3&gt;
&lt;p&gt;Managing governance rules for data lakehouse tables across different compute tools can be cumbersome. To address this, there’s a growing interest in shifting access control and data management responsibilities from the compute engine to the catalog. Open-source catalogs like Apache Polaris (incubating), Apache Gravitino (incubating), Nessie, and Unity OSS provide options for tracking and governing lakehouse tables. Polaris, Gravitino, and Nessie all support Iceberg tables, while Unity OSS supports Delta Lake. Managed services for these open-source catalogs are also on the rise, with companies like Dremio, Snowflake, Datastrato, and Databricks offering solutions (though Databricks&apos; managed Unity service uses a different codebase than Unity OSS). Over the next year, catalog management will likely become a central focus in the lakehouse ecosystem. Multi-catalog management will also become more feasible as Polaris and Gravitino offer catalog federation features, and all of these catalogs support the Apache Iceberg REST API, ensuring compatibility with engines that follow the Iceberg REST spec.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-evolution-of-apache-iceberg-catalogs/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;The Evolution of Apache Iceberg Catalogs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-thinking-about-apache-iceberg-catalogs-like-nessie-and-apache-polaris-incubating-matters/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;Why Apache Polaris and Nessie Matter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-nessie-ecosystem-and-the-reach-of-git-for-data-for-apache-iceberg/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;The Nessie Ecosystem&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Hybrid Lakehouse&lt;/h3&gt;
&lt;p&gt;As organizations face regulations, cost considerations, and other factors, many are moving data from the cloud back to on-premise data centers or private clouds. While some data remains in the cloud for accessibility and regional performance, other data is stored on-premise for long-term archiving, to co-locate with on-prem compute, or for similar reasons. In this hybrid data lakehouse model, organizations need vendors that provide high-performance, feature-rich storage. Vendors like Minio, Pure Storage, Vast Data, and NetApp are stepping up to fill this need. Additionally, organizations require compute solutions that can access both on-prem and cloud data seamlessly, which is where Dremio excels. With its hybrid design, Dremio brings together a query engine, semantic layer, virtualization, and lakehouse catalog features to offer a unified view of all your data, whether in the cloud or on-premise.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/3-reasons-to-create-hybrid-apache-iceberg-data-lakehouses/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;3 Reasons to Have a Hybrid Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/hybrid-lakehouse-storage-solutions-minio/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;Hybrid Solutions: Minio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/hybrid-lakehouse-storage-solutions-purestorage/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;Hybrid Solutions: Pure Storage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/hybrid-lakehouse-infrastructure-solutions-vast-data/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;Hybrid Solutions: Vast Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/hybrid-lakehouse-storage-solutions-netapp/?utm_medium=influencer&amp;amp;utm_content=alexmerced&amp;amp;utm_source=ev_external_blog&amp;amp;utm_term=evolutions1&quot;&gt;Hybrid Solutions: NetApp&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;There is still a lot of growth happening in the lakehouse space, and keeping an eye on streaming, catalogs, and hybrid lakehouse solutions will keep you forward-looking in the lakehouse world. I look forward to discussing new trends in a few months in the next Lakehouse Roundup.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Getting Started with Data Analytics Using PyArrow in Python</title><link>https://iceberglakehouse.com/posts/2024-10-getting-started-with-pyarrow-in-python/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-getting-started-with-pyarrow-in-python/</guid><description>
- [Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-...</description><pubDate>Tue, 15 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intropyarrow&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=intropyarrow&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=intropyarrow&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;h3&gt;Overview of the Docker Environment&lt;/h3&gt;
&lt;p&gt;In this guide, we will explore data analytics using &lt;strong&gt;PyArrow&lt;/strong&gt;, a powerful library designed for efficient in-memory data processing with columnar storage. We will work within a pre-configured environment using the &lt;strong&gt;Python Data Science Notebook Docker Image&lt;/strong&gt;. This environment includes all the essential libraries for data manipulation, machine learning, and database connectivity, making it an ideal setup for performing analytics with PyArrow.&lt;/p&gt;
&lt;p&gt;To get started, you can pull and run the Docker container by following these steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Pull the Docker Image:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;   docker pull alexmerced/datanotebook
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Run the Container:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker run -p 8888:8888 -v $(pwd):/home/pydata/work alexmerced/datanotebook
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Access Jupyter Notebook:&lt;/strong&gt; Open your browser and navigate to http://localhost:8888 to access the notebook interface.&lt;/p&gt;
&lt;p&gt;This setup provides a user-friendly experience with Jupyter Notebook running on port 8888, where you can easily write and execute Python code for data analysis.&lt;/p&gt;
&lt;h3&gt;Why PyArrow?&lt;/h3&gt;
&lt;p&gt;Apache Arrow is an open-source framework optimized for in-memory data processing with a columnar format. PyArrow, the Python implementation of Arrow, enables faster, more efficient data access and manipulation compared to traditional in-memory libraries like Pandas. Here are some key benefits of using PyArrow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Faster Data Processing:&lt;/strong&gt; PyArrow uses a columnar memory layout that accelerates access to large datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lower Memory Usage:&lt;/strong&gt; Thanks to Arrow’s efficient memory format, you can handle larger datasets with less memory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Interoperability:&lt;/strong&gt; PyArrow integrates smoothly with other systems and languages, making it a versatile tool for multi-language environments.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Better Support for Large Datasets:&lt;/strong&gt; PyArrow is designed to handle big data tasks, making it ideal for workloads that Pandas struggles with.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Section 1: Understanding Key PyArrow Objects&lt;/h2&gt;
&lt;p&gt;PyArrow provides a set of data structures that are specifically optimized for in-memory analytics and manipulation. In this section, we will explore the key objects in PyArrow and their purposes.&lt;/p&gt;
&lt;h3&gt;PyArrow&apos;s Core Data Structures:&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;Table&lt;/code&gt; in PyArrow is a collection of columnar data, optimized for efficient processing and memory usage.&lt;/li&gt;
&lt;li&gt;It can be thought of as similar to a DataFrame in Pandas but designed to work seamlessly with Arrow’s columnar format.&lt;/li&gt;
&lt;li&gt;Tables can be partitioned and processed in parallel, which improves performance with large datasets.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;   import pyarrow as pa

   data = [
       pa.array([1, 2, 3]),
       pa.array([&apos;A&apos;, &apos;B&apos;, &apos;C&apos;]),
   ]
   table = pa.Table.from_arrays(data, names=[&apos;column1&apos;, &apos;column2&apos;])
   print(table)
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;RecordBatch&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A RecordBatch is a collection of rows with a defined schema. It allows for efficient in-memory processing of data in batches.&lt;/p&gt;
&lt;p&gt;It&apos;s useful when you need to process data in chunks, enabling better memory management.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pandas as pd
df = pd.DataFrame({&apos;column1&apos;: [1, 2, 3], &apos;column2&apos;: [&apos;A&apos;, &apos;B&apos;, &apos;C&apos;]})
batch = pa.RecordBatch.from_pandas(df)
print(batch)
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Array&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;An Array in PyArrow is a fundamental data structure representing a one-dimensional, homogeneous sequence of values.&lt;/p&gt;
&lt;p&gt;Arrays can be of various types, including integers, floats, strings, and more. PyArrow provides specialized arrays for different types of data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;array = pa.array([1, 2, 3, 4, 5])
print(array)
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Schema:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A Schema defines the structure of data in a Table or RecordBatch. It consists of the names and data types of each column.&lt;/p&gt;
&lt;p&gt;Schemas ensure that all data being processed follows a consistent format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;schema = pa.schema([
    (&apos;column1&apos;, pa.int32()),
    (&apos;column2&apos;, pa.string())
])

print(schema)
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;ChunkedArray:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;A ChunkedArray is a sequence of Array objects that have been split into smaller chunks. This allows for parallel processing on chunks of data, improving efficiency when working with larger datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;chunked_array = pa.chunked_array([[1, 2, 3], [4, 5, 6]])
print(chunked_array)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;These core objects are essential for working with PyArrow and enable efficient data processing in memory. By utilizing PyArrow&apos;s columnar format and its efficient handling of large datasets, you can perform complex data manipulations with ease. As we continue, you&apos;ll see how these objects interact to make reading, writing, and analyzing data faster and more memory-efficient.&lt;/p&gt;
&lt;h2&gt;Section 2: Reading and Writing Parquet Files with PyArrow&lt;/h2&gt;
&lt;p&gt;Parquet is a columnar storage file format that is widely used in big data analytics. Its efficient compression and encoding make it ideal for storing large datasets. In this section, we will explore how to use PyArrow to read from and write to Parquet files.&lt;/p&gt;
&lt;h3&gt;Why Use Parquet?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Efficient Storage&lt;/strong&gt;: Parquet’s columnar format allows for efficient compression, reducing the storage size of large datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Faster Querying&lt;/strong&gt;: By storing data in columns, Parquet files allow analytical queries to scan only the relevant columns, reducing I/O and improving performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interoperability&lt;/strong&gt;: Parquet is a widely supported format that can be read and written by many different systems, making it ideal for data exchange.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Reading Parquet Files&lt;/h3&gt;
&lt;p&gt;Using PyArrow, you can easily read a Parquet file into memory as a PyArrow &lt;code&gt;Table&lt;/code&gt;. This table can then be used for further data processing or manipulation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow.parquet as pq

# Reading a Parquet file
table = pq.read_table(&apos;sample_data.parquet&apos;)

# Displaying the contents of the PyArrow table
print(table)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the &lt;code&gt;pq.read_table()&lt;/code&gt; function reads the Parquet file and returns a &lt;code&gt;Table&lt;/code&gt; object. This table can now be used for in-memory operations such as filtering, joining, or aggregating data.&lt;/p&gt;
&lt;h3&gt;Writing Parquet Files&lt;/h3&gt;
&lt;p&gt;To store data as Parquet, you can write a PyArrow Table back to disk in Parquet format. PyArrow provides methods for this purpose, allowing you to save your data efficiently.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow as pa
import pyarrow.parquet as pq

# Create a simple PyArrow table
data = [
    pa.array([1, 2, 3, 4]),
    pa.array([&apos;A&apos;, &apos;B&apos;, &apos;C&apos;, &apos;D&apos;])
]
table = pa.Table.from_arrays(data, names=[&apos;column1&apos;, &apos;column2&apos;])

# Writing the table to a Parquet file
pq.write_table(table, &apos;output_data.parquet&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, a PyArrow table is created and saved to disk as a Parquet file using the &lt;code&gt;pq.write_table()&lt;/code&gt; function.&lt;/p&gt;
&lt;h3&gt;Working with Large Datasets&lt;/h3&gt;
&lt;p&gt;One of the key advantages of Parquet is its ability to handle large datasets efficiently. When reading a Parquet file, you can load only specific columns into memory, which keeps the memory footprint small when only part of the data is needed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Reading only specific columns from a Parquet file
table = pq.read_table(&apos;sample_data.parquet&apos;, columns=[&apos;column1&apos;])

print(table)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code demonstrates how to read only the relevant columns, reducing the memory footprint when loading the dataset.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;By using PyArrow to read and write Parquet files, you gain access to a highly efficient, compressed, and columnar data format that works well for large datasets. PyArrow simplifies working with Parquet by providing easy-to-use functions for loading and saving data, while also supporting advanced operations like selective column reads to optimize performance.&lt;/p&gt;
&lt;h2&gt;Section 3: Basic Analytical Operations with PyArrow&lt;/h2&gt;
&lt;p&gt;PyArrow not only provides efficient tools for reading and writing Parquet files but also enables you to perform basic data analytics operations like filtering, joining, and aggregating data in memory. These operations can be performed directly on PyArrow &lt;code&gt;Table&lt;/code&gt; objects, offering a significant performance boost when dealing with large datasets.&lt;/p&gt;
&lt;h3&gt;Filtering Data&lt;/h3&gt;
&lt;p&gt;PyArrow allows you to filter rows based on conditions, similar to how you would with Pandas. This operation is highly efficient due to the columnar nature of PyArrow&apos;s data structures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow.compute as pc

# Assume we have a table with two columns: &apos;column1&apos; and &apos;column2&apos;
table = pq.read_table(&apos;sample_data.parquet&apos;)

# Apply a filter to keep rows where &apos;column1&apos; &amp;gt; 2
filtered_table = table.filter(pc.greater(table[&apos;column1&apos;], 2))

print(filtered_table)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we use PyArrow’s compute module to filter the data. The pc.greater() function returns a boolean mask, and the filter() method applies this mask to the table, returning only rows where &apos;column1&apos; is greater than 2.&lt;/p&gt;
&lt;h3&gt;Joining Data&lt;/h3&gt;
&lt;p&gt;Just like in SQL or Pandas, PyArrow allows you to join two tables based on a common column. This operation is particularly useful when combining datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow as pa

# Creating two tables to join
left_table = pa.table({&apos;key&apos;: [1, 2, 3], &apos;value_left&apos;: [&apos;A&apos;, &apos;B&apos;, &apos;C&apos;]})
right_table = pa.table({&apos;key&apos;: [1, 2, 3], &apos;value_right&apos;: [&apos;X&apos;, &apos;Y&apos;, &apos;Z&apos;]})

# Performing an inner join on the &apos;key&apos; column
joined_table = left_table.join(right_table, keys=&apos;key&apos;)

print(joined_table)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we use PyArrow’s join method to perform an inner join on two tables, combining them based on the common column &apos;key&apos;. The result is a new table with data from both tables.&lt;/p&gt;
&lt;h3&gt;Aggregation Operations&lt;/h3&gt;
&lt;p&gt;Aggregation operations like summing, counting, and averaging are essential for data analytics. PyArrow provides efficient methods to perform these operations on large datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow.compute as pc

# Assume we have a table with a numerical column &apos;column1&apos;
table = pq.read_table(&apos;sample_data.parquet&apos;)

# Perform aggregation: sum of &apos;column1&apos;
sum_column1 = pc.sum(table[&apos;column1&apos;])

print(f&amp;quot;Sum of column1: {sum_column1.as_py()}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we use the &lt;code&gt;pc.sum()&lt;/code&gt; function to calculate the sum of a column. Similarly, you can apply other aggregation functions like &lt;code&gt;pc.mean()&lt;/code&gt;, &lt;code&gt;pc.min()&lt;/code&gt;, or &lt;code&gt;pc.max()&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Combining Operations: Filter and Aggregate&lt;/h3&gt;
&lt;p&gt;PyArrow allows you to chain operations together, such as filtering the data first and then applying aggregation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Filter the table where &apos;column1&apos; &amp;gt; 2
filtered_table = table.filter(pc.greater(table[&apos;column1&apos;], 2))

# Sum the filtered data in &apos;column1&apos;
sum_filtered = pc.sum(filtered_table[&apos;column1&apos;])

print(f&amp;quot;Sum of filtered column1: {sum_filtered.as_py()}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, we first filter the data and then apply the aggregation function on the filtered subset. This combination of operations enables more complex analyses with just a few lines of code.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;PyArrow’s powerful analytical capabilities make it a great choice for performing data operations on large datasets. By leveraging its efficient in-memory structures, you can filter, join, and aggregate data in a way that is both fast and memory-efficient. Whether you are working with small or large datasets, PyArrow provides the tools to handle your data analytics tasks with ease.&lt;/p&gt;
&lt;h2&gt;Section 4: Working with JSON, CSV, and Feather Files using PyArrow&lt;/h2&gt;
&lt;p&gt;In addition to Parquet, PyArrow supports a wide variety of file formats, including JSON, CSV, and Feather. These formats are commonly used for data storage and interchange, and PyArrow makes it easy to read from and write to them efficiently.&lt;/p&gt;
&lt;h3&gt;Reading and Writing JSON Files&lt;/h3&gt;
&lt;p&gt;JSON (JavaScript Object Notation) is a lightweight data-interchange format that is widely used for data transfer. While it may not be as efficient as columnar formats like Parquet, JSON is still commonly used, especially for web data.&lt;/p&gt;
&lt;h4&gt;Reading JSON Files&lt;/h4&gt;
&lt;p&gt;PyArrow allows you to read JSON data and convert it into a PyArrow &lt;code&gt;Table&lt;/code&gt; for further processing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow.json as paj

# Reading a JSON file into a PyArrow table
table = paj.read_json(&apos;sample_data.json&apos;)

# Display the contents of the table
print(table)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Writing JSON Files&lt;/h4&gt;
&lt;p&gt;PyArrow&apos;s &lt;code&gt;pyarrow.json&lt;/code&gt; module only supports reading. To write &lt;code&gt;Table&lt;/code&gt; data out as JSON, first convert it to Python objects (or a Pandas DataFrame) and use a JSON writer such as the standard library&apos;s.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json
import pyarrow as pa

# Create a simple PyArrow table
data = {
    &apos;column1&apos;: [1, 2, 3],
    &apos;column2&apos;: [&apos;A&apos;, &apos;B&apos;, &apos;C&apos;]
}
table = pa.Table.from_pydict(data)

# pyarrow.json has no writer, so convert the table to a list
# of row dicts and write it with the standard library
with open(&apos;output_data.json&apos;, &apos;w&apos;) as f:
    json.dump(table.to_pylist(), f)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Reading and Writing CSV Files&lt;/h3&gt;
&lt;p&gt;CSV (Comma-Separated Values) is one of the most common file formats for structured data, particularly in data science and analytics. PyArrow makes it easy to work with CSV files by converting them to Table objects.&lt;/p&gt;
&lt;h4&gt;Reading CSV Files&lt;/h4&gt;
&lt;p&gt;PyArrow’s CSV reader allows for fast parsing of large CSV files, which can then be converted into tables for in-memory analytics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow.csv as pac

# Reading a CSV file into a PyArrow table
table = pac.read_csv(&apos;sample_data.csv&apos;)

# Display the table
print(table)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Writing CSV Files&lt;/h4&gt;
&lt;p&gt;You can also write PyArrow tables back to CSV format, which is helpful for data sharing and reporting.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow as pa
import pyarrow.csv as pac

# Create a simple PyArrow table
data = {
    &apos;column1&apos;: [1, 2, 3],
    &apos;column2&apos;: [&apos;A&apos;, &apos;B&apos;, &apos;C&apos;]
}
table = pa.Table.from_pydict(data)

# Writing the table to a CSV file
pac.write_csv(table, &apos;output_data.csv&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Reading and Writing Feather Files&lt;/h3&gt;
&lt;p&gt;Feather is a binary columnar file format that provides better performance compared to CSV and JSON, while maintaining interoperability between Python and R. PyArrow natively supports Feather, allowing for efficient storage and fast reads.&lt;/p&gt;
&lt;h4&gt;Reading Feather Files&lt;/h4&gt;
&lt;p&gt;Feather files are ideal for fast I/O operations and work seamlessly with PyArrow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow.feather as paf

# Reading a Feather file into a PyArrow table
table = paf.read_table(&apos;sample_data.feather&apos;)

# Display the table
print(table)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Writing Feather Files&lt;/h4&gt;
&lt;p&gt;PyArrow can write Table objects to Feather format, offering a balance between ease of use and performance, particularly for in-memory data sharing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow as pa
import pyarrow.feather as paf

# Create a simple PyArrow table
data = {
    &apos;column1&apos;: [1, 2, 3],
    &apos;column2&apos;: [&apos;A&apos;, &apos;B&apos;, &apos;C&apos;]
}
table = pa.Table.from_pydict(data)

# Writing the table to a Feather file
paf.write_feather(table, &apos;output_data.feather&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;PyArrow’s support for various file formats—such as JSON, CSV, and Feather—makes it a versatile tool for data analytics. Whether you&apos;re working with structured CSVs, exchanging JSON data, or aiming for faster performance with Feather files, PyArrow simplifies the process of reading and writing these formats. This flexibility allows you to handle a wide range of data tasks, from data ingestion to efficient storage and retrieval.&lt;/p&gt;
&lt;h2&gt;Section 5: Using Apache Arrow Flight with PyArrow&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Apache Arrow Flight&lt;/strong&gt; is a high-performance data transport layer built on top of Apache Arrow. It provides an efficient way to transfer large datasets between systems. One of its key benefits is the ability to perform fast, scalable data transfers using gRPC for remote procedure calls. In this section, we will explore how to use Apache Arrow Flight with PyArrow with an example of connecting to &lt;strong&gt;Dremio&lt;/strong&gt;, a popular data platform that supports Arrow Flight for query execution.&lt;/p&gt;
&lt;h3&gt;Connecting to Dremio Using PyArrow Flight&lt;/h3&gt;
&lt;p&gt;Below is an example of how to connect to Dremio using PyArrow Flight, execute a query, and retrieve the results.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyarrow import flight
from pyarrow.flight import FlightClient
import os

# Step 1: Set the location of the Arrow Flight server
location = &amp;quot;grpc+tls://data.dremio.cloud:443&amp;quot;

# Step 2: Obtain the authentication token (from environment variables in this case)
token = os.getenv(&amp;quot;token&amp;quot;)

# Step 3: Define the headers for the Flight requests
# Here, we pass the bearer token for authentication
headers = [
    (b&amp;quot;authorization&amp;quot;, f&amp;quot;bearer {token}&amp;quot;.encode(&amp;quot;utf-8&amp;quot;))
]

# Step 4: Write the SQL query you want to execute
query = &amp;quot;SELECT * FROM table1&amp;quot;

# Step 5: Create a FlightClient instance to connect to the server
client = FlightClient(location=location)

# Step 6: Set up FlightCallOptions to include the authorization headers
options = flight.FlightCallOptions(headers=headers)

# Step 7: Request information about the query&apos;s execution
flight_info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Step 8: Fetch the results of the query
results = client.do_get(flight_info.endpoints[0].ticket, options)

# Step 9: Read and print the results from the server
print(results.read_all())
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Explanation of Each Step&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Set the Flight Server Location:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;location = &amp;quot;grpc+tls://data.dremio.cloud:443&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The location variable holds the address of the Dremio server that supports Apache Arrow Flight. Here, we use gRPC over TLS for a secure connection to Dremio Cloud.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Authentication with Bearer Token:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;token = os.getenv(&amp;quot;token&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The token is retrieved from an environment variable using &lt;code&gt;os.getenv()&lt;/code&gt;. This token is required for authenticating requests to Dremio’s Arrow Flight server.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Setting Request Headers:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;headers = [
    (b&amp;quot;authorization&amp;quot;, f&amp;quot;bearer {token}&amp;quot;.encode(&amp;quot;utf-8&amp;quot;))
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The headers include an authorization field with the bearer token, which is required for Dremio to authenticate the request. We use the &lt;code&gt;FlightCallOptions&lt;/code&gt; to attach this header to our request later.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;SQL Query:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;query = &amp;quot;SELECT * FROM table1&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the SQL query we will execute on Dremio. You can replace &amp;quot;table1&amp;quot; with any table or a more complex SQL query as needed.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Creating the FlightClient:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;client = FlightClient(location=location)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;FlightClient&lt;/code&gt; is the main object used to interact with the Arrow Flight server. It is initialized with the location of the server, allowing us to send requests and receive results.&lt;/p&gt;
&lt;ol start=&quot;6&quot;&gt;
&lt;li&gt;&lt;strong&gt;Setting Flight Call Options:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;options = flight.FlightCallOptions(headers=headers)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, FlightCallOptions is used to attach the headers (including our authentication token) to the requests made by the FlightClient.&lt;/p&gt;
&lt;ol start=&quot;7&quot;&gt;
&lt;li&gt;&lt;strong&gt;Fetching Flight Information:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;flight_info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;get_flight_info()&lt;/code&gt; function sends the query to Dremio and returns information about the query’s execution, such as where the results are located. The &lt;code&gt;FlightDescriptor.for_command()&lt;/code&gt; method is used to wrap the SQL query into a format understood by the Flight server.&lt;/p&gt;
&lt;ol start=&quot;8&quot;&gt;
&lt;li&gt;&lt;strong&gt;Retrieving the Query Results:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;results = client.do_get(flight_info.endpoints[0].ticket, options)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;do_get()&lt;/code&gt; function fetches the results of the query from the server. It takes in a ticket, which points to the data location, and the options to pass authentication headers.&lt;/p&gt;
&lt;ol start=&quot;9&quot;&gt;
&lt;li&gt;&lt;strong&gt;Reading and Printing Results:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;print(results.read_all())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, the &lt;code&gt;read_all()&lt;/code&gt; function is called to read all of the results into memory, and &lt;code&gt;print()&lt;/code&gt; displays the data.&lt;/p&gt;
&lt;h3&gt;Benefits of Using Apache Arrow Flight&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High Performance:&lt;/strong&gt; Arrow Flight is optimized for fast, high-volume data transfers, making it ideal for large datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;gRPC Communication:&lt;/strong&gt; The use of gRPC allows for more efficient, low-latency communication between systems.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cross-Language Support:&lt;/strong&gt; Arrow Flight works across multiple programming languages, providing flexibility in how data is accessed and processed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;Apache Arrow Flight with PyArrow offers an efficient and powerful way to transport data between systems, especially in high-performance environments. Using the example above, you can easily connect to Dremio, execute queries, and retrieve data in a highly optimized fashion. The combination of Arrow&apos;s in-memory data structures and Flight&apos;s fast data transport capabilities makes it an excellent tool for scalable, real-time data analytics.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In this blog, we explored the powerful capabilities of PyArrow for data analytics and efficient data handling. We began by setting up a practice environment using a &lt;strong&gt;Python Data Science Notebook Docker Image&lt;/strong&gt;, which provides a comprehensive suite of pre-installed libraries for data manipulation and analysis.&lt;/p&gt;
&lt;p&gt;We discussed the core benefits of &lt;strong&gt;PyArrow&lt;/strong&gt; over traditional libraries like Pandas, focusing on its performance advantages, particularly for large datasets. PyArrow&apos;s columnar memory layout and efficient in-memory processing make it a go-to tool for high-performance analytics.&lt;/p&gt;
&lt;p&gt;Throughout the blog, we covered key PyArrow objects like &lt;code&gt;Table&lt;/code&gt;, &lt;code&gt;RecordBatch&lt;/code&gt;, &lt;code&gt;Array&lt;/code&gt;, &lt;code&gt;Schema&lt;/code&gt;, and &lt;code&gt;ChunkedArray&lt;/code&gt;, explaining how they work together to enable efficient data processing. We also demonstrated how to read and write &lt;strong&gt;Parquet&lt;/strong&gt;, &lt;strong&gt;JSON&lt;/strong&gt;, &lt;strong&gt;CSV&lt;/strong&gt;, and &lt;strong&gt;Feather&lt;/strong&gt; files, showcasing PyArrow&apos;s versatility across various file formats commonly used in data science.&lt;/p&gt;
&lt;p&gt;Additionally, we delved into essential data operations like filtering, joining, and aggregating data using PyArrow. These operations allow users to handle large datasets efficiently while performing complex analyses with minimal memory usage.&lt;/p&gt;
&lt;p&gt;Lastly, we introduced &lt;strong&gt;Apache Arrow Flight&lt;/strong&gt; as a high-performance transport layer for data transfer. We provided a detailed example of how to connect to &lt;strong&gt;Dremio&lt;/strong&gt;, execute SQL queries, and retrieve results using Arrow Flight, highlighting its benefits for scalable, real-time data access.&lt;/p&gt;
&lt;p&gt;With these tools and techniques, you are equipped to perform efficient data analytics using PyArrow, whether you&apos;re working with local files or connecting to powerful cloud-based platforms like Dremio. By leveraging PyArrow&apos;s capabilities, you can handle big data tasks with speed and precision, making it an indispensable tool for modern data workflows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=intropyarrow&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=intropyarrow&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=intropyarrow&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What is Three-Tier Data (Bronze, Silver, Gold) and How Dremio Simplifies It</title><link>https://iceberglakehouse.com/posts/2024-10-bronze-silver-gold-data/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-bronze-silver-gold-data/</guid><description>
- [Apache Iceberg 101](https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;utm_medium=influencer&amp;utm_campaign...</description><pubDate>Wed, 09 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=threelayers&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=threelayers&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=threelayers&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=threelayers&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Organizing and curating data efficiently is key to delivering actionable insights. One of the most time-tested patterns for structuring data is the three-tier data organization pattern. This approach has been around for years, with each layer representing a different level of processing, from raw ingestion to fully prepared data ready for business use. While the names for these layers have changed over time, the concept remains foundational to managing data flows in complex environments.&lt;/p&gt;
&lt;p&gt;In this blog, we’ll explore the evolution of the three-tier data organization pattern and how it has been referred to by different names like raw/business/application, bronze/silver/gold, and raw/clean/semantic. We will then dive into how this pattern is used to move data from one layer to the next. Lastly, we&apos;ll discuss how tools like Dremio, along with advanced features such as Incremental and Live Reflections, simplify managing these layers without needing excessive data copies, particularly when working with Apache Iceberg tables.&lt;/p&gt;
&lt;h2&gt;1. The Evolution of the Three-Tier Data Organization Pattern&lt;/h2&gt;
&lt;h3&gt;Historical Terminologies&lt;/h3&gt;
&lt;p&gt;Over the years, the three-tier data organization pattern has been referenced using different naming conventions. Each naming scheme reflects the progression of data through its lifecycle—from unprocessed to refined and actionable. Here are some common terminologies used:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Raw / Business / Application&lt;/strong&gt;: One of the earliest naming conventions, where the focus is on raw data, business logic, and application-specific outputs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bronze / Silver / Gold&lt;/strong&gt;: A more modern take, especially in the context of data lakes and lakehouses. Bronze refers to raw data, Silver to cleaned or enriched data, and Gold to the most refined and consumable version of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Raw / Clean / Semantic&lt;/strong&gt;: A naming convention used often in data governance discussions. It emphasizes the transformation from raw ingested data to clean, validated data, and finally to a semantic layer where business logic and definitions are applied.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;A Universal Pattern&lt;/h3&gt;
&lt;p&gt;Despite the variation in names, the underlying concept remains the same: data is moved through different stages, each one adding more processing and value to the data. This structured movement helps ensure that organizations can have data at varying stages of readiness, depending on the use case. The pattern facilitates everything from raw data exploration to high-performance reporting.&lt;/p&gt;
&lt;p&gt;In the next sections, we’ll explore how this pattern is applied to move data between layers, and how modern tools like Dremio can make managing this process easier and more efficient.&lt;/p&gt;
&lt;h2&gt;2. The Role of Each Layer in the Pattern&lt;/h2&gt;
&lt;p&gt;Each layer in the three-tier data organization pattern serves a distinct purpose in processing data, making it easier to manage and consume over time. Let&apos;s break down the role of each layer.&lt;/p&gt;
&lt;h3&gt;Raw Layer (Bronze, Raw)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Definition:&lt;/strong&gt; The raw layer is where data lands directly after ingestion from the source, without any transformation. This data might come from transactional databases, sensors, logs, or third-party APIs, often in formats like JSON, CSV, or raw Parquet files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; The raw layer is vital for preserving the integrity of the original data. It’s useful for tracing back to the source, performing audits, and enabling detailed exploration of the untransformed data. However, it typically requires significant transformation before it can be useful for analytical purposes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Business Layer (Silver, Clean)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Definition:&lt;/strong&gt; In the business layer, data is cleaned, transformed, and partially processed. Here, duplicate records may be removed, data is normalized, and business rules are applied, but it still retains much of the underlying data structure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; This layer is useful for intermediate analysis and exploration, as the data is clean but not yet fully aggregated or processed. It allows data teams to explore trends and patterns before fully curating the data for business consumption. Often, this layer involves key transformations like joining data across multiple sources, removing irrelevant data, and applying first steps toward data modeling.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Application Layer (Gold, Semantic)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Definition:&lt;/strong&gt; The application layer contains fully refined data, curated and optimized for consumption by business applications and reporting tools. At this point, business logic is fully applied, aggregations are completed, and the data is optimized for fast access.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; The application layer is ideal for final reporting, business intelligence (BI) tools, and machine learning models. It&apos;s where the highest level of transformation occurs, and where performance is critical for real-time queries and analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By organizing data in this tiered structure, organizations ensure that they can move data smoothly from raw to ready-for-business use, making each layer available for different types of analysis depending on the needs of the business or application.&lt;/p&gt;
&lt;h2&gt;3. Traditional Challenges with Data Movement Between Layers&lt;/h2&gt;
&lt;p&gt;While the three-tier data pattern is foundational in modern data systems, it comes with challenges, particularly around moving data from one layer to the next.&lt;/p&gt;
&lt;h3&gt;Data Duplication&lt;/h3&gt;
&lt;p&gt;In traditional data systems, each layer typically involves creating separate copies of data. For example, data must be copied from the raw layer to the business layer, and again to the application layer. These copies consume storage resources and often lead to increased operational complexity in managing different versions of the same data.&lt;/p&gt;
&lt;h3&gt;Latency and Sync Issues&lt;/h3&gt;
&lt;p&gt;As data moves between layers, transformation jobs are often scheduled as batch processes, leading to delays between the availability of new data in each layer. This latency can cause inconsistencies between layers, particularly when the data in one layer is updated while the data in another is outdated.&lt;/p&gt;
&lt;h3&gt;Storage Overhead&lt;/h3&gt;
&lt;p&gt;Maintaining multiple copies of data across different layers results in significant storage overhead. For large-scale data systems, this can quickly become a burden, not only in terms of storage costs but also in terms of maintaining a clear lineage and understanding of the data.&lt;/p&gt;
&lt;p&gt;In the next section, we’ll discuss how Dremio addresses these challenges by allowing organizations to streamline data movement through virtual views and reflections, reducing the need for excessive data duplication.&lt;/p&gt;
&lt;h2&gt;4. How Dremio Streamlines Three-Tier Data Curation&lt;/h2&gt;
&lt;p&gt;Dremio provides a modern approach to managing the three-tier data organization pattern, reducing many of the challenges traditionally associated with moving data between layers. By leveraging Dremio&apos;s features such as virtual views and reflections, organizations can streamline the process, minimize data duplication, and improve query performance without needing to manage multiple physical copies of data.&lt;/p&gt;
&lt;h3&gt;Virtual Views: Logical Representation Without Duplication&lt;/h3&gt;
&lt;p&gt;One of Dremio’s most powerful features is the ability to create &lt;strong&gt;virtual views&lt;/strong&gt;, which allow you to logically represent the data at different stages (raw, business, and application) without having to duplicate or physically move it. These virtual views are essentially SQL queries that define how the data should appear at each stage, offering the following benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No Physical Copies Required:&lt;/strong&gt; Unlike traditional approaches that involve moving data into new tables for each layer, Dremio&apos;s virtual views allow you to create layers on top of the same physical data without creating extra copies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instant Access to Different Layers:&lt;/strong&gt; Data teams can quickly define how data should look at each layer by writing simple SQL queries, reducing the time and effort needed to curate and maintain different versions of the data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Centralized Data Governance:&lt;/strong&gt; By managing the different layers through virtual views, organizations can maintain better control over how data is transformed and accessed across the business, ensuring consistent governance without managing separate physical datasets.&lt;/li&gt;
&lt;/ul&gt;
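&lt;p&gt;As a rough sketch of what this looks like in practice (the table, view, and field names below are hypothetical, and view-creation syntax may vary by Dremio version), each layer can be defined as a SQL view over the layer beneath it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- Bronze: the raw Iceberg table as ingested; no view needed, query it directly.

-- Silver: a cleaned, deduplicated view defined on top of the raw table.
CREATE VIEW silver.orders_clean AS
SELECT DISTINCT
  order_id,
  CAST(order_ts AS TIMESTAMP) AS order_ts,
  UPPER(TRIM(region)) AS region,
  amount
FROM bronze.orders_raw
WHERE order_id IS NOT NULL;

-- Gold: an aggregated, business-ready view on top of the silver view.
CREATE VIEW gold.daily_revenue AS
SELECT region, CAST(order_ts AS DATE) AS order_date, SUM(amount) AS revenue
FROM silver.orders_clean
GROUP BY region, CAST(order_ts AS DATE);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All three layers reference the same physical data; only the view definitions differ, so there is nothing to copy and nothing to keep in sync.&lt;/p&gt;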
&lt;h3&gt;Reflections: Efficiently Materializing Data When Needed&lt;/h3&gt;
&lt;p&gt;While virtual views provide logical representations of each layer, Dremio&apos;s &lt;strong&gt;reflections&lt;/strong&gt; allow you to physically materialize data when necessary for performance optimization. Reflections are essentially pre-computed Iceberg-based data representations that Dremio can use to accelerate query performance across different layers. The key advantages include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On-Demand Materialization:&lt;/strong&gt; Reflections let you materialize data at the final, most refined layer (application or gold) without requiring you to manage multiple physical copies at the intermediate stages.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized Query Performance:&lt;/strong&gt; Dremio’s reflections enable fast query responses by allowing frequently accessed or complex views to be pre-computed and cached. This minimizes the need for repetitive transformations and significantly reduces query times across the entire data pipeline.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Transparent to End Users:&lt;/strong&gt; Reflections are completely transparent to end users; Dremio automatically uses the most efficient reflection to answer a query, without the user needing to know about the underlying optimizations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
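&lt;p&gt;For illustration, a reflection can be defined on a view with Dremio&apos;s reflection DDL. The dataset and field names below are hypothetical, and the exact syntax can differ between Dremio versions, so treat this as a sketch rather than a copy-paste recipe:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- Pre-compute an aggregate reflection to accelerate queries on a gold-layer view.
ALTER DATASET gold.daily_revenue
CREATE AGGREGATE REFLECTION daily_revenue_agg
USING
  DIMENSIONS (region, order_date)
  MEASURES (revenue (SUM));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Queries against the view are unchanged; Dremio&apos;s planner transparently substitutes the reflection whenever it can satisfy the query.&lt;/p&gt;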
&lt;p&gt;With these features, Dremio makes it much easier to manage the three-tier data pattern, offering flexibility in how data is represented and materialized while reducing the need for costly and complex data movement between layers. This is particularly valuable in modern data architectures, where the volume and velocity of data continue to grow.&lt;/p&gt;
&lt;p&gt;In the next section, we’ll explore how Dremio’s &lt;strong&gt;Incremental Reflections&lt;/strong&gt; and &lt;strong&gt;Live Reflections&lt;/strong&gt; enhance this process even further, particularly when using Apache Iceberg tables as the underlying data format.&lt;/p&gt;
&lt;h2&gt;5. The Impact of Dremio’s Incremental and Live Reflections&lt;/h2&gt;
&lt;p&gt;Dremio takes data acceleration a step further with &lt;strong&gt;Incremental Reflections&lt;/strong&gt; and &lt;strong&gt;Live Reflections&lt;/strong&gt;, especially when working with &lt;strong&gt;Apache Iceberg tables&lt;/strong&gt;. These features significantly enhance the efficiency of the three-tier data organization pattern by optimizing how reflections are updated and refreshed, ensuring data consistency without the need for full table reprocessing.&lt;/p&gt;
&lt;h3&gt;Incremental Reflections: Optimizing Data Refreshes&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Incremental Reflections&lt;/strong&gt; allow Dremio to refresh only the parts of a reflection that have changed, rather than reprocessing the entire dataset. This is particularly valuable in large-scale environments where data is constantly being ingested and updated. Incremental Reflections provide several key benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Faster Updates:&lt;/strong&gt; Since only the modified data is refreshed, the time and resources needed to update a reflection are significantly reduced. This ensures that data at each tier can be kept up to date without the latency or overhead associated with full data reprocessing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced Costs:&lt;/strong&gt; Incremental updates minimize the compute resources required to keep data synchronized between layers, reducing operational costs while maintaining high performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seamless Integration with Iceberg:&lt;/strong&gt; When using Apache Iceberg tables, which support efficient partitioning and metadata management, Incremental Reflections leverage the ability to track data changes, making the refresh process even more streamlined and scalable.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Live Reflections: Always Fresh Data&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Live Reflections&lt;/strong&gt; take the concept of data freshness even further by automatically updating whenever underlying data changes. This means that whenever the raw Iceberg tables are updated, the reflections built on top of them are automatically kept in sync without manual intervention. The advantages include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automatic Updates:&lt;/strong&gt; With Live Reflections, you don’t need to schedule jobs or manually trigger reflection updates. Changes to the data are automatically propagated, ensuring that all layers are always up-to-date.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency Across Layers:&lt;/strong&gt; As new data is ingested into the raw layer, the business and application layers are immediately refreshed, ensuring consistency across the entire data pipeline.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ideal for Real-Time Analytics:&lt;/strong&gt; Live Reflections are particularly useful for real-time or near-real-time analytics use cases, where having the most up-to-date data is critical. This enables decision-makers to rely on the most current data without delays caused by batch processing.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Use Case: Incremental and Live Reflections with Apache Iceberg&lt;/h3&gt;
&lt;p&gt;When combined with Apache Iceberg, Dremio&apos;s Incremental and Live Reflections offer a powerful solution for managing data across the three-tier pattern. When the underlying sources are Apache Iceberg tables, reflections across your layers can be refreshed incrementally and triggered automatically when the data changes; non-Iceberg sources (databases, data warehouses, non-Iceberg files on your data lake) must instead rely on full, scheduled refreshes. This combination lets you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Maintain Fresh Data:&lt;/strong&gt; As Iceberg’s metadata structure tracks data changes, Dremio’s Live Reflections ensure that updates are immediately reflected.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scale Efficiently:&lt;/strong&gt; Iceberg’s optimized partitioning enables Incremental Reflections to process only the changed partitions, reducing compute and storage costs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deliver Fast Queries:&lt;/strong&gt; With both Incremental and Live Reflections, users can execute high-performance queries on curated data without worrying about outdated data or long refresh times.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In summary, Dremio’s Incremental and Live Reflections bring significant improvements to the three-tier data organization pattern by ensuring data remains fresh and synchronized with minimal overhead.&lt;/p&gt;
&lt;p&gt;By leveraging these powerful features, Dremio not only simplifies the process of managing the three-tier data pattern but also ensures that organizations can do so with optimal efficiency and minimal cost.&lt;/p&gt;
&lt;h2&gt;6. Real-World Benefits of Using Dremio with the Three-Tier Data Organization Pattern&lt;/h2&gt;
&lt;p&gt;Now that we’ve explored how Dremio’s virtual views, reflections, and advanced features like Incremental and Live Reflections enhance the three-tier data organization pattern, let’s dive into the real-world benefits this approach delivers to data teams and organizations.&lt;/p&gt;
&lt;h3&gt;1. Minimized Data Duplication&lt;/h3&gt;
&lt;p&gt;Traditional data architectures often rely on creating multiple physical copies of data at each tier, which leads to increased storage costs, operational complexity, and data governance challenges. With Dremio’s virtual views and reflections, you can represent data at different stages without needing to physically copy it. By reducing data duplication, organizations can save significantly on storage costs and maintain a more streamlined data architecture.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage Savings:&lt;/strong&gt; By eliminating redundant copies, organizations reduce their data storage footprint, which is particularly important as datasets grow.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Simplified Data Governance:&lt;/strong&gt; Managing fewer copies means a clearer lineage of data transformations and a reduced risk of governance issues, ensuring better control over data access and compliance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Faster Time to Insights&lt;/h3&gt;
&lt;p&gt;One of the core objectives of the three-tier data organization pattern is to move data through different stages of readiness, from raw to fully processed, as efficiently as possible. Dremio’s reflections dramatically speed up query times by precomputing views of the data, allowing users to access insights faster, particularly at the business and application layers.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accelerated Analytics:&lt;/strong&gt; Reflections optimize query performance across different tiers, so analysts and decision-makers can run complex queries on highly processed data without waiting for lengthy transformations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reduced Latency:&lt;/strong&gt; With Incremental and Live Reflections, the time required to refresh and propagate data between layers is minimized, ensuring up-to-date data is always available for analysis.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Real-Time Data with Minimal Overhead&lt;/h3&gt;
&lt;p&gt;The combination of Apache Iceberg’s efficient data partitioning and Dremio’s Live Reflections enables organizations to maintain real-time or near-real-time data freshness without the operational overhead typically associated with traditional batch processing. Live Reflections automatically update when new data arrives, ensuring that the entire pipeline—from raw to application-ready data—stays consistent and up-to-date.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real-Time Updates:&lt;/strong&gt; Live Reflections ensure that every layer in the data pipeline reflects the most current state of the underlying data, which is essential for real-time analytics and decision-making.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lower Maintenance Costs:&lt;/strong&gt; By automating data synchronization across layers, organizations can reduce the operational burden on their data engineering teams, freeing up resources for more strategic work.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Scalability and Flexibility with Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Using Dremio alongside Apache Iceberg provides an ideal foundation for scaling the three-tier data architecture. Iceberg’s design allows for efficient handling of large datasets, versioning, and schema evolution, which are crucial for maintaining data consistency and performance as data volumes grow.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Seamless Scaling:&lt;/strong&gt; Iceberg’s metadata and partitioning capabilities make it easy to scale data systems without sacrificing performance, while Dremio’s reflections ensure that queries remain fast and responsive as the dataset grows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Flexible Data Management:&lt;/strong&gt; Iceberg’s ability to handle schema changes and partition evolution allows organizations to adapt their data models over time without reprocessing entire datasets, further optimizing costs and resources.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Better Resource Utilization&lt;/h3&gt;
&lt;p&gt;With Dremio’s ability to streamline data movement between layers and optimize query performance, organizations can make better use of their computational resources. Instead of spending significant compute power on redundant data transformations or processing entire datasets for minor changes, Incremental Reflections ensure that only the necessary data is processed, reducing costs and improving efficiency.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized Compute Costs:&lt;/strong&gt; Incremental processing allows organizations to use their compute resources more efficiently, focusing on changes rather than processing entire datasets repeatedly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improved Query Performance:&lt;/strong&gt; By precomputing reflections, Dremio ensures that even complex queries across large datasets are executed with minimal compute overhead, freeing up resources for other critical tasks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These real-world benefits make Dremio an invaluable tool for implementing and optimizing the three-tier data organization pattern. By leveraging its advanced features like virtual views, reflections, and its seamless integration with Apache Iceberg, organizations can achieve faster, more cost-effective data management, while maintaining the flexibility to scale as their data needs grow.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=threelayers&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=threelayers&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=threelayers&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=threelayers&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Brief Guide to the Governance of Apache Iceberg Tables</title><link>https://iceberglakehouse.com/posts/2024-10-governing-apache-iceberg-tables/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-governing-apache-iceberg-tables/</guid><description>
- [Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-...</description><pubDate>Mon, 07 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=iceberggov&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberggov&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberggov&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg is a powerful table format designed for data lakes, offering many features that simplify the management and evolution of large datasets. However, one area that Apache Iceberg leaves outside its scope is &lt;strong&gt;table governance&lt;/strong&gt;—specifically, managing access control and security for Iceberg tables. This means that Iceberg&apos;s metadata specification doesn&apos;t inherently govern who can view or modify tables. As a result, the responsibility of securing and governing access to Iceberg tables must be handled at different levels within your lakehouse architecture. Let’s explore these levels and their roles in controlling access.&lt;/p&gt;
&lt;h3&gt;File Level Governance&lt;/h3&gt;
&lt;p&gt;At the most granular level, governance can be applied directly to the metadata files that describe the Iceberg table and the data files themselves. These files are typically stored in cloud storage (e.g., Amazon S3, Azure Blob Storage, or Google Cloud Storage). By configuring &lt;strong&gt;bucket-level access rules&lt;/strong&gt; in your storage layer, you can control which users or services are allowed to interact with the underlying data. This approach, while effective, has its limitations. It’s challenging to enforce fine-grained access rules like row-level or column-level security at the file level, and it requires manual management of permissions across many objects.&lt;/p&gt;
&lt;h3&gt;Engine Level Governance&lt;/h3&gt;
&lt;p&gt;Another layer where governance can be applied is at the &lt;strong&gt;engine level&lt;/strong&gt;. Many query engines that interact with Iceberg tables, such as Dremio, Snowflake or Apache Spark, allow administrators to define access control rules for users on the platform. These engines can enforce permissions on who can execute certain queries or interact with specific datasets. However, this model has one significant limitation: the access rules only apply when queries are run &lt;strong&gt;through the engine&lt;/strong&gt;. If someone accesses the data with a different engine, the engine-level permissions don&apos;t apply. This highlights the need for an additional layer of governance.&lt;/p&gt;
&lt;h3&gt;Catalog Level Governance&lt;/h3&gt;
&lt;p&gt;The most effective way to secure Apache Iceberg tables is at the &lt;strong&gt;catalog level&lt;/strong&gt;. With the adoption of the Iceberg REST Catalog specification, catalogs can provide &lt;strong&gt;storage credentials through credential vending&lt;/strong&gt;, ensuring that only authorized users can access a table&apos;s data and metadata files. Building on this, catalogs are introducing features that let you specify access rules based on catalog credentials. This centralized model offers a significant advantage: governance is applied &lt;strong&gt;once&lt;/strong&gt;, and any engine or tool accessing the catalog adheres to the same access rules. This eliminates the need to configure permissions separately for each engine, simplifying governance and reducing the risk of misconfigurations.&lt;/p&gt;
&lt;p&gt;With catalog-level governance, organizations can control access to Iceberg tables based on catalog credentials, making it the most comprehensive and secure approach to table governance.&lt;/p&gt;
&lt;h2&gt;Catalog-Level Governance with Nessie and Apache Polaris&lt;/h2&gt;
&lt;p&gt;Now that we’ve established the importance of catalog-level governance for Iceberg tables, let’s explore how two prominent catalogs, &lt;strong&gt;Nessie&lt;/strong&gt; and &lt;strong&gt;Apache Polaris (Incubating)&lt;/strong&gt;, enable governance through catalog-level access controls.&lt;/p&gt;
&lt;h3&gt;Nessie: Catalog-Level Governance for Iceberg&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Project Nessie&lt;/strong&gt; is an open-source catalog that supports Apache Iceberg and enables Git-like branching and versioning for your datasets. Nessie introduces an additional layer of governance by allowing you to control access to Iceberg tables based on &lt;strong&gt;branches&lt;/strong&gt; and &lt;strong&gt;references&lt;/strong&gt;. Here&apos;s how Nessie implements access control:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reference-Based Access Control&lt;/strong&gt;: Nessie uses references such as branches and tags to control access to specific versions of your data. For instance, users may be granted read or write access to a specific branch (e.g., &lt;code&gt;prod&lt;/code&gt;), while other branches may remain restricted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Path-Based Access Control&lt;/strong&gt;: In Nessie, Iceberg tables are identified by a path (e.g., &lt;code&gt;namespace1/table1&lt;/code&gt;), and permissions can be applied to paths. You can grant users or roles access to specific tables or namespaces, enabling fine-grained access controls.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Commit-Level Governance&lt;/strong&gt;: Nessie tracks changes through commits, and you can control who can commit changes to a branch. This ensures that only authorized users can make modifications to the Iceberg tables on a given branch.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
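&lt;p&gt;As a rough illustration of what these rules look like in practice, Nessie supports CEL-based authorization rules in its server configuration. The rule names and expressions below are illustrative sketches; consult the Nessie documentation for the exact property format in your version:&lt;/p&gt;

```properties
# Nessie server config (sketch): CEL authorization rules
nessie.server.authorization.enabled=true
# Allow everyone to view references
nessie.server.authorization.rules.allow_viewing=op=='VIEW_REFERENCE'
# Only the 'etl' role may commit changes to the prod branch
nessie.server.authorization.rules.etl_commits=op=='COMMIT_CHANGE_AGAINST_REFERENCE' &amp;&amp; ref=='prod' &amp;&amp; role=='etl'
# Path-based read access to a single namespace
nessie.server.authorization.rules.read_ns1=op=='READ_ENTITY_VALUE' &amp;&amp; path.startsWith('namespace1')
```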
&lt;p&gt;With Nessie, governance isn’t just about who can read or write a table—it extends to controlling access to specific versions of your datasets, allowing for better control over data modifications and auditing capabilities.&lt;/p&gt;
&lt;h3&gt;Apache Polaris (Incubating): Centralized Access Control with RBAC&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Apache Polaris (Incubating)&lt;/strong&gt; takes catalog-level governance even further by implementing a robust &lt;strong&gt;Role-Based Access Control (RBAC)&lt;/strong&gt; model. Polaris provides a centralized platform to control access to Iceberg tables, namespaces, and even views. Here’s how Polaris manages access control:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Service Principals&lt;/strong&gt;: Polaris allows you to create &lt;strong&gt;service principals&lt;/strong&gt;, which are unique identities for users or services interacting with the catalog. These service principals are granted access to resources like tables and namespaces via &lt;strong&gt;principal roles&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Principal Roles and Catalog Roles&lt;/strong&gt;: In Polaris, principal roles are used to group service principals logically. These roles can then be assigned &lt;strong&gt;catalog roles&lt;/strong&gt;, which define specific privileges (such as read, write, or manage) on tables, namespaces, or entire catalogs. This hierarchical approach simplifies access control by grouping permissions at a high level.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Credential Vending&lt;/strong&gt;: Polaris employs &lt;strong&gt;credential vending&lt;/strong&gt;, which securely provides short-lived credentials to query engines. This ensures that only authorized services can access Iceberg tables during query execution, enhancing security while maintaining performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fine-Grained Privileges&lt;/strong&gt;: Polaris supports a wide range of table-level and namespace-level privileges, such as &lt;code&gt;TABLE_READ_DATA&lt;/code&gt;, &lt;code&gt;TABLE_WRITE_DATA&lt;/code&gt;, and &lt;code&gt;NAMESPACE_CREATE&lt;/code&gt;. This allows you to define exactly what users or services can do with each resource in the catalog.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
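&lt;p&gt;To see how these pieces fit together, the sketch below builds the kind of JSON payloads that correspond to the Polaris RBAC hierarchy just described. The names (&lt;code&gt;etl-service&lt;/code&gt;, &lt;code&gt;data-engineer&lt;/code&gt;, &lt;code&gt;catalog-admin&lt;/code&gt;) are invented, and the exact endpoint paths and schemas should be checked against the Polaris management API:&lt;/p&gt;

```python
import json

# 1. A service principal: the identity a user or service authenticates as
principal = {"principal": {"name": "etl-service"}}

# 2. A principal role that groups service principals logically
principal_role = {"principalRole": {"name": "data-engineer"}}

# 3. A catalog role that holds privileges within one catalog
catalog_role = {"catalogRole": {"name": "catalog-admin"}}

# 4. A grant attaching a fine-grained privilege to the catalog role
grant = {
    "grant": {
        "type": "table",
        "namespace": ["namespace1"],
        "tableName": "table1",
        "privilege": "TABLE_READ_DATA",
    }
}

# In a real deployment each payload would be POSTed to the Polaris
# management API; here we just print them to show the hierarchy.
for payload in (principal, principal_role, catalog_role, grant):
    print(json.dumps(payload))
```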
&lt;h3&gt;Comparing Nessie and Polaris for Governance&lt;/h3&gt;
&lt;p&gt;While both Nessie and Apache Polaris offer catalog-level governance for Iceberg tables, their approaches differ in important ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Branching and Versioning&lt;/strong&gt;: Nessie’s Git-like model emphasizes controlling access based on branches and commits, making it ideal for scenarios where versioned access is critical. Polaris, on the other hand, focuses on centralized RBAC, with precise control over roles and permissions across multiple services.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Integration with Query Engines&lt;/strong&gt;: Both catalogs integrate seamlessly with popular query engines like Apache Spark, Dremio, and Trino, providing a secure and governed layer for accessing Iceberg tables. Polaris&apos;s credential vending system adds an extra layer of security by issuing temporary credentials for query execution.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Governance Scope&lt;/strong&gt;: Nessie shines in scenarios where tracking changes and managing versions is key, while Polaris excels at providing a centralized governance model with fine-grained control over a large number of users and services.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Polaris offers a feature called &amp;quot;External Catalogs&amp;quot;, which lets you register an existing Nessie catalog inside Polaris and expose it there as a read-only external catalog. This allows you to take advantage of Nessie&apos;s Git-like versioning while leveraging Polaris&apos;s RBAC rules.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Governance of Apache Iceberg tables is crucial for ensuring that only authorized users can access and modify your datasets. While Apache Iceberg&apos;s specification does not include governance within its scope, table-level governance can be achieved by applying access controls at the &lt;strong&gt;file level&lt;/strong&gt;, &lt;strong&gt;engine level&lt;/strong&gt;, and most effectively, the &lt;strong&gt;catalog level&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Both &lt;strong&gt;Nessie&lt;/strong&gt; and &lt;strong&gt;Apache Polaris (Incubating)&lt;/strong&gt; provide powerful catalog-level governance for Iceberg tables, each with its unique features and strengths. Whether you need Git-like versioning and branching with Nessie or centralized role-based access control with Polaris, both options offer robust solutions for securing your Iceberg tables and ensuring proper governance across your data lakehouse.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Exploring Data Operations with PySpark, Pandas, DuckDB, Polars, and DataFusion in a Python Notebook</title><link>https://iceberglakehouse.com/posts/2024-10-exploring-data-operations-python/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-exploring-data-operations-python/</guid><description>
- [Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?](https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-...</description><pubDate>Mon, 07 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=pythondata&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=pythondata&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=pythondata&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Data engineers and scientists often work with a variety of tools to handle different types of data operations—from large-scale distributed processing to in-memory data manipulation. The &lt;code&gt;alexmerced/spark35nb&lt;/code&gt; Docker image simplifies this by offering a pre-configured environment where you can experiment with multiple popular data tools, including PySpark, Pandas, DuckDB, Polars, and DataFusion.&lt;/p&gt;
&lt;p&gt;In this blog, we&apos;ll guide you through setting up this environment and demonstrate how to perform basic data operations such as writing data, loading data, and executing queries and aggregations using these tools. Whether you’re dealing with large datasets or just need to manipulate small, in-memory data, you&apos;ll see how these different libraries can complement each other.&lt;/p&gt;
&lt;h2&gt;Section 1: Setting Up Your Environment&lt;/h2&gt;
&lt;h3&gt;1.1 Pull the Docker Image&lt;/h3&gt;
&lt;p&gt;To get started, you&apos;ll first need to pull the &lt;code&gt;alexmerced/spark35nb&lt;/code&gt; Docker image from Docker Hub. This image comes with a pre-configured environment that includes Spark 3.5.2, JupyterLab, and many popular data manipulation libraries like Pandas, DuckDB, and Polars.&lt;/p&gt;
&lt;p&gt;Run the following command to pull the image:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker pull alexmerced/spark35nb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, run the container using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker run -p 8888:8888 -p 4040:4040 -p 7077:7077 -p 8080:8080 -p 18080:18080 -p 6066:6066 -p 7078:7078 -p 8081:8081  alexmerced/spark35nb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once the container is up and running, open your browser and navigate to localhost:8888 to access JupyterLab, where you will perform all your data operations.&lt;/p&gt;
&lt;p&gt;Now that you have your environment set up, we can move on to performing some basic data operations using PySpark, Pandas, DuckDB, Polars, and DataFusion.&lt;/p&gt;
&lt;h2&gt;Section 2: Working with PySpark&lt;/h2&gt;
&lt;h3&gt;2.1 What is PySpark?&lt;/h3&gt;
&lt;p&gt;PySpark is the Python API for Apache Spark, an open-source engine designed for large-scale data processing and distributed computing. It allows you to work with big data by distributing data and computations across a cluster. While Spark is usually run in a distributed cluster, this setup allows you to run it locally on a single node—perfect for development and testing.&lt;/p&gt;
&lt;p&gt;Using PySpark, you can perform data manipulation, SQL queries, machine learning, and more, all within a framework that handles big data efficiently. In this section, we&apos;ll walk through how to write and query data using PySpark in the JupyterLab environment.&lt;/p&gt;
&lt;h4&gt;2.2 Writing Data with PySpark&lt;/h4&gt;
&lt;p&gt;Let’s start by creating a simple dataset in PySpark. First, initialize a Spark session, which is necessary to interact with Spark&apos;s functionality. We will create a small DataFrame with sample data and display it.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pyspark.sql import SparkSession

# Initialize the Spark session
spark = SparkSession.builder.appName(&amp;quot;PySpark Example&amp;quot;).getOrCreate()

# Sample data: a list of tuples containing names and ages
data = [(&amp;quot;Alice&amp;quot;, 34), (&amp;quot;Bob&amp;quot;, 45), (&amp;quot;Catherine&amp;quot;, 29)]

# Create a DataFrame
df = spark.createDataFrame(data, [&amp;quot;Name&amp;quot;, &amp;quot;Age&amp;quot;])

# Show the DataFrame
df.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we created a DataFrame with three rows of data, representing people&apos;s names and ages. The &lt;code&gt;df.show()&lt;/code&gt; method displays the contents of the DataFrame, making it easy to inspect the data we just created.&lt;/p&gt;
&lt;h3&gt;2.3 Loading and Querying Data with PySpark&lt;/h3&gt;
&lt;p&gt;Next, let’s load a dataset from a file and run some basic queries. PySpark can handle various file formats, including CSV, JSON, and Parquet.&lt;/p&gt;
&lt;p&gt;For this example, let’s assume we have a CSV file with more data about people, which we’ll load into a DataFrame. Then we’ll demonstrate a simple filter query and aggregation to count the number of people in each age group.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Load a CSV file into a DataFrame
df_csv = spark.read.csv(&amp;quot;data/people.csv&amp;quot;, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
df_csv.show()

# Filter the data to only include people older than 30
df_filtered = df_csv.filter(df_csv[&amp;quot;Age&amp;quot;] &amp;gt; 30)

# Show the filtered DataFrame
df_filtered.show()

# Group by Age and count the number of people in each age group
df_grouped = df_csv.groupBy(&amp;quot;Age&amp;quot;).count()

# Show the result of the grouping
df_grouped.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we loaded a CSV file into a PySpark DataFrame using &lt;code&gt;spark.read.csv()&lt;/code&gt;. Then, we applied two different operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Filtering&lt;/strong&gt;: We filtered the DataFrame to show only rows where the age is greater than 30.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregation&lt;/strong&gt;: We grouped the data by age and counted how many people are in each age group.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With PySpark, you can perform more complex queries and aggregations on large datasets, making it a powerful tool for big data processing.&lt;/p&gt;
&lt;p&gt;In the next section, we&apos;ll explore Pandas, which is great for smaller, in-memory data operations that don&apos;t require distributed processing.&lt;/p&gt;
&lt;h2&gt;Section 3: Data Manipulation with Pandas&lt;/h2&gt;
&lt;h3&gt;3.1 What is Pandas?&lt;/h3&gt;
&lt;p&gt;Pandas is one of the most widely used Python libraries for data manipulation and analysis. It provides easy-to-use data structures, like DataFrames, which allow you to work with tabular data in an intuitive way. Unlike PySpark, which is designed for large-scale distributed data processing, Pandas works in-memory, making it ideal for small to medium-sized datasets.&lt;/p&gt;
&lt;p&gt;With Pandas, you can read and write data from various formats, including CSV, Excel, and JSON, and perform common data operations like filtering, aggregating, and merging data with simple and readable syntax.&lt;/p&gt;
&lt;h3&gt;3.2 Loading Data with Pandas&lt;/h3&gt;
&lt;p&gt;Let’s start by loading a dataset into a Pandas DataFrame. We’ll read a CSV file, which is a common file format for data storage, and display the first few rows.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pandas as pd

# Load a CSV file into a Pandas DataFrame
df_pandas = pd.read_csv(&amp;quot;data/people.csv&amp;quot;)

# Display the first few rows of the DataFrame
print(df_pandas.head())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we read the CSV file &lt;code&gt;people.csv&lt;/code&gt; using &lt;code&gt;pd.read_csv()&lt;/code&gt; and loaded it into a Pandas DataFrame. The &lt;code&gt;head()&lt;/code&gt; method lets you view the first few rows of the DataFrame, which is useful for quickly inspecting the data.&lt;/p&gt;
&lt;h3&gt;3.3 Basic Operations with Pandas&lt;/h3&gt;
&lt;p&gt;Now that we have loaded the data, let’s perform some basic operations, such as filtering rows and grouping data. Pandas allows you to apply these operations easily with simple Python syntax.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Filter the data to show only people older than 30
df_filtered = df_pandas[df_pandas[&amp;quot;Age&amp;quot;] &amp;gt; 30]

# Display the filtered data
print(df_filtered)

# Group the data by &apos;Age&apos; and count the number of people in each age group
df_grouped = df_pandas.groupby(&amp;quot;Age&amp;quot;).count()

# Display the grouped data
print(df_grouped)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we filtered the data to include only people older than 30 using a simple boolean expression. Then, we used the groupby() function to group the DataFrame by age and count the number of people in each age group.&lt;/p&gt;
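&lt;p&gt;Merging deserves a quick illustration too. The sketch below joins the people data with a second, invented DataFrame of cities (not part of &lt;code&gt;people.csv&lt;/code&gt;) using &lt;code&gt;merge()&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Sample data standing in for people.csv
people = pd.DataFrame({"Name": ["Alice", "Bob", "Catherine"], "Age": [34, 45, 29]})

# A second table keyed by the same 'Name' column (invented for this example)
cities = pd.DataFrame({"Name": ["Alice", "Bob"], "City": ["Austin", "Boston"]})

# Left join: keep every person, attach a city where one exists (NaN otherwise)
merged = people.merge(cities, on="Name", how="left")
print(merged)
```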
&lt;p&gt;Pandas is incredibly efficient for in-memory data operations, making it a go-to tool for smaller datasets that can fit in your machine&apos;s memory. In the next section, we’ll explore DuckDB, a SQL-based tool that enables fast querying over in-memory data.&lt;/p&gt;
&lt;h2&gt;Section 4: Exploring DuckDB&lt;/h2&gt;
&lt;h3&gt;4.1 What is DuckDB?&lt;/h3&gt;
&lt;p&gt;DuckDB is an in-memory SQL database management system (DBMS) designed for analytical workloads. It offers high-performance, efficient querying of datasets directly within your Python environment. DuckDB is particularly well-suited for performing complex SQL queries on structured data, like CSVs or Parquet files, without needing to set up a separate database server.&lt;/p&gt;
&lt;p&gt;DuckDB is lightweight, yet powerful, and can be used as an alternative to tools like SQLite, especially when working with analytical queries on large datasets.&lt;/p&gt;
&lt;h3&gt;4.2 Writing Data into DuckDB&lt;/h3&gt;
&lt;p&gt;DuckDB can easily integrate with Pandas, allowing you to transfer data from a Pandas DataFrame into DuckDB for SQL-based queries. Here’s how to create a table in DuckDB using the data from Pandas.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import duckdb

# Connect to an in-memory DuckDB instance
conn = duckdb.connect()

# Create a table in DuckDB from the Pandas DataFrame
conn.execute(&amp;quot;CREATE TABLE people AS SELECT * FROM df_pandas&amp;quot;)

# Show the content of the &apos;people&apos; table
conn.execute(&amp;quot;SELECT * FROM people&amp;quot;).df()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we connected to DuckDB and created a new table &lt;code&gt;people&lt;/code&gt; from the Pandas DataFrame &lt;code&gt;df_pandas&lt;/code&gt;. DuckDB’s &lt;code&gt;execute()&lt;/code&gt; function allows you to run SQL commands, making it easy to interact with data using SQL queries.&lt;/p&gt;
&lt;h3&gt;4.3 Querying Data in DuckDB&lt;/h3&gt;
&lt;p&gt;Once your data is loaded into DuckDB, you can run SQL queries to filter, aggregate, and analyze your data. DuckDB supports a wide range of SQL functionality, making it ideal for users who prefer SQL over Python for data manipulation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Query to select people older than 30
result = conn.execute(&amp;quot;SELECT Name, Age FROM people WHERE Age &amp;gt; 30&amp;quot;).df()

# Display the result of the query
print(result)

# Query to group people by age and count the number of people in each age group
result_grouped = conn.execute(&amp;quot;SELECT Age, COUNT(*) as count FROM people GROUP BY Age&amp;quot;).df()

# Display the grouped result
print(result_grouped)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we used SQL to filter the people table, selecting only those who are older than 30. We then ran a grouping query to count the number of people in each age group.&lt;/p&gt;
&lt;p&gt;DuckDB is an excellent choice when you need SQL-like functionality directly in your Python environment. It allows you to leverage the power of SQL without the overhead of setting up and managing a database server. In the next section, we will explore Polars, a DataFrame library known for its speed and efficiency.&lt;/p&gt;
&lt;h2&gt;Section 5: Leveraging Polars for Fast DataFrame Operations&lt;/h2&gt;
&lt;h3&gt;5.1 What is Polars?&lt;/h3&gt;
&lt;p&gt;Polars is a DataFrame library designed for high-performance data manipulation. It’s known for its speed and efficiency, particularly when compared to libraries like Pandas. Polars is written in Rust and uses an optimized query engine to handle large datasets quickly and with minimal memory usage. It also provides a similar interface to Pandas, making it easy to learn and integrate into existing Python workflows.&lt;/p&gt;
&lt;p&gt;Polars is particularly well-suited for processing large datasets that Pandas struggles to handle in memory, or for scenarios where performance is a critical factor.&lt;/p&gt;
&lt;h3&gt;5.2 Working with Polars&lt;/h3&gt;
&lt;p&gt;Let’s start by creating a Polars DataFrame from a Python dictionary. We’ll then perform some basic operations like filtering and aggregating data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import polars as pl

# Create a Polars DataFrame
df_polars = pl.DataFrame({
    &amp;quot;Name&amp;quot;: [&amp;quot;Alice&amp;quot;, &amp;quot;Bob&amp;quot;, &amp;quot;Catherine&amp;quot;],
    &amp;quot;Age&amp;quot;: [34, 45, 29]
})

# Display the Polars DataFrame
print(df_polars)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we created a Polars DataFrame using a Python dictionary. The syntax is similar to Pandas, but the operations are optimized for speed. Polars offers lazy evaluation, which means it can optimize the execution of multiple operations at once, reducing computation time.&lt;/p&gt;
&lt;h3&gt;5.3 Filtering and Aggregating with Polars&lt;/h3&gt;
&lt;p&gt;Now, let’s perform some common data operations such as filtering and aggregating the data. These operations are highly optimized in Polars and can be done using a simple and expressive syntax.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Filter the DataFrame to show only people older than 30
df_filtered = df_polars.filter(pl.col(&amp;quot;Age&amp;quot;) &amp;gt; 30)

# Display the filtered DataFrame
print(df_filtered)

# Group by &apos;Age&apos; and count the number of people in each age group
df_grouped = df_polars.group_by(&amp;quot;Age&amp;quot;).agg(pl.len().alias(&amp;quot;count&amp;quot;))

# Display the grouped result
print(df_grouped)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we filtered the data to show only rows where the age is greater than 30, and then we grouped the data by age to count how many people are in each group. These operations are highly efficient in Polars due to its optimized memory management and query execution engine.&lt;/p&gt;
&lt;p&gt;Polars is ideal when you need the speed of a DataFrame library for both small and large datasets, and when performance is a key requirement. Next, we will explore DataFusion, a tool for SQL-based querying over Apache Arrow data.&lt;/p&gt;
&lt;h2&gt;Section 6: DataFusion for Query Execution&lt;/h2&gt;
&lt;h3&gt;6.1 What is DataFusion?&lt;/h3&gt;
&lt;p&gt;DataFusion is an in-memory query execution engine built on top of Apache Arrow, an efficient columnar memory format for analytics. It provides a powerful SQL engine that allows users to run complex queries over structured data stored in Arrow format. DataFusion is part of the Apache Arrow ecosystem, which aims to provide fast data interoperability across different data processing tools.&lt;/p&gt;
&lt;p&gt;DataFusion is particularly well-suited for scenarios where you need to query large in-memory datasets using SQL without the overhead of traditional databases. Its integration with Arrow ensures that the data processing is both fast and memory-efficient.&lt;/p&gt;
&lt;h3&gt;6.2 Writing and Querying Data with DataFusion&lt;/h3&gt;
&lt;p&gt;DataFusion allows you to execute SQL queries on in-memory data using Apache Arrow. Let’s first register some in-memory data as a table and then perform a few SQL queries on it.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyarrow as pa
from datafusion import SessionContext

# Initialize a DataFusion session
ctx = SessionContext()

# Build an Arrow record batch with some data
batch = pa.RecordBatch.from_pydict({
    &amp;quot;Name&amp;quot;: [&amp;quot;Alice&amp;quot;, &amp;quot;Bob&amp;quot;, &amp;quot;Catherine&amp;quot;],
    &amp;quot;Age&amp;quot;: [34, 45, 29]
})

# Register the record batch as the table &apos;people&apos;
ctx.register_record_batches(&amp;quot;people&amp;quot;, [[batch]])

# Query the data to select people older than 30
result = ctx.sql(&amp;quot;SELECT Name, Age FROM people WHERE Age &amp;gt; 30&amp;quot;).collect()

# Display the result
print(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we used DataFusion’s SessionContext to register an in-memory Arrow record batch as the table &lt;code&gt;people&lt;/code&gt;. We then performed a simple SQL query to filter the data for people older than 30. DataFusion allows you to combine the power of SQL with the speed and efficiency of Apache Arrow’s in-memory format.&lt;/p&gt;
&lt;h3&gt;6.3 Aggregating Data with DataFusion&lt;/h3&gt;
&lt;p&gt;Just like in DuckDB, we can perform aggregation queries to group data by a specific field and count the number of records in each group. Let’s see how this works in DataFusion.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Group by &apos;Age&apos; and count the number of people in each age group
result_grouped = ctx.sql(&amp;quot;SELECT Age, COUNT(*) as count FROM people GROUP BY Age&amp;quot;).collect()

# Display the grouped result
print(result_grouped)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this query, we grouped the data by the &apos;Age&apos; column and counted how many people were in each age group. DataFusion’s SQL execution engine ensures that queries run efficiently, even on large datasets stored in-memory.&lt;/p&gt;
&lt;p&gt;DataFusion is a great tool for users who need fast, SQL-based querying of large in-memory datasets and want to take advantage of Apache Arrow’s high-performance columnar data format. It’s particularly useful for building analytical pipelines that involve heavy querying of structured data.&lt;/p&gt;
&lt;h2&gt;Bonus Section: Integrating Dremio with Python&lt;/h2&gt;
&lt;h3&gt;What is Dremio?&lt;/h3&gt;
&lt;p&gt;Dremio is a powerful data lakehouse platform that helps organizations unify and query their data from various sources. It enables users to easily govern, join, and accelerate queries on their data without the need for expensive and complex data warehouse infrastructures. Dremio&apos;s ability to query data directly from sources like Apache Iceberg and Delta Lake tables, S3, relational databases, and JSON files, along with its performance enhancements, reduces the workload on traditional data warehouses.&lt;/p&gt;
&lt;p&gt;Dremio is built on top of Apache Arrow, a high-performance columnar in-memory format, and utilizes Arrow Flight to accelerate the transmission of large datasets over the network. This integration provides blazing-fast query performance while enabling interoperability between various analytics tools.&lt;/p&gt;
&lt;p&gt;In this section, we will demonstrate how to set up Dremio in a Docker container and use Python to query Dremio&apos;s data sources using the &lt;code&gt;dremio-simple-query&lt;/code&gt; library.&lt;/p&gt;
&lt;h3&gt;Setting Up Dremio with Docker&lt;/h3&gt;
&lt;p&gt;To run Dremio on your local machine, use the following Docker command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker run -p 9047:9047 -p 31010:31010 -p 45678:45678 -p 32010:32010 -e DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist --name try-dremio dremio/dremio-oss
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once Dremio is up and running, navigate to http://localhost:9047 in your browser to access the Dremio UI. Here, you can configure your data sources, create virtual datasets, and explore the platform&apos;s capabilities.&lt;/p&gt;
&lt;h3&gt;Querying Dremio with Python using dremio-simple-query&lt;/h3&gt;
&lt;p&gt;The dremio-simple-query library allows you to query Dremio using Apache Arrow Flight, providing a high-performance interface for fetching and analyzing data from Dremio sources. With this library, you can easily convert Dremio queries into Pandas, Polars, or DuckDB DataFrames, or work directly with Apache Arrow data.&lt;/p&gt;
&lt;p&gt;Here’s how to get started:&lt;/p&gt;
&lt;h4&gt;Step 1: Install the necessary libraries&lt;/h4&gt;
&lt;p&gt;Make sure you have the &lt;code&gt;dremio-simple-query&lt;/code&gt; library installed (it is pre-installed on the &lt;code&gt;alexmerced/spark35nb&lt;/code&gt; image). You can install it using pip:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install dremio-simple-query
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Step 2: Set up your connection to Dremio&lt;/h4&gt;
&lt;p&gt;You’ll need your Dremio credentials to retrieve a token and establish a connection. Here’s a basic example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremio_simple_query.connect import get_token, DremioConnection
from os import getenv
from dotenv import load_dotenv

# Load environment variables (TOKEN and ARROW_ENDPOINT)
load_dotenv()

# Login to Dremio and get a token
login_endpoint = &amp;quot;http://{host}:9047/apiv2/login&amp;quot;
payload = {
    &amp;quot;userName&amp;quot;: &amp;quot;your_username&amp;quot;,
    &amp;quot;password&amp;quot;: &amp;quot;your_password&amp;quot;
}
token = get_token(uri=login_endpoint, payload=payload)

# Dremio Arrow Flight endpoint, make sure to put in the right host for your Dremio instance

arrow_endpoint = &amp;quot;grpc://{host}:32010&amp;quot;

# Establish connection to Dremio using Arrow Flight
dremio = DremioConnection(token, arrow_endpoint)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are running this locally using the &lt;code&gt;docker run&lt;/code&gt; command above, the host should be the IP address of the Dremio container on the Docker network, which you can find by running &lt;code&gt;docker inspect&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In this code, we use the &lt;code&gt;get_token&lt;/code&gt; function to retrieve an authentication token from Dremio&apos;s REST API and establish a connection to Dremio&apos;s Arrow Flight endpoint.&lt;/p&gt;
&lt;h4&gt;Step 3: Query Dremio and retrieve data in various formats&lt;/h4&gt;
&lt;p&gt;Once connected, you can use the connection to query Dremio and retrieve results in different formats, including Arrow, Pandas, Polars, and DuckDB. Here’s how:&lt;/p&gt;
&lt;h5&gt;Querying Data and Returning as Arrow Table:&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Query Dremio and return data as an Apache Arrow Table
stream = dremio.toArrow(&amp;quot;SELECT * FROM my_table;&amp;quot;)
arrow_table = stream.read_all()

# Display Arrow Table
print(arrow_table)
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Converting to a Pandas DataFrame:&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Query Dremio and return data as a Pandas DataFrame
df = dremio.toPandas(&amp;quot;SELECT * FROM my_table;&amp;quot;)
print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Converting to a Polars DataFrame:&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Query Dremio and return data as a Polars DataFrame
df_polars = dremio.toPolars(&amp;quot;SELECT * FROM my_table;&amp;quot;)
print(df_polars)
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Querying with DuckDB:&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Query Dremio and return as a DuckDB relation
duck_rel = dremio.toDuckDB(&amp;quot;SELECT * FROM my_table&amp;quot;)

# Perform a query on the DuckDB relation
result = duck_rel.query(&amp;quot;my_table&amp;quot;, &amp;quot;SELECT * FROM my_table WHERE Age &amp;gt; 30&amp;quot;).fetchall()

# Display results
print(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the &lt;code&gt;dremio-simple-query&lt;/code&gt; library, you can efficiently query large datasets from Dremio and immediately start analyzing them with various tools like Pandas, Polars, and DuckDB, all while leveraging the high-performance Apache Arrow format under the hood.&lt;/p&gt;
&lt;h3&gt;Why Use Dremio?&lt;/h3&gt;
&lt;p&gt;Dremio provides several benefits that make it a powerful addition to your data stack:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Governance:&lt;/strong&gt; Centralize governance over all your data sources, ensuring compliance and control.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Federation:&lt;/strong&gt; Join data across various sources, such as Iceberg, Delta Lake, JSON, CSV, and relational databases, without moving the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; Accelerate your queries with the help of Dremio&apos;s query acceleration features and Apache Arrow Flight.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost Savings:&lt;/strong&gt; By offloading workloads from traditional data warehouses, Dremio can reduce infrastructure costs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio&apos;s close relationship with Apache Arrow ensures that your queries are both fast and efficient, allowing you to seamlessly integrate various data sources and tools into your analytics workflows.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;In this blog, we explored how to use a variety of powerful tools for data operations within a Python notebook environment. Starting with the &lt;code&gt;alexmerced/spark35nb&lt;/code&gt; Docker image, we demonstrated how to set up a development environment that includes PySpark, Pandas, DuckDB, Polars, and DataFusion—each optimized for different data processing needs. We showcased basic operations like writing, querying, and aggregating data using each tool’s unique strengths.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PySpark&lt;/strong&gt; enables scalable, distributed processing for large datasets, perfect for big data environments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pandas&lt;/strong&gt; offers in-memory, easy-to-use data manipulation for smaller datasets, making it the go-to tool for quick data exploration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DuckDB&lt;/strong&gt; provides an efficient, in-memory SQL engine, ideal for analytical queries without the need for complex infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Polars&lt;/strong&gt; brings lightning-fast DataFrame operations, combining performance and simplicity for larger or performance-critical datasets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DataFusion&lt;/strong&gt;, with its foundation in Apache Arrow, allows for high-performance SQL querying, particularly for analytical workloads in memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, we introduced &lt;strong&gt;Dremio&lt;/strong&gt;, which integrates with Apache Arrow to enable lightning-fast queries across a range of data sources. With the &lt;code&gt;dremio-simple-query&lt;/code&gt; library, Dremio allows analysts to quickly fetch and analyze data using tools like Pandas, Polars, and DuckDB, ensuring that data is available when and where it&apos;s needed without the overhead of traditional data warehouses.&lt;/p&gt;
&lt;p&gt;Whether you&apos;re working with small datasets or handling massive amounts of data in distributed environments, this setup provides a versatile, efficient, and scalable platform for any data engineering or data science project. By leveraging these tools together, you can cover the full spectrum of data processing, from exploration to large-scale analytics, with minimal setup and maximum performance.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Ultimate Directory of Apache Iceberg Resources</title><link>https://iceberglakehouse.com/posts/2024-10-ultimate-directory-of-Apache-Iceberg-Resources/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-ultimate-directory-of-Apache-Iceberg-Resources/</guid><description>
This article is a comprehensive directory of Apache Iceberg resources, including educational materials, tutorials, and hands-on exercises. Whether yo...</description><pubDate>Sat, 05 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This article is a comprehensive directory of Apache Iceberg resources, including educational materials, tutorials, and hands-on exercises. Whether you&apos;re a beginner or an experienced data engineer, this guide will help you navigate the world of Apache Iceberg and its applications.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg&lt;/h2&gt;
&lt;h4&gt;What is Apache Iceberg?&lt;/h4&gt;
&lt;p&gt;Apache Iceberg is an open-source data lakehouse table format: a standard for how the metadata that defines a group of files as a table is stored. This metadata lets any tool that supports the standard read and write those files the same way it would a table in a data warehouse, with the same features and ACID guarantees.&lt;/p&gt;
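&lt;p&gt;As a rough mental model (a simplified, hypothetical sketch; a real &lt;code&gt;metadata.json&lt;/code&gt; contains many more fields, and the field names here are illustrative), the metadata layer can be pictured as a small document that any engine can read to resolve the table to an exact set of files:&lt;/p&gt;

```python
# Hypothetical, heavily simplified sketch of Iceberg table metadata;
# see the Apache Iceberg table spec for the real layout.
table_metadata = {
    "format-version": 2,
    "schema": [{"name": "id", "type": "long"}, {"name": "name", "type": "string"}],
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": "s3://warehouse/tbl/metadata/snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "s3://warehouse/tbl/metadata/snap-2.avro"},
    ],
    "current-snapshot-id": 2,
}

# Any engine supporting the standard resolves the current table state the same way:
current = next(s for s in table_metadata["snapshots"]
               if s["snapshot-id"] == table_metadata["current-snapshot-id"])
print(current["manifest-list"])
```

&lt;p&gt;Because every engine walks the same metadata, they all agree on which snapshot, and therefore which data files, make up the table at any moment.&lt;/p&gt;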
&lt;h4&gt;Why Does it Matter?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;By keeping tables in a separate storage layer, you can use all your favorite analytical tools on a single copy of your data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reducing the number of copies needed can lower the compute, storage, and network costs of your overall data platform.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Storing your data in a standard format also reduces future migration costs when changing tooling or adopting new tools.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Who does Apache Iceberg benefit?&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Data Engineers, since less data movement means fewer data pipelines to manage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data Analysts, since fewer data movements mean more immediate access to data, especially when paired with the data virtualization available in tools like &lt;a href=&quot;https://www.dremio.com/blog/dremio-enables-data-unification-and-decentralization/&quot;&gt;Dremio, which offers lakehouse querying and federated querying (virtualization) on one platform&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data Scientists, because they too get more immediate access to data when training their AI/ML models.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data Leaders, since reducing overall platform costs makes it easier to fund other data initiatives.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Apache Iceberg Directory&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://iceberg.apache.org/&quot;&gt;Apache Iceberg Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg Education&lt;/h3&gt;
&lt;p&gt;Here is a list of resources to help you learn Apache Iceberg:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-crash-course-what-is-a-data-lakehouse-and-a-table-format/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg Crash Course: What is a Data Lakehouse and a Table Format?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&quot;&gt;Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg Hands-on Tutorials&lt;/h3&gt;
&lt;p&gt;Here is a list of hands-on tutorials that will help you get started with Apache Iceberg:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Intro to Apache Iceberg, Nessie and Dremio on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-json-csv-and-parquet-to-dashboards-with-apache-iceberg-and-dremio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;JSON/CSV/Parquet to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-mongodb-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;From MongoDB to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-sqlserver-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;From SQLServer to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-postgres-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;From Postgres to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/experience-the-dremio-lakehouse-hands-on-with-dremio-nessie-iceberg-data-as-code-and-dbt/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-elasticsearch-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Elasticsearch to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-mysql-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;MySQL to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-apache-druid-to-dashboards-with-dremio-and-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Druid to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/bi-dashboards-with-apache-iceberg-using-aws-glue-and-apache-superset/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;BI Dashboards with Apache Iceberg Using AWS Glue and Apache Superset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/end-to-end-basic-data-engineering-tutorial-spark-dremio-superset-c076a56eaa75&quot;&gt;End-to-End Basic Data Engineering Tutorial (Spark, Apache Iceberg, Dremio, Superset)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg&apos;s Architecture&lt;/h3&gt;
&lt;p&gt;Here is a list of resources to help you learn Apache Iceberg&apos;s architecture and internals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-life-of-a-read-query-for-apache-iceberg-tables/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;The Life of a Read Query for Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-life-of-a-write-query-for-apache-iceberg-tables/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;The Life of a Write Query for Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://datalakehousehub.com/blog/2024-8-apache-iceberg-metadata-json&quot;&gt;Understanding Apache Iceberg&apos;s Metadata.json&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://datalakehousehub.com/blog/2024-8-understanding-apache-iceberg-manifest-list?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Understanding the Apache Iceberg Manifest List (Snapshot)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://datalakehousehub.com/blog/2024-8-understanding-apache-iceberg-manifest?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Understanding the Apache Iceberg Manifest&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://datalakehousehub.com/blog/2024-8-understanding-apache-iceberg-delete-files?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Understanding Apache Iceberg Delete Files&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/puffins-and-icebergs-additional-stats-for-apache-iceberg-tables/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Puffins and Icebergs: Additional Stats for Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-apache-iceberg-is-built-for-open-optimized-performance/&quot;&gt;How Apache Iceberg is Built for Open Optimized Performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/ensuring-high-performance-at-any-scale-with-apache-icebergs-object-store-file-layout/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Ensuring High Performance at Any Scale with Apache Iceberg’s Object Store File Layout&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/row-level-changes-on-the-lakehouse-copy-on-write-vs-merge-on-read-in-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/acid-guarantees-and-apache-iceberg-turning-any-storage-into-a-data-warehouse-e2b6cdf8bf45?source=---------3&quot;&gt;ACID Guarantees and Apache Iceberg: Turning Any Storage into a Data Warehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/apache-iceberg-reliability-8ef491ff055f?source=---------8&quot;&gt;Apache Iceberg Reliability&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Getting Data into Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Here is a list of resources to help you get data into Apache Iceberg:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/8-tools-for-ingesting-data-into-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;8 Tools For Ingesting Data Into Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/introducing-auto-ingest-pipes-event-driven-ingestion-made-easy/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Event Based Ingestion for Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/ingesting-data-into-apache-iceberg-tables-with-dremio-a-unified-path-to-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-to-create-a-lakehouse-with-airbyte-s3-apache-iceberg-and-dremio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;How to Create a Lakehouse with Airbyte, S3, Apache Iceberg, and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-to-convert-json-files-into-an-apache-iceberg-table-with-dremio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;How to Convert JSON Files Into an Apache Iceberg Table with Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-to-convert-csv-files-into-an-apache-iceberg-table-with-dremio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;How to Convert CSV Files into an Apache Iceberg table with Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/building-your-data-lakehouse-just-got-a-whole-lot-easier-with-dremio-fivetran/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Ingesting Data into Apache Iceberg using Fivetran&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg Migration&lt;/h3&gt;
&lt;p&gt;Here is a list of resources to help you migrate your data to Apache Iceberg:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/migration-guide-for-apache-iceberg-lakehouses/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Migration Guide for Apache Iceberg Lakehouses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-xtable-converting-between-apache-iceberg-delta-lake-and-apache-hudi/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache XTable: Converting Between Apache Iceberg, Delta Lake, and Apache Hudi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/3-ways-to-convert-a-delta-lake-table-into-an-apache-iceberg-table/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;3 Ways to Convert a Delta Lake Table Into an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-to-migrate-a-hive-table-to-an-iceberg-table/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;How to Migrate a Hive Table to an Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/migrating-a-hive-table-to-an-iceberg-table-hands-on-tutorial/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Migrating a Hive Table to an Iceberg Table Hands-on Tutorial&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Streaming with Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Here is a list of resources to help you stream data into Apache Iceberg:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/cdc-with-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;A Guide to Change Data Capture (CDC) with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/ingesting-data-into-nessie-apache-iceberg-with-kafka-connect-and-querying-it-with-dremio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Kafka to Apache Iceberg to Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/streaming-and-batch-data-lakehouses-with-apache-iceberg-dremio-and-upsolver/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Streaming and Batch Data Lakehouses with Apache Iceberg, Dremio and Upsolver&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/using-flink-with-apache-iceberg-and-nessie/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Using Flink with Apache Iceberg and Nessie&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/streaming-data-into-apache-iceberg-tables-using-aws-kinesis-and-aws-glue/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Streaming Data into Apache Iceberg Tables Using AWS Kinesis and AWS Glue&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.upsolver.com/blog/adapting-iceberg-for-high-scale-streaming-data&quot;&gt;Adapting Iceberg for high-scale streaming data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Partitioning with Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Here is a list of resources to help you learn how to partition your data with Apache Iceberg:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/iceberg-reflections-partitioning/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Simplifying Your Partition Strategies with Dremio Reflections and Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/future-proof-partitioning-and-fewer-table-rewrites-with-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Partition Evolution: Future-Proof Partitioning and Fewer Table Rewrites with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Fewer Accidental Full Table Scans Brought to You by Apache Iceberg’s Hidden Partitioning&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Maintaining and Auditing Apache Iceberg Tables&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/page/2/#:~:text=Guide%20to%20Maintaining%20an%20Apache%20Iceberg%20Lakehouse&quot;&gt;Guide to Maintaining an Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/compaction-in-apache-iceberg-fine-tuning-your-iceberg-tables-data-files/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Compaction in Apache Iceberg: Fine-Tuning Your Iceberg Table’s Data Files&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-metadata-tables/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Leveraging Apache Iceberg Metadata Tables in Dremio for Effective Data Lakehouse Auditing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/what-is-dataops-automating-data-management-on-the-apache-iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;What is DataOps? Automating Data Management on the Apache Iceberg Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-z-ordering-in-apache-iceberg-helps-improve-performance/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;How Z-Ordering in Apache Iceberg Helps Improve Performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/maintaining-iceberg-tables-compaction-expiring-snapshots-and-more/&quot;&gt;Maintaining Iceberg Tables – Compaction, Expiring Snapshots, and More&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg Catalogs&lt;/h3&gt;
&lt;p&gt;Here is a list of resources to help you learn about Apache Iceberg Catalogs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-evolution-of-apache-iceberg-catalogs/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;The Evolution of Apache Iceberg Catalogs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/introducing-the-apache-iceberg-catalog-migration-tool/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Introducing the Apache Iceberg Catalog Migration Tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/what-iceberg-rest-catalog-is-and-isnt-b4a6d056f493?source=---------7&quot;&gt;What Iceberg REST Catalog Is and Isn’t&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-thinking-about-apache-iceberg-catalogs-like-nessie-and-apache-polaris-incubating-matters/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Why Thinking about Apache Iceberg Catalogs Like Nessie and Apache Polaris (incubating) Matters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/use-nessie-with-iceberg-rest-catalog/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Using Nessie’s REST Catalog Support for Working with Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-nessie-ecosystem-and-the-reach-of-git-for-data-for-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;The Nessie Ecosystem and the Reach of Git for Data for Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-polaris-data-catalog/&quot;&gt;Introduction to Apache Polaris (incubating) Data Catalog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/understanding-the-polaris-iceberg-catalog-and-its-architecture-4fefd7655fd1&quot;&gt;Understanding the Polaris Iceberg Catalog and Its Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/getting-hands-on-with-snowflake-managed-polaris/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Getting Hands-on with Snowflake Managed Polaris&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/getting-hands-on-with-polaris-oss-apache-iceberg-and-apache-spark/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Getting Hands-on with Polaris OSS, Apache Iceberg and Apache Spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-use-nessie-catalog-versioning-and-dbt-code-versioning/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;The Importance of Versioning in Modern Data Platforms: Catalog Versioning with Nessie vs. Code Versioning with dbt&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Querying Apache Iceberg Tables&lt;/h3&gt;
&lt;p&gt;Here is a list of resources to help you query your Apache Iceberg tables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://blog.min.io/query-iceberg-minio-dremio/&quot;&gt;Query Iceberg Tables on MinIO with Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/run-graph-queries-on-apache-iceberg-tables-with-dremio-puppygraph/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Run Graph Queries on Apache Iceberg Tables with Dremio &amp;amp; Puppygraph&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Hybrid Apache Iceberg Lakehouses&lt;/h3&gt;
&lt;p&gt;Here is a list of resources about implementing hybrid on-premises and cloud Apache Iceberg lakehouses:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/3-reasons-to-create-hybrid-apache-iceberg-data-lakehouses/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;3 Reasons to Create Hybrid Apache Iceberg Data Lakehouses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/hybrid-lakehouse-storage-solutions-netapp/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hybrid Iceberg Lakehouse Storage Solutions: NetApp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/hybrid-lakehouse-storage-solutions-minio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hybrid Iceberg Lakehouse Storage Solutions: MinIO&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/hybrid-lakehouse-infrastructure-solutions-vast-data/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hybrid Iceberg Lakehouse Infrastructure Solutions: VAST Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/hybrid-lakehouse-storage-solutions-purestorage/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hybrid Lakehouse Storage Solutions: Pure Storage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Iceberg and Other Formats&lt;/h3&gt;
&lt;p&gt;Here is a list of resources about Apache Iceberg and other formats (Apache Hudi, Apache Paimon, Delta Lake):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/comparing-apache-iceberg-to-other-data-lakehouse-solutions/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Comparing Apache Iceberg to Other Data Lakehouse Solutions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/exploring-the-architecture-of-apache-iceberg-delta-lake-and-apache-hudi/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/comparison-of-data-lake-table-formats-apache-iceberg-apache-hudi-and-delta-lake/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Comparison of Data Lake Table Formats (Apache Iceberg, Apache Hudi and Delta Lake)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/table-format-partitioning-comparison-apache-iceberg-apache-hudi-and-delta-lake/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Table Format Partitioning Comparison: Apache Iceberg, Apache Hudi, and Delta Lake&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/table-format-governance-and-community-contributions-apache-iceberg-apache-hudi-and-delta-lake/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Table Format Governance and Community Contributions: Apache Iceberg, Apache Hudi, and Delta Lake&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Python and Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Here is a list of resources about Apache Iceberg and Python:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/3-ways-to-use-python-with-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;3 Ways to Use Python with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://py.iceberg.apache.org/&quot;&gt;PyIceberg Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-pyiceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on with Apache Iceberg Tables using PyIceberg using Nessie and Minio&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Governing Apache Iceberg Tables&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-and-the-right-to-be-forgotten/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg and the Right to Be Forgotten&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://amdatalakehouse.substack.com/p/a-brief-guide-to-the-governance-of&quot;&gt;A Brief Guide to the Governance of Apache Iceberg Tables&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Miscellaneous Apache Iceberg Resources&lt;/h3&gt;
&lt;p&gt;Here is a list of miscellaneous resources to help you learn Apache Iceberg:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/iceberg-data-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Introduction to the Iceberg Data Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/iceberg-lakehouse/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;The Iceberg Lakehouse: Key Benefits for Your Business&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/evolving-the-data-lake-from-csv-json-to-parquet-to-apache-iceberg/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Evolving the Data Lake: From CSV/JSON to Parquet to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/data-sharing-of-apache-iceberg-tables-and-other-data-in-the-dremio-lakehouse/&quot;&gt;Data Sharing of Apache Iceberg tables and other data in the Dremio Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-value-of-dremios-semantic-layer-and-the-apache-iceberg-lakehouse-to-the-snowflake-user/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;The Value of Dremio’s Semantic Layer and The Apache Iceberg Lakehouse to the Snowflake User&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-who-what-and-why-of-data-reflections-and-apache-iceberg-for-query-acceleration/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;The Who, What and Why of Data Reflections and Apache Iceberg for Query Acceleration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-apache-iceberg-dremio-and-lakehouse-architecture-can-optimize-your-cloud-data-platform-costs/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;How Apache Iceberg, Dremio and Lakehouse Architecture can optimize your Cloud Data Platform Costs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/dremios-commitment-to-being-the-ideal-platform-for-apache-iceberg-data-lakehouses/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Dremio’s Commitment to being the Ideal Platform for Apache Iceberg Data Lakehouses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/open-source-and-the-data-lakehouse-apache-arrow-apache-iceberg-nessie-and-dremio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Open Source and the Data Lakehouse: Apache Arrow, Apache Iceberg, Nessie and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-why-and-how-of-using-apache-iceberg-on-databricks/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;The Why and How of Using Apache Iceberg on Databricks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/deep-dive-into-configuring-your-apache-iceberg-catalog-with-apache-spark/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Deep Dive Into Configuring Your Apache Iceberg Catalog with Apache Spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/connecting-tableau-to-apache-iceberg-tables-with-dremio/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Connecting Tableau to Apache Iceberg Tables with Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-faq/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=ultimate_directory_of_apache_iceberg_resources&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg FAQ&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/why-data-analysts-engineers-architects-and-scientists-should-care-about-dremio-and-apache-iceberg-361ba9e01f38?source=---------1&quot;&gt;Why Data Analysts, Engineers, Architects and Scientists Should Care about Dremio and Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://blog.min.io/uncover-data-lake-nessie-dremio-iceberg/&quot;&gt;Data Lake Mysteries Unveiled: Nessie, Dremio, and MinIO Make Waves&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Change Data Capture (CDC) when there is no CDC</title><link>https://iceberglakehouse.com/posts/2024-10-cdc-when-there-is-no-cdc/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-10-cdc-when-there-is-no-cdc/</guid><description>
- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;...</description><pubDate>Fri, 04 Oct 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/cdc-with-apache-iceberg/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;A Guide to Change Data Capture (CDC) with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/ingesting-data-into-nessie-apache-iceberg-with-kafka-connect-and-querying-it-with-dremio/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Using Apache Iceberg with Kafka Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/using-flink-with-apache-iceberg-and-nessie/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Using Apache Iceberg with Flink&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/streaming-and-batch-data-lakehouses-with-apache-iceberg-dremio-and-upsolver/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Streaming and Batch Data Lakehouses with Apache Iceberg, Dremio and Upsolver&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;h3&gt;Overview of CDC&lt;/h3&gt;
&lt;p&gt;Change Data Capture (CDC) is the process of identifying and capturing changes made to data within a database. It&apos;s a critical technique in modern data architectures, enabling systems to stay synchronized, whether for analytical purposes, replication, or near-real-time data streaming. CDC helps minimize the need to reprocess entire datasets by focusing on only the incremental changes—new inserts, updates, and deletions.&lt;/p&gt;
&lt;h3&gt;Challenges with Systems Lacking Native CDC&lt;/h3&gt;
&lt;p&gt;Many databases offer native CDC features, such as change logs or triggers, that automatically track data modifications. However, when working with systems that don’t provide these built-in features, implementing CDC becomes more challenging. You need to design tables and processes that allow you to track changes manually while minimizing performance impact. Without proper design, you may face issues like slow updates, data inconsistencies, and complex synchronization logic.&lt;/p&gt;
&lt;h3&gt;Goals of This Blog&lt;/h3&gt;
&lt;p&gt;In this blog, we will explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;How to design tables&lt;/strong&gt; that enable efficient incremental updates, even in systems without CDC.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to write SQL&lt;/strong&gt; for applying changes to other tables effectively, reducing the overhead of full table scans or complete reloads.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Designing Tables for Incremental Updates&lt;/h2&gt;
&lt;h3&gt;Use of Timestamps and Version Columns&lt;/h3&gt;
&lt;p&gt;One of the simplest and most effective ways to track changes in systems without CDC is by adding &lt;code&gt;updated_at&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt; columns to your tables. These timestamp columns can provide a clear audit trail of when rows were inserted or modified. For updates, the &lt;code&gt;updated_at&lt;/code&gt; field gets refreshed, allowing you to easily query records that have changed since the last update.&lt;/p&gt;
&lt;p&gt;Additionally, incorporating a version column helps track the number of times a record has been modified. Each time a row is updated, the version increases, making it easier to detect whether a record needs to be synchronized elsewhere. For soft deletes, adding a &lt;code&gt;deleted_at&lt;/code&gt; column can signal when a row has been marked for deletion without physically removing it.&lt;/p&gt;
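&lt;p&gt;A minimal runnable sketch of these columns, using Python&apos;s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (the &lt;code&gt;customers&lt;/code&gt; table and its timestamps are purely illustrative):&lt;/p&gt;

```python
# Sketch of change-tracking columns using Python's built-in sqlite3.
# Table and column names are illustrative, not from any specific system.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT,
        version INTEGER NOT NULL DEFAULT 1,
        created_at TEXT NOT NULL,
        updated_at TEXT NOT NULL,
        deleted_at TEXT  -- stays NULL until the row is soft-deleted
    )
""")
conn.execute(
    "INSERT INTO customers (id, name, created_at, updated_at) "
    "VALUES (1, 'Ada', '2024-10-01T00:00:00', '2024-10-01T00:00:00')"
)

# An update bumps the version and refreshes updated_at in the same statement.
conn.execute(
    "UPDATE customers "
    "SET name = 'Ada L.', version = version + 1, updated_at = '2024-10-02T00:00:00' "
    "WHERE id = 1"
)

# A soft delete only stamps deleted_at; the row remains queryable for sync.
conn.execute("UPDATE customers SET deleted_at = '2024-10-03T00:00:00' WHERE id = 1")

name, version, deleted_at = conn.execute(
    "SELECT name, version, deleted_at FROM customers WHERE id = 1"
).fetchone()
```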
&lt;h3&gt;Storing Historical Data&lt;/h3&gt;
&lt;p&gt;Maintaining historical data is another technique for simulating CDC. You can create history or audit tables that store every change made to a record, preserving its previous versions. This enables you to recreate the full history of changes while still maintaining a &amp;quot;live&amp;quot; table with only the current state.&lt;/p&gt;
&lt;p&gt;When capturing history:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use a combination of &lt;code&gt;INSERT&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt; triggers, or manually insert new versions into the history table.&lt;/li&gt;
&lt;li&gt;Each entry can include metadata like &lt;code&gt;modified_by&lt;/code&gt; and &lt;code&gt;operation_type&lt;/code&gt; (insert, update, delete) to clarify how the change occurred.&lt;/li&gt;
&lt;/ul&gt;
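&lt;p&gt;Here is one way the manual approach might look in practice, sketched with &lt;code&gt;sqlite3&lt;/code&gt;; the &lt;code&gt;accounts&lt;/code&gt; tables and the &lt;code&gt;apply_change&lt;/code&gt; helper are hypothetical names, not a standard API:&lt;/p&gt;

```python
# Sketch of a manually maintained history table (sqlite3; names are illustrative).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL, updated_at TEXT);
    CREATE TABLE accounts_history (
        id INTEGER, balance REAL, updated_at TEXT,
        operation_type TEXT, modified_by TEXT
    );
""")

def apply_change(conn, op, account_id, balance, ts, user):
    """Write the change to the live table and append a row to the history table."""
    if op == "insert":
        conn.execute("INSERT INTO accounts VALUES (?, ?, ?)", (account_id, balance, ts))
    elif op == "update":
        conn.execute("UPDATE accounts SET balance = ?, updated_at = ? WHERE id = ?",
                     (balance, ts, account_id))
    elif op == "delete":
        conn.execute("DELETE FROM accounts WHERE id = ?", (account_id,))
    # Every operation, including deletes, leaves a history row with metadata.
    conn.execute("INSERT INTO accounts_history VALUES (?, ?, ?, ?, ?)",
                 (account_id, balance, ts, op, user))

apply_change(conn, "insert", 1, 100.0, "2024-10-01", "etl_job")
apply_change(conn, "update", 1, 150.0, "2024-10-02", "etl_job")

live_balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
history_rows = conn.execute(
    "SELECT operation_type FROM accounts_history ORDER BY updated_at"
).fetchall()
```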
&lt;h3&gt;Partitioning and Indexing Strategies&lt;/h3&gt;
&lt;p&gt;For larger tables, partitioning by time or version can help optimize incremental updates. By partitioning on columns like &lt;code&gt;updated_at&lt;/code&gt;, you can narrow down the scope of queries to the most recent partitions, avoiding costly full table scans.&lt;/p&gt;
&lt;p&gt;Similarly, indexing on these columns can accelerate queries by making it faster to retrieve the most recently modified rows. Keep in mind that partitioning and indexing come with trade-offs, such as increased write overhead, so it’s important to balance these optimizations based on the use case.&lt;/p&gt;
&lt;h2&gt;Writing SQL for Efficient Incremental Updates&lt;/h2&gt;
&lt;h3&gt;Identifying the Changes&lt;/h3&gt;
&lt;p&gt;Once your tables are designed to track changes, the next step is to efficiently query for only the modified rows. This can be achieved by leveraging the &lt;code&gt;updated_at&lt;/code&gt; or &lt;code&gt;version&lt;/code&gt; columns.&lt;/p&gt;
&lt;p&gt;For example, to select all records updated in the last hour, you can use a query like:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * 
FROM your_table
WHERE updated_at &amp;gt; NOW() - INTERVAL &apos;1 hour&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, if you&apos;re using a versioning system, you can select all records with a version greater than the last processed version:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT * 
FROM your_table
WHERE version &amp;gt; :last_processed_version;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These queries allow you to efficiently identify incremental changes without having to scan the entire table.&lt;/p&gt;
&lt;h3&gt;Merging Changes into Target Tables&lt;/h3&gt;
&lt;p&gt;After identifying the changes, you&apos;ll want to apply them to the target tables (such as a reporting or analytics table). Depending on the database you&apos;re using, you might employ different strategies. If the database supports the MERGE statement, it can handle inserts, updates, and deletions in a single query.&lt;/p&gt;
&lt;p&gt;Here’s an example of how you can merge changes from a source table into a target table:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE INTO target_table AS t
USING source_table AS s
ON t.id = s.id
WHEN MATCHED AND s.updated_at &amp;gt; t.updated_at THEN
  UPDATE SET column1 = s.column1, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, column1, updated_at) 
  VALUES (s.id, s.column1, s.updated_at);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In systems where MERGE isn’t supported, you can use a combination of INSERT and UPDATE queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Update existing records
UPDATE target_table AS t
SET column1 = s.column1, updated_at = s.updated_at
FROM source_table AS s
WHERE t.id = s.id AND s.updated_at &amp;gt; t.updated_at;

-- Insert new records
INSERT INTO target_table (id, column1, updated_at)
SELECT s.id, s.column1, s.updated_at
FROM source_table AS s
WHERE NOT EXISTS (
    SELECT 1 FROM target_table t WHERE t.id = s.id
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Handling Conflicts&lt;/h3&gt;
&lt;p&gt;When applying incremental changes, conflicts can arise, especially if multiple systems or users are updating the same data. One common conflict occurs when duplicate records are inserted or when simultaneous updates lead to inconsistencies.&lt;/p&gt;
&lt;p&gt;To manage this, databases often provide clauses like &lt;code&gt;ON CONFLICT&lt;/code&gt; (in PostgreSQL) or similar approaches in other systems. For example, to avoid conflicts during inserts, you can specify an action in case of a duplicate key:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO target_table (id, column1, updated_at)
VALUES (:id, :column1, :updated_at)
ON CONFLICT (id) DO UPDATE
SET column1 = EXCLUDED.column1, updated_at = EXCLUDED.updated_at;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that if the row already exists, it will be updated rather than throwing an error.&lt;/p&gt;
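&lt;p&gt;Because SQLite shares the &lt;code&gt;ON CONFLICT ... DO UPDATE&lt;/code&gt; syntax with PostgreSQL, the upsert pattern can be sketched end-to-end with Python&apos;s standard library (table and column names are illustrative):&lt;/p&gt;

```python
# Upsert sketch: SQLite shares PostgreSQL's ON CONFLICT ... DO UPDATE syntax,
# so the pattern from the post can be demonstrated with the standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target_table (id INTEGER PRIMARY KEY, column1 TEXT, updated_at TEXT)"
)

upsert = """
    INSERT INTO target_table (id, column1, updated_at)
    VALUES (?, ?, ?)
    ON CONFLICT (id) DO UPDATE
    SET column1 = excluded.column1, updated_at = excluded.updated_at
"""
conn.execute(upsert, (1, "first", "2024-10-01"))   # inserts a new row
conn.execute(upsert, (1, "second", "2024-10-02"))  # conflicts, so it updates instead

row = conn.execute("SELECT column1, updated_at FROM target_table WHERE id = 1").fetchone()
count = conn.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]
```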
&lt;h2&gt;Batch Processing and Scheduling Incremental Updates&lt;/h2&gt;
&lt;h3&gt;Batching Updates&lt;/h3&gt;
&lt;p&gt;To avoid overwhelming your system or locking your database during large updates, batching incremental updates can be an effective strategy. Instead of applying all changes at once, process them in smaller, manageable chunks. For example, you can process updates in batches of 1,000 rows at a time.&lt;/p&gt;
&lt;p&gt;Here’s how you could implement batch processing in SQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Assume a batch size of 1000 rows
WITH batch AS (
    SELECT * FROM source_table 
    WHERE updated_at &amp;gt; :last_processed_time
    ORDER BY updated_at
    LIMIT 1000
)
-- Apply the batch to the target table
MERGE INTO target_table AS t
USING batch AS b
ON t.id = b.id
WHEN MATCHED AND b.updated_at &amp;gt; t.updated_at THEN
  UPDATE SET column1 = b.column1, updated_at = b.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, column1, updated_at) 
  VALUES (b.id, b.column1, b.updated_at);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After processing each batch, you can update your tracking mechanism (e.g., the last processed timestamp or version) and continue with the next batch.&lt;/p&gt;
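&lt;p&gt;Putting the loop together, a batch runner might look roughly like this &lt;code&gt;sqlite3&lt;/code&gt; sketch, where the watermark is advanced only after each chunk is applied (all names are illustrative):&lt;/p&gt;

```python
# Sketch of batch processing with a watermark: pull changes in chunks ordered by
# updated_at, apply each chunk, and advance the last-processed timestamp.
# (sqlite3 stands in for the source/target databases; names are illustrative.)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_table (id INTEGER PRIMARY KEY, column1 TEXT, updated_at TEXT);
    CREATE TABLE target_table (id INTEGER PRIMARY KEY, column1 TEXT, updated_at TEXT);
""")
conn.executemany("INSERT INTO source_table VALUES (?, ?, ?)",
                 [(i, f"row-{i}", f"2024-10-01T00:00:{i:02d}") for i in range(5)])

BATCH_SIZE = 2
last_processed = ""  # watermark: empty string sorts before any timestamp
batches = 0
while True:
    batch = conn.execute(
        "SELECT id, column1, updated_at FROM source_table "
        "WHERE updated_at > ? ORDER BY updated_at LIMIT ?",
        (last_processed, BATCH_SIZE),
    ).fetchall()
    if not batch:
        break
    conn.executemany(
        "INSERT INTO target_table VALUES (?, ?, ?) "
        "ON CONFLICT (id) DO UPDATE SET column1 = excluded.column1, "
        "updated_at = excluded.updated_at",
        batch,
    )
    last_processed = batch[-1][2]  # advance the watermark past this chunk
    batches += 1

copied = conn.execute("SELECT COUNT(*) FROM target_table").fetchone()[0]
```

&lt;p&gt;Note that the strict &lt;code&gt;&amp;gt;&lt;/code&gt; comparison assumes timestamps are unique; if ties are possible, use a compound cursor such as &lt;code&gt;(updated_at, id)&lt;/code&gt; so rows sharing a timestamp are not skipped.&lt;/p&gt;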
&lt;h3&gt;Efficient Scheduling of Updates&lt;/h3&gt;
&lt;p&gt;To ensure that your incremental updates happen regularly, you need an efficient scheduling mechanism. For databases without native job scheduling, you can rely on external tools such as cron jobs, Airflow, or other orchestration systems.&lt;/p&gt;
&lt;p&gt;Here’s an example of how you might schedule your updates using a cron job:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Run every 10 minutes
*/10 * * * * /path/to/script/incremental_update.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In environments using more complex workflows, you can configure tools like Apache Airflow to orchestrate these updates, handling dependencies and retries in case of failure.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Airflow DAG for incremental updates
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    &apos;incremental_update&apos;,
    start_date=datetime(2023, 1, 1),
    schedule_interval=&apos;*/10 * * * *&apos;,  # run every 10 minutes
    catchup=False,
) as dag:
    run_incremental_update = BashOperator(
        task_id=&apos;run_incremental_update&apos;,
        bash_command=&apos;python /path/to/incremental_update.py&apos;,
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By automating these updates, you can ensure your target systems stay synchronized with minimal manual intervention.&lt;/p&gt;
&lt;h2&gt;Monitoring and Validating Incremental Updates&lt;/h2&gt;
&lt;h3&gt;Tracking Changes Applied&lt;/h3&gt;
&lt;p&gt;To ensure that your incremental updates are working as expected, it&apos;s essential to track and log the changes applied to your target tables. This can be done by maintaining control tables or logs that record the status of each update operation. These logs can store details such as the number of rows processed, the time the update occurred, and any errors that were encountered.&lt;/p&gt;
&lt;p&gt;For example, you can create a simple audit table to track update operations:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE update_log (
    update_id SERIAL PRIMARY KEY,
    table_name TEXT,
    rows_updated INT,
    update_time TIMESTAMP DEFAULT NOW(),
    status TEXT,
    error_message TEXT
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After each incremental update, you can insert a record into this log to track the operation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO update_log (table_name, rows_updated, status)
VALUES (&apos;target_table&apos;, :rows_updated, &apos;success&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If an error occurs, you can capture it and log the details:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO update_log (table_name, status, error_message)
VALUES (&apos;target_table&apos;, &apos;failed&apos;, &apos;Error details...&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Data Validation and Consistency Checks&lt;/h3&gt;
&lt;p&gt;Beyond simply logging updates, you should also perform regular validation checks to ensure the correctness of the data. One approach is to compare record counts between the source and target tables to ensure they are in sync:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Count of records in the source table since last update
SELECT COUNT(*) FROM source_table WHERE updated_at &amp;gt; :last_update_time;

-- Count of records updated in the target table
SELECT COUNT(*) FROM target_table WHERE updated_at &amp;gt; :last_update_time;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the counts don&apos;t match, it could indicate a data inconsistency or a problem with the update process.&lt;/p&gt;
&lt;p&gt;You can also compute checksums or hash values of critical columns to validate that the data was transferred without corruption:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Compute checksum on the source (order the rows so the hash is deterministic)
SELECT MD5(STRING_AGG(column1 || column2, &apos;&apos; ORDER BY id)) AS checksum
FROM source_table WHERE updated_at &amp;gt; :last_update_time;

-- Compute checksum on the target
SELECT MD5(STRING_AGG(column1 || column2, &apos;&apos; ORDER BY id)) AS checksum
FROM target_table WHERE updated_at &amp;gt; :last_update_time;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If discrepancies are found, you can investigate and reprocess the affected batches. Regular consistency checks ensure that the target table accurately reflects the latest state of the source data.&lt;/p&gt;
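&lt;p&gt;The same comparison can be driven from application code; here is a sketch using &lt;code&gt;hashlib&lt;/code&gt; over an ordered concatenation, mirroring the SQL checksum idea (table names are illustrative):&lt;/p&gt;

```python
# Sketch of a checksum-style consistency check between source and target
# (hashlib over an ordered concatenation of the critical columns).
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_table (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT);
    CREATE TABLE target_table (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT);
""")
rows = [(1, "a", "x"), (2, "b", "y")]
conn.executemany("INSERT INTO source_table VALUES (?, ?, ?)", rows)
conn.executemany("INSERT INTO target_table VALUES (?, ?, ?)", rows)

def table_checksum(conn, table):
    """MD5 over column1||column2 in a deterministic (id) order."""
    data = conn.execute(
        f"SELECT column1 || column2 FROM {table} ORDER BY id"
    ).fetchall()
    joined = "".join(value for (value,) in data)
    return hashlib.md5(joined.encode()).hexdigest()

in_sync = table_checksum(conn, "source_table") == table_checksum(conn, "target_table")
```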
&lt;h2&gt;Real-world Example: Applying Incremental Changes&lt;/h2&gt;
&lt;h3&gt;Scenario Setup&lt;/h3&gt;
&lt;p&gt;Let’s take a practical example where you are managing sales transactions in a source table and need to keep an analytics table in sync for reporting purposes. The source table contains new and updated transactions, and the target table is an aggregated summary of sales per product.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Source Table (sales_transactions):&lt;/strong&gt; Contains individual transaction records with &lt;code&gt;transaction_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, and &lt;code&gt;updated_at&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Target Table (product_sales):&lt;/strong&gt; Aggregates total sales by product with &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;total_sales&lt;/code&gt;, and &lt;code&gt;last_updated&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step-by-step Guide&lt;/h3&gt;
&lt;h4&gt;Design the source table:&lt;/h4&gt;
&lt;p&gt;Ensure the source table includes an &lt;code&gt;updated_at&lt;/code&gt; column to track when each transaction was last modified.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE sales_transactions (
    transaction_id SERIAL PRIMARY KEY,
    product_id INT,
    amount DECIMAL(10, 2),
    updated_at TIMESTAMP DEFAULT NOW()
);
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Write SQL for incremental updates:&lt;/h4&gt;
&lt;p&gt;Use SQL to identify the new or modified transactions since the last update. For example, select transactions that occurred in the last 24 hours:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT product_id, SUM(amount) AS total_sales
FROM sales_transactions
WHERE updated_at &amp;gt; NOW() - INTERVAL &apos;24 hours&apos;
GROUP BY product_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Merge changes into the target table:&lt;/h4&gt;
&lt;p&gt;Use a &lt;code&gt;MERGE&lt;/code&gt; or &lt;code&gt;UPSERT&lt;/code&gt; query to update the &lt;code&gt;product_sales&lt;/code&gt; table with the latest totals. If the product exists, add the new sales to its &lt;code&gt;total_sales&lt;/code&gt;; otherwise, insert a new record. Because this merge accumulates totals, run it exactly once per window (or filter on a stored watermark rather than a fixed interval) so the same transactions are not counted twice:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE INTO product_sales AS p
USING (SELECT product_id, SUM(amount) AS total_sales
       FROM sales_transactions
       WHERE updated_at &amp;gt; NOW() - INTERVAL &apos;24 hours&apos;
       GROUP BY product_id) AS s
ON p.product_id = s.product_id
WHEN MATCHED THEN
  UPDATE SET total_sales = p.total_sales + s.total_sales, last_updated = NOW()
WHEN NOT MATCHED THEN
  INSERT (product_id, total_sales, last_updated)
  VALUES (s.product_id, s.total_sales, NOW());
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Handle batching and scheduling:&lt;/h4&gt;
&lt;p&gt;If the transaction volume is large, you can batch these updates by limiting the number of raw transaction rows pulled in each run, then aggregating just that batch. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT product_id, SUM(amount) AS total_sales
FROM (
    SELECT product_id, amount
    FROM sales_transactions
    WHERE updated_at &amp;gt; :last_update_time
    ORDER BY updated_at
    LIMIT 1000
) AS batch
GROUP BY product_id;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Schedule this query to run periodically using a job scheduler like cron or an orchestration tool like Airflow to ensure the data stays up-to-date.&lt;/p&gt;
&lt;h2&gt;Real-World CDC Alternatives for Non-CDC Systems&lt;/h2&gt;
&lt;h3&gt;Leveraging Triggers for Change Capture&lt;/h3&gt;
&lt;p&gt;In databases that support triggers but lack full CDC features, you can use triggers to manually implement a change tracking system. Triggers allow you to capture row-level changes on insert, update, or delete operations and store these changes in a separate table.&lt;/p&gt;
&lt;p&gt;For example, in PostgreSQL, you can create a trigger to capture changes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE change_log (
    id SERIAL PRIMARY KEY,
    table_name TEXT,
    operation_type TEXT,
    record_id INT,
    old_data JSONB,
    new_data JSONB,
    change_time TIMESTAMP DEFAULT NOW()
);

CREATE OR REPLACE FUNCTION log_changes()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = &apos;DELETE&apos; THEN
        INSERT INTO change_log (table_name, operation_type, record_id, old_data, new_data)
        VALUES (TG_TABLE_NAME, TG_OP, OLD.id, to_jsonb(OLD), NULL);
        RETURN OLD;
    END IF;
    INSERT INTO change_log (table_name, operation_type, record_id, old_data, new_data)
    VALUES (TG_TABLE_NAME, TG_OP, NEW.id,
            CASE WHEN TG_OP = &apos;UPDATE&apos; THEN to_jsonb(OLD) END,
            to_jsonb(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER capture_changes
AFTER INSERT OR UPDATE OR DELETE
ON your_table
FOR EACH ROW
EXECUTE FUNCTION log_changes();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This approach simulates CDC by writing changes to a log table. You can then process this change log periodically to apply updates to your target systems.&lt;/p&gt;
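&lt;p&gt;Processing that log might look roughly like the following sketch, where JSON text stands in for PostgreSQL&apos;s JSONB payloads and the &lt;code&gt;replica&lt;/code&gt; table is a hypothetical target:&lt;/p&gt;

```python
# Sketch of draining a trigger-populated change_log and replaying it against a
# replica table (sqlite3; JSON text stands in for PostgreSQL's JSONB payloads).
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE change_log (
        id INTEGER PRIMARY KEY,
        operation_type TEXT,
        record_id INTEGER,
        new_data TEXT
    );
    CREATE TABLE replica (id INTEGER PRIMARY KEY, name TEXT);
""")
conn.executemany(
    "INSERT INTO change_log (operation_type, record_id, new_data) VALUES (?, ?, ?)",
    [
        ("INSERT", 1, json.dumps({"id": 1, "name": "Ada"})),
        ("UPDATE", 1, json.dumps({"id": 1, "name": "Ada L."})),
        ("INSERT", 2, json.dumps({"id": 2, "name": "Grace"})),
        ("DELETE", 2, None),
    ],
)

# Replay changes in commit order; an upsert covers both INSERT and UPDATE.
for op, record_id, payload in conn.execute(
    "SELECT operation_type, record_id, new_data FROM change_log ORDER BY id"
):
    if op == "DELETE":
        conn.execute("DELETE FROM replica WHERE id = ?", (record_id,))
    else:
        row = json.loads(payload)
        conn.execute(
            "INSERT INTO replica VALUES (?, ?) "
            "ON CONFLICT (id) DO UPDATE SET name = excluded.name",
            (row["id"], row["name"]),
        )

replica_rows = conn.execute("SELECT id, name FROM replica ORDER BY id").fetchall()
```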
&lt;h3&gt;Using External CDC Tools&lt;/h3&gt;
&lt;p&gt;In cases where implementing your own CDC system becomes too complex or resource-intensive, third-party CDC tools like Debezium can provide a reliable solution. Debezium, for instance, is an open-source platform that captures database changes and publishes them as events in Kafka, allowing you to stream changes to other systems in near-real-time.&lt;/p&gt;
&lt;p&gt;Debezium supports databases like MySQL, PostgreSQL, and MongoDB and can track insert, update, and delete operations via the database&apos;s transaction log (MySQL&apos;s binlog, PostgreSQL&apos;s write-ahead log, MongoDB&apos;s oplog). This makes it particularly useful when scaling up your CDC needs across multiple systems.&lt;/p&gt;
&lt;h3&gt;Batch Processing with ETL Pipelines&lt;/h3&gt;
&lt;p&gt;Another alternative is to simulate CDC through traditional ETL (Extract, Transform, Load) pipelines. Many ETL tools allow you to set up incremental data loads where only the changes since the last load are processed. This approach might not provide real-time changes, but it can work well for batch processing use cases.&lt;/p&gt;
&lt;p&gt;Tools like Apache NiFi, Airflow, and Talend allow you to build robust ETL workflows that can efficiently handle incremental updates. You can configure them to read from source tables based on timestamps or other tracking columns and apply those changes to target systems.&lt;/p&gt;
&lt;p&gt;This approach is often more suitable for less frequent updates or larger datasets where near-real-time processing is not necessary.&lt;/p&gt;
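&lt;p&gt;Whatever ETL tool you use, the batch pattern itself is simple: split the changed rows into fixed-size chunks and apply each chunk as its own transaction so no single run overwhelms the target. A minimal Python sketch of the chunking (the row contents here are invented for the demo):&lt;/p&gt;

```python
def batches(rows, size):
    """Yield rows in fixed-size chunks so each transaction stays small."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Hypothetical set of changed rows pulled by one incremental ETL run.
changed_rows = [{"id": i, "qty": i * 2} for i in range(1, 11)]

applied = 0
batch_count = 0
for batch in batches(changed_rows, size=4):
    # In a real pipeline this would be one MERGE/UPSERT per batch,
    # committed as its own transaction.
    applied += len(batch)
    batch_count += 1
```

&lt;p&gt;Ten changed rows are applied across three batches of at most four rows each; tuning the batch size trades throughput against lock time and memory on the target.&lt;/p&gt;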
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;h3&gt;Summary of Key Takeaways&lt;/h3&gt;
&lt;p&gt;In systems that don’t have native CDC (Change Data Capture) capabilities, it is still possible to design a process for capturing and applying incremental updates efficiently. By carefully structuring your tables with timestamps, version columns, or history tables, you can track changes without requiring full table scans. Writing efficient SQL for merging, batching, and scheduling updates ensures that the changes are applied in a scalable and reliable manner.&lt;/p&gt;
&lt;p&gt;Key points to remember:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Design your tables&lt;/strong&gt; to support change tracking by using columns like &lt;code&gt;updated_at&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;, and &lt;code&gt;version&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use SQL strategies&lt;/strong&gt; like &lt;code&gt;MERGE&lt;/code&gt; or &lt;code&gt;UPSERT&lt;/code&gt; to apply changes to your target tables, minimizing resource consumption.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batch your updates&lt;/strong&gt; to avoid overwhelming your database and schedule these operations for optimal performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor and validate&lt;/strong&gt; changes to ensure your target data remains consistent and correct.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Further Improvements&lt;/h3&gt;
&lt;p&gt;While the approaches covered in this blog offer practical solutions for implementing CDC in systems without built-in features, there are always opportunities for further optimization:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Adopting middleware or third-party tools&lt;/strong&gt;: Tools like &lt;strong&gt;Debezium&lt;/strong&gt; or &lt;strong&gt;Apache Kafka&lt;/strong&gt; can provide change capture capabilities even for systems without native CDC support.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Moving towards CDC-enabled databases&lt;/strong&gt;: As data needs grow, switching to databases with native CDC features can offer better scalability and reliability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Implementing more advanced validation mechanisms&lt;/strong&gt;: Consider using more sophisticated data quality tools or building in redundancy checks for mission-critical data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With these principles, you can handle incremental updates in a variety of systems, helping to synchronize your data with efficiency and reliability.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/cdc-with-apache-iceberg/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;A Guide to Change Data Capture (CDC) with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/ingesting-data-into-nessie-apache-iceberg-with-kafka-connect-and-querying-it-with-dremio/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Using Apache Iceberg with Kafka Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/using-flink-with-apache-iceberg-and-nessie/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Using Apache Iceberg with Flink&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/streaming-and-batch-data-lakehouses-with-apache-iceberg-dremio-and-upsolver/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=cdc_when_there_is_no_cdc&quot;&gt;Streaming and Batch Data Lakehouses with Apache Iceberg, Dremio and Upsolver&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Virtualization + Lakehouse + Mesh = Data At Scale</title><link>https://iceberglakehouse.com/posts/2024-9-virtualization-lakehouse-mesh-data-at-scale/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-9-virtualization-lakehouse-mesh-data-at-scale/</guid><description>
- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external...</description><pubDate>Wed, 25 Sep 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=decentcent&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=decentcent&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As data continues to grow exponentially in scale, speed, and variety, organizations are grappling with the challenges of managing and leveraging vast amounts of information. Traditional data architectures rely on extensive pipelines and on data spread across databases, data lakes, and warehouses, each with its own user access and governance challenges, and they are proving too slow, rigid, and costly to meet modern business needs. The crux of the problem lies in data silos: isolated pockets of data curated by a central team that hinder collaboration, slow decision-making, and lead to inefficiencies.&lt;/p&gt;
&lt;h2&gt;The Paradigm Shift: Centralized Access Curated by Many&lt;/h2&gt;
&lt;p&gt;To overcome these challenges, a better approach is to flip the script: instead of users accessing data scattered across many places and curated by a central team, have users access data in a centralized place curated by many teams. This approach combines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Unification&lt;/strong&gt;: Providing centralized access to all data, breaking down silos and enabling seamless analytics.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Decentralization&lt;/strong&gt;: Empowering individual teams to manage and prepare their own data assets, fostering flexibility and innovation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By unifying data access while decentralizing its ownership and preparation, organizations can achieve enhanced collaboration, improved data quality, and faster time-to-insight.&lt;/p&gt;
&lt;h2&gt;Trends Driving the Transformation&lt;/h2&gt;
&lt;p&gt;Three key trends are propelling this shift:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Lakehouse&lt;/strong&gt;: A hybrid architecture that combines the storage capabilities of data lakes with the analytical power of data warehouses. It allows for unified storage and analytics using open formats, supporting diverse workloads and simplifying data management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Virtualization&lt;/strong&gt;: Technology that provides real-time access to data across multiple sources without moving or duplicating it. It offers a unified view of data, reducing data movement, and enabling agile decision-making.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Mesh&lt;/strong&gt;: A decentralized approach assigning data ownership to domain-specific teams. It treats data as a product, managed with the same rigor as customer-facing offerings, enhancing scalability and innovation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Dremio: Bridging Centralized Access and Decentralized Management&lt;/h2&gt;
&lt;p&gt;Dremio is a data lakehouse platform that uniquely combines data unification and decentralization. Here&apos;s how Dremio enables this paradigm shift:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unified Data Access&lt;/strong&gt;: Dremio&apos;s platform allows users to access and analyze data from various sources through a single interface, overcoming data silos without the need for data movement or duplication. Dremio provides access to databases (PostgreSQL, MongoDB, etc.), data lakes (S3, ADLS, MinIO, etc.), data warehouses (Snowflake, Redshift, etc.), and lakehouse catalogs (AWS Glue, Apache Polaris (incubating), Hive, etc.), all in one unified access point.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Empowering Teams&lt;/strong&gt;: By supporting data decentralization, Dremio enables domain teams to manage and prepare their own data using preferred tools and systems, ensuring data quality and relevance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open-Source Foundation&lt;/strong&gt;: Leveraging technologies like Apache Arrow for high-performance in-memory processing, Apache Iceberg for robust data lakehouse capabilities, and Project Nessie for version control and governance, Dremio ensures flexibility and avoids vendor lock-in.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Performance and Scalability&lt;/strong&gt;: Dremio&apos;s architecture, built on these open-source technologies, delivers enhanced query performance, scalability, and supports diverse analytics workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Benefits of the New Approach with Dremio&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enhanced Collaboration&lt;/strong&gt;: Centralized access to data curated by various teams fosters collaboration and consistent data usage across the organization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improved Data Quality&lt;/strong&gt;: Domain experts manage their data products, leading to more accurate and contextually relevant datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Operational Efficiency&lt;/strong&gt;: Reduces redundant efforts and streamlines workflows, lowering costs and resource utilization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Agility and Innovation&lt;/strong&gt;: Decentralized teams can rapidly adapt and innovate without impacting the entire system, enabling quicker responses to market changes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Organizations must adopt innovative solutions to unlock the full potential of their data assets. By shifting to a model where users access data in a centralized place curated by many teams, businesses can overcome the limitations of traditional data architectures. Dremio&apos;s unique combination of data unification and decentralization, powered by cutting-edge open-source technologies, positions it as the ideal platform to enable this paradigm shift.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/dremio-enables-data-unification-and-decentralization/&quot;&gt;Read This Article for a Deeper Exploration of Dremio&apos;s Centralization through Decentralization&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Resources to Learn More about Iceberg&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=decentcent&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=decentcent&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=decentcent&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=decentcent&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Deep Dive into Data Apps with Streamlit</title><link>https://iceberglakehouse.com/posts/2024-9-deep-dive-into-data-apps-with-streamlit/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-9-deep-dive-into-data-apps-with-streamlit/</guid><description>
# Introduction

The ability to quickly develop and deploy interactive applications is invaluable. **Streamlit** is a powerful tool that enables data ...</description><pubDate>Sun, 22 Sep 2024 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;The ability to quickly develop and deploy interactive applications is invaluable. &lt;strong&gt;Streamlit&lt;/strong&gt; is a powerful tool that enables data scientists and developers to create intuitive web apps with minimal code. Coupled with the &lt;a href=&quot;https://hub.docker.com/r/alexmerced/datanotebook&quot;&gt;&lt;strong&gt;Python Data Science Notebook Docker Image&lt;/strong&gt;&lt;/a&gt;, which comes pre-loaded with essential data science libraries, setting up a robust environment for building Streamlit apps has never been easier.&lt;/p&gt;
&lt;h2&gt;What is Streamlit?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Streamlit&lt;/strong&gt; is an open-source Python library that simplifies the process of creating interactive web applications for data science and machine learning projects. With Streamlit, you can transform your data scripts into shareable web apps in just a few minutes, all using pure Python. There&apos;s no need for front-end development skills or knowledge of web frameworks like Flask or Django.&lt;/p&gt;
&lt;p&gt;Key features of Streamlit include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Easy to Use&lt;/strong&gt;: Build apps with a few lines of code using straightforward APIs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interactive Widgets&lt;/strong&gt;: Incorporate sliders, buttons, text inputs, and more to make your app interactive.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Updates&lt;/strong&gt;: Automatically update app content when your data or code changes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;: Seamlessly integrate with libraries like Matplotlib, Seaborn, Plotly, and Altair for rich visualizations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deployment Ready&lt;/strong&gt;: Deploy apps effortlessly on various platforms, including Streamlit Cloud, Heroku, and AWS.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Why Use Streamlit for Data Apps?&lt;/h2&gt;
&lt;p&gt;Streamlit offers several advantages that make it an ideal choice for developing data applications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rapid Prototyping&lt;/strong&gt;: Quickly turn ideas into functional apps without worrying about the underlying web infrastructure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pythonic Syntax&lt;/strong&gt;: Write apps entirely in Python, leveraging your existing skills without the need to learn HTML, CSS, or JavaScript.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interactive Data Exploration&lt;/strong&gt;: Enable users to interact with data through widgets, making it easier to explore datasets and model results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Community and Support&lt;/strong&gt;: Benefit from a growing community that contributes to a rich ecosystem of plugins and extensions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open Source&lt;/strong&gt;: Modify and extend the library to suit your needs, with the assurance of ongoing development and support.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By using Streamlit, data scientists can focus on data analysis and model building while providing stakeholders with interactive tools to visualize and understand the results.&lt;/p&gt;
&lt;h2&gt;Overview of the Python Data Science Notebook Docker Image&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;Python Data Science Notebook Docker Image&lt;/strong&gt; is a Docker container designed to streamline your data science workflow. Built from the minimal &lt;code&gt;python:3.9-slim&lt;/code&gt; base image, it includes a comprehensive suite of pre-installed libraries that cater to various aspects of data science, including data manipulation, machine learning, visualization, and database connectivity.&lt;/p&gt;
&lt;h3&gt;Key Features:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Jupyter Notebook Access&lt;/strong&gt;: Run and access Jupyter Notebooks through your web browser, facilitating an interactive coding environment.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pre-Installed Libraries&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Manipulation&lt;/strong&gt;: &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;numpy&lt;/code&gt;, &lt;code&gt;polars&lt;/code&gt;, &lt;code&gt;dask&lt;/code&gt;, &lt;code&gt;ibis&lt;/code&gt;, &lt;code&gt;pyiceberg&lt;/code&gt;, &lt;code&gt;datafusion&lt;/code&gt;, &lt;code&gt;sqlframe&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Machine Learning&lt;/strong&gt;: &lt;code&gt;scikit-learn&lt;/code&gt;, &lt;code&gt;tensorflow&lt;/code&gt;, &lt;code&gt;torch&lt;/code&gt;, &lt;code&gt;xgboost&lt;/code&gt;, &lt;code&gt;lightgbm&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visualization&lt;/strong&gt;: &lt;code&gt;matplotlib&lt;/code&gt;, &lt;code&gt;seaborn&lt;/code&gt;, &lt;code&gt;plotly&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database Access&lt;/strong&gt;: &lt;code&gt;psycopg2-binary&lt;/code&gt;, &lt;code&gt;mysqlclient&lt;/code&gt;, &lt;code&gt;sqlalchemy&lt;/code&gt;, &lt;code&gt;duckdb&lt;/code&gt;, &lt;code&gt;pyarrow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Object Storage&lt;/strong&gt;: &lt;code&gt;boto3&lt;/code&gt;, &lt;code&gt;s3fs&lt;/code&gt;, &lt;code&gt;minio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Utilities&lt;/strong&gt;: &lt;code&gt;openpyxl&lt;/code&gt;, &lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;beautifulsoup4&lt;/code&gt;, &lt;code&gt;lxml&lt;/code&gt;, &lt;code&gt;pyspark&lt;/code&gt;, &lt;code&gt;dremio-simple-query&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Configuration&lt;/strong&gt;: Operates under the user &lt;code&gt;pydata&lt;/code&gt; with the home directory set to &lt;code&gt;/home/pydata&lt;/code&gt;. The working directory is &lt;code&gt;/home/pydata/work&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Port Exposure&lt;/strong&gt;: Exposes port &lt;code&gt;8888&lt;/code&gt; to allow access to the Jupyter Notebook server.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Benefits:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Ensure a consistent development environment across different machines and team members.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Avoid conflicts with other projects and dependencies on your local machine.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Portability&lt;/strong&gt;: Easily move your development environment between systems or deploy it to a server.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extendable&lt;/strong&gt;: Customize the Docker image by adding more libraries or configurations as needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By utilizing this Docker image, you can save time on setup and focus on developing your Streamlit applications, knowing that you have all the necessary tools and libraries at your disposal.&lt;/p&gt;
&lt;h1&gt;Setting Up the Environment&lt;/h1&gt;
&lt;p&gt;To get started with building Streamlit applications using the Python Data Science Notebook Docker Image, you&apos;ll need to set up your environment. This involves installing Docker, pulling the Docker image, running the container, and verifying that Streamlit is installed and functioning correctly.&lt;/p&gt;
&lt;h2&gt;Installing Docker&lt;/h2&gt;
&lt;p&gt;If Docker is not already installed on your machine, follow these steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Download Docker Desktop&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Windows and macOS&lt;/strong&gt;: Visit the &lt;a href=&quot;https://www.docker.com/products/docker-desktop&quot;&gt;Docker Desktop download page&lt;/a&gt; and download the installer for your operating system.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Linux&lt;/strong&gt;: Refer to the official Docker installation guides for &lt;a href=&quot;https://docs.docker.com/engine/install/ubuntu/&quot;&gt;Ubuntu&lt;/a&gt;, &lt;a href=&quot;https://docs.docker.com/engine/install/debian/&quot;&gt;Debian&lt;/a&gt;, &lt;a href=&quot;https://docs.docker.com/engine/install/fedora/&quot;&gt;Fedora&lt;/a&gt;, or your specific distribution.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Install Docker&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run the installer and follow the on-screen instructions.&lt;/li&gt;
&lt;li&gt;For Linux, follow the command-line instructions provided in the installation guide for your distribution.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Verify the Installation&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;Open a terminal or command prompt and run:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker --version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see the Docker version information displayed, confirming that Docker is installed.&lt;/p&gt;
&lt;h2&gt;Pulling the &lt;code&gt;alexmerced/datanotebook&lt;/code&gt; Docker Image&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;alexmerced/datanotebook&lt;/code&gt; Docker image includes a comprehensive Python environment with pre-installed data science libraries.&lt;/p&gt;
&lt;h3&gt;Pull the Docker Image:&lt;/h3&gt;
&lt;p&gt;In your terminal, execute:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker pull alexmerced/datanotebook
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command downloads the image from Docker Hub to your local machine.&lt;/p&gt;
&lt;h3&gt;Confirm the Image is Pulled:&lt;/h3&gt;
&lt;p&gt;List all Docker images on your system:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker images
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see &lt;code&gt;alexmerced/datanotebook&lt;/code&gt; listed among the images.&lt;/p&gt;
&lt;h3&gt;Running the Docker Container with Jupyter Notebook Access&lt;/h3&gt;
&lt;p&gt;Now, run a Docker container from the image and access the Jupyter Notebook server.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Navigate to Your Working Directory&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Open a terminal and change to the directory where you want your Jupyter Notebooks and Streamlit apps to reside:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd /path/to/your/project
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Run the Docker Container:&lt;/h3&gt;
&lt;p&gt;Execute the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker run -p 8888:8888 -p 8501:8501 -v $(pwd):/home/pydata/work alexmerced/datanotebook
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Port Mapping:&lt;/strong&gt; &lt;code&gt;-p 8888:8888&lt;/code&gt; maps the container&apos;s port 8888 to your local machine, allowing access to Jupyter Notebook. &lt;code&gt;-p 8501:8501&lt;/code&gt; maps the container&apos;s port 8501 to your local machine, allowing access to Streamlit apps.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Volume Mounting:&lt;/strong&gt; &lt;code&gt;-v $(pwd):/home/pydata/work&lt;/code&gt; mounts your current directory into the container, enabling file sharing between your host and the container.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Access Jupyter Notebook:&lt;/h3&gt;
&lt;p&gt;Open your web browser and navigate to &lt;code&gt;http://localhost:8888&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You should see the Jupyter Notebook interface without needing a password or token.&lt;/p&gt;
&lt;h3&gt;Verifying the Installation of Streamlit within the Container&lt;/h3&gt;
&lt;p&gt;Ensure that Streamlit is installed and functioning properly inside the Docker container.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Open a New Terminal in Jupyter Notebook.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In the Jupyter interface, click on the New dropdown menu and select Terminal.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In the terminal, run:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;streamlit --version
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If Streamlit is installed, the version number will be displayed.&lt;/p&gt;
&lt;p&gt;If not installed, install it using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install streamlit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a Test Streamlit App:&lt;/p&gt;
&lt;p&gt;In the Jupyter interface, click on New and select Text File.&lt;/p&gt;
&lt;p&gt;Save the file as &lt;code&gt;app.py&lt;/code&gt; in your working directory.&lt;/p&gt;
&lt;p&gt;Add the following code to &lt;code&gt;app.py&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

st.title(&amp;quot;Streamlit Test App&amp;quot;)
st.write(&amp;quot;Congratulations! Streamlit is working inside the Docker container.&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Save the file.&lt;/p&gt;
&lt;h3&gt;Run the Streamlit App:&lt;/h3&gt;
&lt;p&gt;In the Jupyter terminal, execute:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;streamlit run app.py --server.enableCORS false --server.enableXsrfProtection false --server.port 8501 --server.address 0.0.0.0
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Server Flags Explained:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--server.enableCORS false:&lt;/code&gt;&lt;/strong&gt; Disables Cross-Origin Resource Sharing protection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--server.enableXsrfProtection false:&lt;/code&gt;&lt;/strong&gt; Disables Cross-Site Request Forgery protection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--server.port 8501:&lt;/code&gt;&lt;/strong&gt; Runs the app on port 8501.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--server.address 0.0.0.0:&lt;/code&gt;&lt;/strong&gt; Makes the server accessible externally.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Access the Streamlit App:&lt;/h3&gt;
&lt;p&gt;Open a new tab in your web browser and navigate to &lt;code&gt;http://localhost:8501&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You should see the Streamlit app displaying the title and message.&lt;/p&gt;
&lt;h3&gt;Optional: Keep Streamlit Running in the Background:&lt;/h3&gt;
&lt;p&gt;To keep the Streamlit app running without occupying the terminal, you can run it in the background using nohup:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;nohup streamlit run app.py --server.enableCORS false --server.enableXsrfProtection false --server.port 8501 --server.address 0.0.0.0 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Exiting the Docker Container&lt;/h3&gt;
&lt;p&gt;In the Terminal Running the Container:&lt;/p&gt;
&lt;p&gt;Press &lt;code&gt;Ctrl + C&lt;/code&gt; to stop the container.&lt;/p&gt;
&lt;p&gt;Alternatively, Use Docker Commands:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;List running containers&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Stop the container using its Container ID:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker stop &amp;lt;container_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;You&apos;ve successfully set up your environment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Installed Docker (if necessary).&lt;/li&gt;
&lt;li&gt;Pulled the alexmerced/datanotebook Docker image.&lt;/li&gt;
&lt;li&gt;Ran the Docker container with Jupyter Notebook access.&lt;/li&gt;
&lt;li&gt;Verified that Streamlit is installed and operational within the container.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With this setup, you&apos;re ready to develop and run Streamlit applications in a consistent and isolated environment, leveraging the powerful tools provided by the Docker image.&lt;/p&gt;
&lt;h1&gt;Getting Started with Streamlit&lt;/h1&gt;
&lt;p&gt;With your environment set up, it&apos;s time to dive into Streamlit and start building interactive applications. This section will guide you through creating your first Streamlit app, understanding the basic structure of a Streamlit script, and running Streamlit apps from within the Jupyter Notebook provided by the Docker container.&lt;/p&gt;
&lt;h2&gt;Creating Your First Streamlit App&lt;/h2&gt;
&lt;p&gt;Let&apos;s begin by creating a simple Streamlit application that displays text and a chart.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a New Python Script&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the Jupyter Notebook interface, click on &lt;code&gt;New&lt;/code&gt; and select &lt;code&gt;Text File&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Save the file as &lt;code&gt;app.py&lt;/code&gt; in your working directory (&lt;code&gt;/home/pydata/work&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write the Streamlit Code&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;Open &lt;code&gt;app.py&lt;/code&gt; and add the following code:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
import pandas as pd
import numpy as np

st.title(&amp;quot;My First Streamlit App&amp;quot;)

st.write(&amp;quot;Welcome to my first Streamlit application!&amp;quot;)

# Create a random dataframe
df = pd.DataFrame(
    np.random.randn(20, 3),
    columns=[&apos;Column A&apos;, &apos;Column B&apos;, &apos;Column C&apos;]
)

st.write(&amp;quot;Here is a random dataframe:&amp;quot;)
st.dataframe(df)

st.write(&amp;quot;Line chart of the data:&amp;quot;)
st.line_chart(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Explanation:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Imports necessary libraries.&lt;/li&gt;
&lt;li&gt;Sets the title and writes introductory text.&lt;/li&gt;
&lt;li&gt;Generates a random DataFrame.&lt;/li&gt;
&lt;li&gt;Displays the DataFrame and a line chart based on the data.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Save the Script:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Ensure that you save &lt;code&gt;app.py&lt;/code&gt; after adding the code.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Understanding the Basic Structure of a Streamlit Script&lt;/h3&gt;
&lt;p&gt;A Streamlit script is a standard Python script that uses the streamlit library&apos;s functions to create interactive elements.&lt;/p&gt;
&lt;h3&gt;Import Streamlit:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Set the Title and Headers:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;st.title(&amp;quot;App Title&amp;quot;)
st.header(&amp;quot;This is a header&amp;quot;)
st.subheader(&amp;quot;This is a subheader&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Write Text:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;st.text(&amp;quot;This is a simple text.&amp;quot;)
st.markdown(&amp;quot;This is a text with **markdown** formatting.&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Display Data:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;st.dataframe(df)  # Displays an interactive table
st.table(df)      # Displays a static table
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Display Charts:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;st.line_chart(data)
st.bar_chart(data)
st.area_chart(data)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Add Interactive Widgets:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;name = st.text_input(&amp;quot;Enter your name:&amp;quot;)
st.write(f&amp;quot;Hello, {name}!&amp;quot;)

age = st.slider(&amp;quot;Select your age:&amp;quot;, 0, 100)
st.write(f&amp;quot;You are {age} years old.&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Layout Elements:&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;with st.sidebar:
    st.write(&amp;quot;This is the sidebar.&amp;quot;)

col1, col2 = st.columns(2)
col1.write(&amp;quot;Content in column 1&amp;quot;)
col2.write(&amp;quot;Content in column 2&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Running Streamlit Apps from Within the Jupyter Notebook&lt;/h3&gt;
&lt;p&gt;To run your Streamlit app within the Docker container and access it from your host machine:&lt;/p&gt;
&lt;p&gt;Open a Terminal in Jupyter Notebook:&lt;/p&gt;
&lt;p&gt;In the Jupyter interface, click on New and select Terminal.&lt;/p&gt;
&lt;p&gt;Navigate to the Working Directory:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd /home/pydata/work
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the Streamlit App:&lt;/p&gt;
&lt;p&gt;Execute the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;streamlit run app.py --server.enableCORS false --server.enableXsrfProtection false --server.port 8501 --server.address 0.0.0.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation of Flags:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--server.enableCORS false&lt;/code&gt;:&lt;/strong&gt; Disables Cross-Origin Resource Sharing protection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--server.enableXsrfProtection false&lt;/code&gt;:&lt;/strong&gt; Disables Cross-Site Request Forgery protection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--server.port 8501&lt;/code&gt;:&lt;/strong&gt; Sets the port to 8501.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;code&gt;--server.address 0.0.0.0&lt;/code&gt;:&lt;/strong&gt; Binds to all interfaces so the app is reachable from outside the container.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Access the Streamlit App:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Open your web browser and navigate to &lt;code&gt;http://localhost:8501&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You should see your Streamlit app running.&lt;/p&gt;
&lt;h3&gt;Interact with the App:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Modify &lt;code&gt;app.py&lt;/code&gt; to add more features or interactive elements.&lt;/li&gt;
&lt;li&gt;Save the changes, and the app will automatically reload in the browser.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Tips for Running Streamlit in Docker&lt;/h3&gt;
&lt;h4&gt;Expose the Correct Port:&lt;/h4&gt;
&lt;p&gt;When running the Docker container, ensure you expose the port used by Streamlit. If you use port 8501, run the container with:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker run -p 8888:8888 -p 8501:8501 -v $(pwd):/home/pydata/work alexmerced/datanotebook
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Running Multiple Apps:&lt;/h4&gt;
&lt;p&gt;Use different ports for each app and expose them accordingly.&lt;/p&gt;
&lt;h4&gt;Background Execution:&lt;/h4&gt;
&lt;p&gt;To run the Streamlit app without tying up the terminal, use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;nohup streamlit run app.py --server.enableCORS false --server.enableXsrfProtection false --server.port 8501 --server.address 0.0.0.0 &amp;amp;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This runs the app in the background and writes its logs to &lt;code&gt;nohup.out&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;In this section, you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Created your first Streamlit app using the pre-configured Docker environment.&lt;/li&gt;
&lt;li&gt;Learned about the basic structure and components of a Streamlit script.&lt;/li&gt;
&lt;li&gt;Ran the Streamlit app from within the Jupyter Notebook environment.&lt;/li&gt;
&lt;li&gt;Accessed and interacted with the app via your web browser.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With these foundational skills, you&apos;re ready to explore more advanced features of Streamlit to build sophisticated data applications.&lt;/p&gt;
&lt;h1&gt;Building Interactive Data Visualizations&lt;/h1&gt;
&lt;p&gt;Data visualization is a crucial aspect of data analysis and communication. Streamlit simplifies the process of creating interactive and dynamic visualizations that can help users explore and understand data more effectively. In this section, we&apos;ll explore how to use Streamlit&apos;s built-in functions and integrate popular visualization libraries to build interactive data visualizations.&lt;/p&gt;
&lt;h2&gt;Using Streamlit&apos;s Built-in Chart Functions&lt;/h2&gt;
&lt;p&gt;Streamlit provides easy-to-use functions for creating basic charts directly from data structures like Pandas DataFrames and NumPy arrays.&lt;/p&gt;
&lt;h3&gt;Line Chart&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
import pandas as pd
import numpy as np

# Generate random data
data = np.random.randn(100, 3)
columns = [&apos;Feature A&apos;, &apos;Feature B&apos;, &apos;Feature C&apos;]
df = pd.DataFrame(data, columns=columns)

# Display line chart
st.line_chart(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: The st.line_chart() function takes a DataFrame or array-like object and renders an interactive line chart.&lt;/p&gt;
&lt;h3&gt;Bar Chart&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Display bar chart
st.bar_chart(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: st.bar_chart() displays a bar chart. It&apos;s useful for categorical data or comparing different groups.&lt;/p&gt;
&lt;h3&gt;Area Chart&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Display area chart
st.area_chart(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: st.area_chart() creates an area chart, which is similar to a line chart but with the area below the line filled.&lt;/p&gt;
&lt;h3&gt;Customizing Charts with Altair&lt;/h3&gt;
&lt;p&gt;For more advanced visualizations, Streamlit supports libraries like Altair, which provides a declarative statistical visualization library for Python.&lt;/p&gt;
&lt;h4&gt;Creating an Altair Chart&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import altair as alt

# Create an Altair chart
chart = alt.Chart(df.reset_index()).mark_circle(size=60).encode(
    x=&apos;index&apos;,
    y=&apos;Feature A&apos;,
    color=&apos;Feature B&apos;,
    tooltip=[&apos;Feature A&apos;, &apos;Feature B&apos;, &apos;Feature C&apos;]
).interactive()

st.altair_chart(chart, use_container_width=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: This code creates an interactive scatter plot using Altair, where you can hover over points to see tooltips.&lt;/p&gt;
&lt;h3&gt;Interactive Widgets for User Input&lt;/h3&gt;
&lt;p&gt;Streamlit allows you to add widgets that enable users to interact with your visualizations.&lt;/p&gt;
&lt;h4&gt;Adding a Slider&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Slider to select number of data points
num_points = st.slider(&apos;Select number of data points&apos;, min_value=10, max_value=100, value=50)

# Generate data based on slider
data = np.random.randn(num_points, 3)
df = pd.DataFrame(data, columns=columns)

# Display updated chart
st.line_chart(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: The slider widget allows users to select the number of data points, and the chart updates accordingly.&lt;/p&gt;
&lt;h4&gt;Selectbox for Options&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Selectbox to choose a feature
feature = st.selectbox(&apos;Select a feature to display&apos;, columns)

# Display the selected feature
st.line_chart(df[feature])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: The selectbox lets users choose which feature to visualize.&lt;/p&gt;
&lt;h3&gt;Integrating Plotly for Advanced Visualizations&lt;/h3&gt;
&lt;p&gt;Plotly is another powerful library for creating interactive graphs.&lt;/p&gt;
&lt;h4&gt;Plotly Example&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import plotly.express as px

# Marker sizes must be non-negative, so use the absolute value of Feature C
df[&apos;Size&apos;] = df[&apos;Feature C&apos;].abs()

# Create a Plotly figure
fig = px.scatter(df, x=&apos;Feature A&apos;, y=&apos;Feature B&apos;, size=&apos;Size&apos;, color=&apos;Feature C&apos;)

# Display the Plotly figure in Streamlit
st.plotly_chart(fig, use_container_width=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: This code creates an interactive scatter plot with Plotly, which includes zooming, panning, and tooltips.&lt;/p&gt;
&lt;h3&gt;Combining Widgets and Visualizations&lt;/h3&gt;
&lt;p&gt;You can combine multiple widgets and charts to create a rich interactive experience.&lt;/p&gt;
&lt;h4&gt;Example: Interactive Data Filtering&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Multiselect to choose features
selected_features = st.multiselect(&apos;Select features to visualize&apos;, columns, default=columns)

# Checkbox to toggle data normalization
normalize = st.checkbox(&apos;Normalize data&apos;)

# Process data based on user input
if normalize:
    df_normalized = (df - df.mean()) / df.std()
    data_to_plot = df_normalized[selected_features]
else:
    data_to_plot = df[selected_features]

# Display line chart of selected features
st.line_chart(data_to_plot)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Users can select which features to visualize and whether to normalize the data, and the chart updates accordingly.&lt;/p&gt;
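&lt;p&gt;The normalization the checkbox toggles is plain pandas and can be sanity-checked outside Streamlit. A minimal sketch, with placeholder column names:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Stand-in for the app's random DataFrame
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=["A", "B", "C"])

# The same z-score normalization the checkbox applies
df_normalized = (df - df.mean()) / df.std()

# Each normalized column should have mean ~0 and standard deviation ~1
print((df_normalized.mean().abs() < 1e-9).all())       # True
print(((df_normalized.std() - 1).abs() < 1e-9).all())  # True
```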
&lt;h3&gt;Best Practices for Interactive Visualizations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Limit Data Size:&lt;/strong&gt; Large datasets can slow down your app. Consider sampling or aggregating data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use Caching:&lt;/strong&gt; Use the &lt;code&gt;@st.cache_data&lt;/code&gt; decorator to cache data loading and computation functions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Provide Instructions:&lt;/strong&gt; Use &lt;code&gt;st.markdown()&lt;/code&gt; or &lt;code&gt;st.write()&lt;/code&gt; to guide users on how to interact with your app.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimize Layout:&lt;/strong&gt; Organize widgets and charts using columns and expanders for a clean interface.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Example of Layout Optimization&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Create columns
col1, col2 = st.columns(2)

with col1:
    st.header(&apos;User Inputs&apos;)
    # Add widgets here
    num_points = st.slider(&apos;Number of points&apos;, 10, 100, 50)
    feature = st.selectbox(&apos;Feature&apos;, columns)

with col2:
    st.header(&apos;Visualization&apos;)
    # Generate and display chart
    data = np.random.randn(num_points, len(columns))
    df = pd.DataFrame(data, columns=columns)
    st.line_chart(df[feature])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: This layout separates user inputs and visualizations into two columns, making the app more organized.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;In this section, you&apos;ve learned how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use Streamlit&apos;s built-in chart functions to create quick visualizations.&lt;/li&gt;
&lt;li&gt;Customize charts using Altair and Plotly for more advanced visualizations.&lt;/li&gt;
&lt;li&gt;Add interactive widgets like sliders and selectboxes to make your visualizations dynamic.&lt;/li&gt;
&lt;li&gt;Combine widgets and charts to build a user-friendly data exploration tool.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging these features, you can create powerful interactive applications that make data exploration and analysis more accessible to your audience.&lt;/p&gt;
&lt;h1&gt;Advanced Streamlit Features&lt;/h1&gt;
&lt;p&gt;As you become more familiar with Streamlit, you&apos;ll discover a wealth of advanced features that allow you to build more sophisticated and powerful applications. In this section, we&apos;ll delve into some of these capabilities, including state management, dynamic content creation, file handling, and performance optimization through caching.&lt;/p&gt;
&lt;h2&gt;State Management with &lt;code&gt;st.session_state&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Streamlit runs your script from top to bottom every time a user interacts with a widget. To maintain state across these reruns, you can use &lt;code&gt;st.session_state&lt;/code&gt;, which is a dictionary-like object that persists throughout the user&apos;s session.&lt;/p&gt;
&lt;h3&gt;Example: Counter Application&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

# Initialize counter in session state
if &apos;counter&apos; not in st.session_state:
    st.session_state.counter = 0

# Increment counter on button click
if st.button(&apos;Increment&apos;):
    st.session_state.counter += 1

st.write(f&amp;quot;Counter value: {st.session_state.counter}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: The counter value is stored in st.session_state.counter, ensuring it persists across interactions.&lt;/p&gt;
&lt;h3&gt;Dynamic Content with st.expander and st.tabs&lt;/h3&gt;
&lt;p&gt;Streamlit provides layout elements to organize content and improve user experience.&lt;/p&gt;
&lt;h4&gt;Using st.expander&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

st.write(&amp;quot;This is visible content&amp;quot;)

with st.expander(&amp;quot;Click to expand&amp;quot;):
    st.write(&amp;quot;This content is hidden by default&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: st.expander creates a collapsible section that users can expand or collapse.&lt;/p&gt;
&lt;h4&gt;Using st.tabs&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

tab1, tab2 = st.tabs([&amp;quot;Tab 1&amp;quot;, &amp;quot;Tab 2&amp;quot;])

with tab1:
    st.write(&amp;quot;Content in Tab 1&amp;quot;)

with tab2:
    st.write(&amp;quot;Content in Tab 2&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: st.tabs allows you to organize content into tabs for better navigation.&lt;/p&gt;
&lt;h3&gt;Uploading and Handling Files with st.file_uploader&lt;/h3&gt;
&lt;p&gt;Allow users to upload files directly into your app for processing.&lt;/p&gt;
&lt;h4&gt;Example: CSV File Uploader&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
import pandas as pd

uploaded_file = st.file_uploader(&amp;quot;Choose a CSV file&amp;quot;, type=&amp;quot;csv&amp;quot;)

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.write(&amp;quot;Uploaded Data:&amp;quot;)
    st.dataframe(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Users can upload a CSV file, which the app reads and displays as a DataFrame.&lt;/p&gt;
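&lt;p&gt;The object returned by &lt;code&gt;st.file_uploader&lt;/code&gt; is file-like, so &lt;code&gt;pd.read_csv&lt;/code&gt; accepts it directly, just as it accepts an in-memory buffer. A small sketch with hypothetical data:&lt;/p&gt;

```python
import io
import pandas as pd

# An in-memory buffer standing in for the uploaded file object
csv_buffer = io.StringIO("name,score\nada,90\ngrace,95\n")
df = pd.read_csv(csv_buffer)

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['name', 'score']
```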
&lt;h3&gt;Caching with @st.cache_data for Performance Optimization&lt;/h3&gt;
&lt;p&gt;Heavy computations or data loading can slow down your app. Use caching to store results and avoid redundant processing.&lt;/p&gt;
&lt;h4&gt;Using @st.cache_data&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
import pandas as pd

@st.cache_data
def load_data(url):
    return pd.read_csv(url)

data_url = &apos;https://path-to-large-dataset.csv&apos;
df = load_data(data_url)
st.write(&amp;quot;Data loaded successfully&amp;quot;)
st.dataframe(df.head())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: The @st.cache_data decorator caches the load_data function&apos;s output, improving performance on subsequent runs.&lt;/p&gt;
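&lt;p&gt;Conceptually, &lt;code&gt;@st.cache_data&lt;/code&gt; behaves like ordinary memoization: calling the function again with the same arguments returns the stored result instead of re-running it. The standard library&apos;s &lt;code&gt;functools.lru_cache&lt;/code&gt; illustrates the idea (a rough analogy, not Streamlit&apos;s implementation):&lt;/p&gt;

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=None)
def load_data(url):
    # Stand-in for an expensive pd.read_csv(url) call
    global call_count
    call_count += 1
    return f"data from {url}"

load_data("https://example.com/data.csv")
load_data("https://example.com/data.csv")  # served from cache

print(call_count)  # 1
```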
&lt;h3&gt;Customizing the App Layout&lt;/h3&gt;
&lt;p&gt;Enhance user experience by customizing your app&apos;s layout and appearance.&lt;/p&gt;
&lt;h4&gt;Setting Page Configuration&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

st.set_page_config(
    page_title=&amp;quot;Advanced Streamlit Features&amp;quot;,
    page_icon=&amp;quot;🚀&amp;quot;,
    layout=&amp;quot;wide&amp;quot;,
    initial_sidebar_state=&amp;quot;expanded&amp;quot;,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: st.set_page_config sets global configurations like the page title, icon, layout, and sidebar state.&lt;/p&gt;
&lt;h3&gt;Using Columns and Containers&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

col1, col2 = st.columns(2)

with col1:
    st.header(&amp;quot;Column 1&amp;quot;)
    st.write(&amp;quot;Content for the first column&amp;quot;)

with col2:
    st.header(&amp;quot;Column 2&amp;quot;)
    st.write(&amp;quot;Content for the second column&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Columns help organize content side by side.&lt;/p&gt;
&lt;h3&gt;Theming and Styling&lt;/h3&gt;
&lt;p&gt;Apply custom themes to match your app&apos;s branding or preferred aesthetics.&lt;/p&gt;
&lt;h4&gt;Applying a Custom Theme&lt;/h4&gt;
&lt;p&gt;Create a .streamlit/config.toml file in your app directory with the following content:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[theme]
primaryColor=&amp;quot;#d33682&amp;quot;
backgroundColor=&amp;quot;#002b36&amp;quot;
secondaryBackgroundColor=&amp;quot;#586e75&amp;quot;
textColor=&amp;quot;#ffffff&amp;quot;
font=&amp;quot;sans serif&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: The theme settings adjust the app&apos;s color scheme and font.&lt;/p&gt;
&lt;h3&gt;Interactive Widgets for Advanced User Input&lt;/h3&gt;
&lt;p&gt;Streamlit offers a variety of widgets for complex user interactions.&lt;/p&gt;
&lt;h4&gt;Date Input and Time Input&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

date = st.date_input(&amp;quot;Select a date&amp;quot;)
time = st.time_input(&amp;quot;Select a time&amp;quot;)

st.write(f&amp;quot;You selected {date} at {time}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Allows users to input dates and times.&lt;/p&gt;
&lt;h4&gt;Color Picker&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

color = st.color_picker(&apos;Pick A Color&apos;, &apos;#00f900&apos;)
st.write(&apos;The current color is&apos;, color)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Users can select a color, which can be used in visualizations or styling.&lt;/p&gt;
&lt;h3&gt;Advanced Callbacks and Event Handling&lt;/h3&gt;
&lt;p&gt;Respond to user interactions with callbacks.&lt;/p&gt;
&lt;h4&gt;Using Button Callbacks&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

def on_button_click():
    st.write(&amp;quot;Button was clicked!&amp;quot;)

st.button(&amp;quot;Click Me&amp;quot;, on_click=on_button_click)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: The on_click parameter specifies a function to execute when the button is clicked.&lt;/p&gt;
&lt;h3&gt;Integrating with External APIs&lt;/h3&gt;
&lt;p&gt;Fetch and display data from external sources.&lt;/p&gt;
&lt;h4&gt;Example: Fetching Data from an API&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
import requests

st.write(&amp;quot;Fetch data from an API&amp;quot;)

response = requests.get(&apos;https://api.example.com/data&apos;, timeout=10)
if response.status_code == 200:
    data = response.json()
    st.write(data)
else:
    st.error(&amp;quot;Failed to fetch data&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Uses the requests library to fetch data from an API and display it.&lt;/p&gt;
&lt;h3&gt;Reading URL Query Parameters&lt;/h3&gt;
&lt;p&gt;You can read query parameters from the app&apos;s URL to control its behavior dynamically. (Newer Streamlit versions expose this as &lt;code&gt;st.query_params&lt;/code&gt;.)&lt;/p&gt;
&lt;h4&gt;Using st.experimental_get_query_params&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

params = st.experimental_get_query_params()
st.write(&amp;quot;Query parameters:&amp;quot;, params)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Access query parameters from the URL to control app behavior dynamically.&lt;/p&gt;
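&lt;p&gt;These parameters are ordinary URL query strings; the standard library&apos;s &lt;code&gt;urllib.parse&lt;/code&gt; shows the same dict-of-lists shape the function returns (the URL below is hypothetical):&lt;/p&gt;

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical app URL with query parameters
url = "http://localhost:8501/?page=report&year=2024"

# parse_qs maps each parameter name to a list of values
params = parse_qs(urlsplit(url).query)
print(params)  # {'page': ['report'], 'year': ['2024']}
```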
&lt;h3&gt;Modularizing Code with Components&lt;/h3&gt;
&lt;p&gt;Break down your app into reusable components.&lt;/p&gt;
&lt;h4&gt;Creating a Custom Component&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# components.py
import streamlit as st

def display_header():
    st.title(&amp;quot;Advanced Streamlit Features&amp;quot;)
    st.write(&amp;quot;This is a custom component&amp;quot;)

# main app
import streamlit as st
from components import display_header

display_header()
st.write(&amp;quot;Main app content&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Organize code by splitting it into modules for better maintainability.&lt;/p&gt;
&lt;h3&gt;Localization and Internationalization&lt;/h3&gt;
&lt;p&gt;Make your app accessible to a global audience.&lt;/p&gt;
&lt;h4&gt;Setting the Language&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st

st.write(&amp;quot;Hello, World!&amp;quot;)

# Use gettext or other localization libraries for translations
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: While Streamlit doesn&apos;t provide built-in localization, you can use Python&apos;s localization libraries.&lt;/p&gt;
&lt;h3&gt;Accessibility Features&lt;/h3&gt;
&lt;p&gt;Ensure your app is usable by people with disabilities.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Semantic HTML:&lt;/strong&gt; Streamlit automatically generates accessible HTML elements.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provide Alt Text:&lt;/strong&gt; When displaying images, use the caption parameter.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;st.image(&apos;image.png&apos;, caption=&apos;Descriptive text&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;In this section, we&apos;ve explored several advanced features of Streamlit that empower you to build more interactive and efficient applications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;State Management:&lt;/strong&gt; Use st.session_state to preserve data across user interactions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamic Layouts:&lt;/strong&gt; Organize content with expanders, tabs, columns, and containers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File Handling:&lt;/strong&gt; Allow users to upload and interact with files directly in the app.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance Optimization:&lt;/strong&gt; Improve app speed with caching decorators like @st.cache_data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customization:&lt;/strong&gt; Enhance the look and feel with custom themes and page configurations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advanced Widgets:&lt;/strong&gt; Utilize a variety of input widgets for richer user interactions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;External Integrations:&lt;/strong&gt; Connect your app to external APIs and services.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code Organization:&lt;/strong&gt; Modularize your code for better readability and maintenance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Global Reach:&lt;/strong&gt; Consider localization and accessibility to reach a wider audience.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By mastering these advanced features, you can create sophisticated Streamlit applications that provide a seamless and engaging user experience.&lt;/p&gt;
&lt;h1&gt;Integrating Machine Learning Models&lt;/h1&gt;
&lt;p&gt;Streamlit excels at making machine learning models accessible through interactive web applications. In this section, we&apos;ll explore how to integrate machine learning models into your Streamlit apps using the pre-installed libraries in the Python Data Science Notebook Docker Image, such as TensorFlow, PyTorch, and scikit-learn.&lt;/p&gt;
&lt;h2&gt;Loading Pre-trained Models with TensorFlow and PyTorch&lt;/h2&gt;
&lt;p&gt;The Docker image comes with TensorFlow and PyTorch installed, allowing you to work with complex neural network models.&lt;/p&gt;
&lt;h3&gt;Using TensorFlow&lt;/h3&gt;
&lt;h4&gt;Loading a Pre-trained Model&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
import tensorflow as tf

# Load a pre-trained model, e.g., MobileNetV2
model = tf.keras.applications.MobileNetV2(weights=&apos;imagenet&apos;)

st.write(&amp;quot;Model loaded successfully.&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Making Predictions&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions
from PIL import Image
import numpy as np

uploaded_file = st.file_uploader(&amp;quot;Upload an image&amp;quot;, type=[&amp;quot;jpg&amp;quot;, &amp;quot;jpeg&amp;quot;, &amp;quot;png&amp;quot;])

if uploaded_file is not None:
    # Load and preprocess the image
    image = Image.open(uploaded_file)
    st.image(image, caption=&apos;Uploaded Image&apos;, use_column_width=True)

    img = image.resize((224, 224))
    img_array = np.array(img)
    img_array = preprocess_input(img_array)
    img_array = np.expand_dims(img_array, axis=0)

    # Make prediction
    predictions = model.predict(img_array)
    results = decode_predictions(predictions, top=3)[0]

    # Display predictions
    st.write(&amp;quot;Top Predictions:&amp;quot;)
    for i, res in enumerate(results):
        st.write(f&amp;quot;{i+1}. {res[1]}: {round(res[2]*100, 2)}%&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Users can upload an image, and the app displays the top predictions from the pre-trained MobileNetV2 model.&lt;/p&gt;
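&lt;p&gt;The key preprocessing step is adding a batch dimension, since Keras models expect input shaped &lt;code&gt;(batch, height, width, channels)&lt;/code&gt;. A quick shape check with a placeholder array:&lt;/p&gt;

```python
import numpy as np

# A stand-in for the resized 224x224 RGB image array
img_array = np.zeros((224, 224, 3), dtype=np.float32)

# np.expand_dims inserts a leading batch axis of size 1
batch = np.expand_dims(img_array, axis=0)

print(img_array.shape)  # (224, 224, 3)
print(batch.shape)      # (1, 224, 224, 3)
```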
&lt;h3&gt;Using PyTorch&lt;/h3&gt;
&lt;h4&gt;Loading a Pre-trained Model&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
import torch
from torchvision import models, transforms

# Load a pre-trained ResNet model (newer torchvision versions use the weights argument)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

st.write(&amp;quot;PyTorch model loaded successfully.&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Making Predictions&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from PIL import Image
import torchvision.transforms as T

uploaded_file = st.file_uploader(&amp;quot;Upload an image&amp;quot;, type=[&amp;quot;jpg&amp;quot;, &amp;quot;jpeg&amp;quot;, &amp;quot;png&amp;quot;])

if uploaded_file is not None:
    # Load and preprocess the image
    image = Image.open(uploaded_file)
    st.image(image, caption=&apos;Uploaded Image&apos;, use_column_width=True)

    preprocess = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(
            mean=[0.485, 0.456, 0.406], 
            std=[0.229, 0.224, 0.225]
        )
    ])
    img_t = preprocess(image)
    batch_t = torch.unsqueeze(img_t, 0)

    # Make prediction
    with torch.no_grad():
        out = model(batch_t)
    probabilities = torch.nn.functional.softmax(out[0], dim=0)

    # Load labels
    with open(&amp;quot;imagenet_classes.txt&amp;quot;) as f:
        labels = [line.strip() for line in f.readlines()]

    # Show top 3 predictions
    top3_prob, top3_catid = torch.topk(probabilities, 3)
    st.write(&amp;quot;Top Predictions:&amp;quot;)
    for i in range(top3_prob.size(0)):
        st.write(f&amp;quot;{i+1}. {labels[top3_catid[i]]}: {round(top3_prob[i].item()*100, 2)}%&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Ensure that the imagenet_classes.txt file is available in your working directory.&lt;/p&gt;
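&lt;p&gt;The softmax and top-k steps are easy to sanity-check with plain NumPy, using hypothetical logits in place of the model output:&lt;/p&gt;

```python
import numpy as np

# Hypothetical logits for 5 classes
logits = np.array([2.0, 1.0, 0.1, -1.0, 3.0])

# Softmax: exponentiate (shifted by the max for numerical stability) and normalize
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

# Top-3 class indices, highest probability first
top3 = np.argsort(probs)[::-1][:3]

print(round(probs.sum(), 6))  # 1.0
print(top3.tolist())          # [4, 0, 1]
```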
&lt;h3&gt;Building a Simple Prediction App with scikit-learn&lt;/h3&gt;
&lt;p&gt;Let&apos;s build a simple regression app using scikit-learn.&lt;/p&gt;
&lt;h4&gt;Training a Model&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Load dataset (load_boston was removed in scikit-learn 1.2, so we use the
# California housing dataset instead)
data = fetch_california_housing()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = RandomForestRegressor()
model.fit(X_train, y_train)

st.write(&amp;quot;Model trained successfully.&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Making Predictions with User Input&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import numpy as np

st.header(&amp;quot;California Housing Price Prediction&amp;quot;)

# Feature inputs
MedInc = st.number_input(&apos;Median income in block group (tens of thousands of dollars)&apos;, min_value=0.0, value=3.0)
HouseAge = st.number_input(&apos;Median house age in block group&apos;, min_value=0.0, value=20.0)
# ... add inputs for other features

# For brevity, we&apos;ll use default values for the rest of the features
input_features = np.array([[MedInc, HouseAge] + [0]*(X.shape[1]-2)])

# Predict (the target is the median house value in units of $100,000)
prediction = model.predict(input_features)
st.write(f&amp;quot;Predicted median house value: ${prediction[0]*100000:,.2f}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: Users can input values for the first two features, and the app predicts the median house value using defaults for the remaining features.&lt;/p&gt;
&lt;h3&gt;Visualizing Model Outputs and Performance Metrics&lt;/h3&gt;
&lt;p&gt;Visualizations help in understanding model performance.&lt;/p&gt;
&lt;h4&gt;Displaying Metrics&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from sklearn.metrics import mean_squared_error, r2_score

# Predict on test set
y_pred = model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display metrics
st.write(&amp;quot;Model Performance on Test Set:&amp;quot;)
st.write(f&amp;quot;Mean Squared Error: {mse:.2f}&amp;quot;)
st.write(f&amp;quot;R² Score: {r2:.2f}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Plotting Actual vs. Predicted Values&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pandas as pd

df = pd.DataFrame({&apos;Actual&apos;: y_test, &apos;Predicted&apos;: y_pred})

st.write(&amp;quot;Actual vs. Predicted Values&amp;quot;)
st.line_chart(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Explanation: The line chart shows how closely the model&apos;s predictions match the actual values.&lt;/p&gt;
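&lt;p&gt;The metrics above follow simple formulas, which you can verify by hand with NumPy on a few hypothetical values:&lt;/p&gt;

```python
import numpy as np

# Hypothetical actual and predicted values
y_test = np.array([3.0, 2.5, 4.0, 5.0])
y_pred = np.array([2.8, 2.7, 3.6, 5.2])

# Mean squared error: average squared residual
mse = np.mean((y_test - y_pred) ** 2)

# R^2: 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(mse, 4))  # 0.07
print(round(r2, 4))   # 0.9241
```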
&lt;h3&gt;Tips for Integrating Machine Learning Models in Streamlit&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Model Serialization:&lt;/strong&gt; For complex models, consider saving and loading models using joblib or pickle to avoid retraining every time.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import joblib

# Save model
joblib.dump(model, &apos;model.joblib&apos;)

# Load model
model = joblib.load(&apos;model.joblib&apos;)
&lt;/code&gt;&lt;/pre&gt;
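&lt;p&gt;The save/load round trip works the same way with the standard library&apos;s &lt;code&gt;pickle&lt;/code&gt; for simple objects; the model below is just a placeholder dictionary:&lt;/p&gt;

```python
import os
import pickle
import tempfile

# A stand-in for a trained model: any picklable Python object follows the same pattern
model_stub = {"coef": [0.5, -1.2], "intercept": 3.0}

# Save to disk, then load it back (joblib.dump/joblib.load work analogously)
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model_stub, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)  # True
```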
&lt;p&gt;&lt;strong&gt;Use Caching for Models:&lt;/strong&gt; Cache the model loading or training functions to improve performance.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import joblib

@st.cache_resource
def load_model():
    # Load the serialized model once; Streamlit reuses the cached object across reruns
    return joblib.load(&apos;model.joblib&apos;)

model = load_model()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Handle Large Models:&lt;/strong&gt; Be mindful of resource limitations. Use efficient data structures and consider offloading heavy computations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Provide Clear Instructions:&lt;/strong&gt; Guide users on how to interact with the app, especially when expecting specific input formats.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;In this section, you&apos;ve learned how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Load and use pre-trained models with TensorFlow and PyTorch in your Streamlit apps.&lt;/li&gt;
&lt;li&gt;Build a simple prediction app using scikit-learn.&lt;/li&gt;
&lt;li&gt;Collect user input to make predictions and display results.&lt;/li&gt;
&lt;li&gt;Visualize model outputs and performance metrics to evaluate model effectiveness.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By integrating machine learning models into your Streamlit applications, you can create powerful tools that make complex models accessible to end-users in an interactive and user-friendly manner.&lt;/p&gt;
&lt;h1&gt;Database Connectivity&lt;/h1&gt;
&lt;p&gt;In many data science projects, interacting with databases is essential for retrieving, processing, and storing data. Streamlit, combined with the powerful libraries included in the Python Data Science Notebook Docker Image, makes it straightforward to connect to various databases and integrate them into your applications. In this section, we&apos;ll explore how to connect to databases using &lt;code&gt;sqlalchemy&lt;/code&gt;, &lt;code&gt;psycopg2&lt;/code&gt;, and specifically how to interact with &lt;strong&gt;Dremio&lt;/strong&gt; using the &lt;code&gt;dremio-simple-query&lt;/code&gt; library.&lt;/p&gt;
&lt;h2&gt;Connecting to Dremio Using &lt;code&gt;dremio-simple-query&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Dremio&lt;/strong&gt; is a data lakehouse platform that enables you to govern, join, and accelerate queries across various data sources such as Iceberg, Delta Lake, S3, JSON, CSV, RDBMS, and more. The &lt;code&gt;dremio-simple-query&lt;/code&gt; library simplifies querying a Dremio source using Apache Arrow Flight, providing performant data retrieval for analytics.&lt;/p&gt;
&lt;h3&gt;Installing the &lt;code&gt;dremio-simple-query&lt;/code&gt; Library&lt;/h3&gt;
&lt;p&gt;First, ensure that the &lt;code&gt;dremio-simple-query&lt;/code&gt; library is installed in your environment:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install dremio-simple-query
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Setting Up the Connection to Dremio&lt;/h3&gt;
&lt;p&gt;To connect to Dremio, you&apos;ll need to obtain your Dremio Arrow Flight endpoint and an authentication token.&lt;/p&gt;
&lt;h4&gt;Obtaining the Arrow Flight Endpoint&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud (NA):&lt;/strong&gt; &lt;code&gt;grpc+tls://data.dremio.cloud:443&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud (EU):&lt;/strong&gt; &lt;code&gt;grpc+tls://data.eu.dremio.cloud:443&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Software (SSL):&lt;/strong&gt; &lt;code&gt;grpc+tls://&amp;lt;ip-address&amp;gt;:32010&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Software (No SSL):&lt;/strong&gt; &lt;code&gt;grpc://&amp;lt;ip-address&amp;gt;:32010&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Getting Your Authentication Token&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dremio Cloud:&lt;/strong&gt; Obtain the token from the Dremio interface or via the REST API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dremio Software:&lt;/strong&gt; Obtain the token using the REST API.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can use the &lt;code&gt;get_token&lt;/code&gt; function from the &lt;code&gt;dremio-simple-query&lt;/code&gt; library to retrieve the token programmatically.&lt;/p&gt;
&lt;h3&gt;Connecting to Dremio&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import streamlit as st
from dremio_simple_query.connect import get_token, DremioConnection
from os import getenv
from dotenv import load_dotenv

# Load environment variables from a .env file (optional)
load_dotenv()

# Retrieve Dremio credentials and endpoints
username = st.secrets[&amp;quot;dremio_username&amp;quot;]
password = st.secrets[&amp;quot;dremio_password&amp;quot;]
arrow_endpoint = st.secrets[&amp;quot;dremio_arrow_endpoint&amp;quot;]  # e.g., &amp;quot;grpc+tls://data.dremio.cloud:443&amp;quot;
login_endpoint = st.secrets[&amp;quot;dremio_login_endpoint&amp;quot;]  # e.g., &amp;quot;https://your-dremio-server:9047/apiv2/login&amp;quot;

# Get authentication token
payload = {
    &amp;quot;userName&amp;quot;: username,
    &amp;quot;password&amp;quot;: password
}
token = get_token(uri=login_endpoint, payload=payload)

# Establish the connection to Dremio and verify it
try:
    dremio = DremioConnection(token, arrow_endpoint)
    st.success(&amp;quot;Successfully connected to Dremio.&amp;quot;)
except Exception as e:
    st.error(f&amp;quot;Failed to connect to Dremio: {e}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note: Ensure that you securely manage your credentials using Streamlit&apos;s secrets management or environment variables.&lt;/p&gt;
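&lt;p&gt;As a sketch of that secrets-management note: the &lt;code&gt;st.secrets&lt;/code&gt; keys used in the snippet above would live in a &lt;code&gt;.streamlit/secrets.toml&lt;/code&gt; file. The key names below simply mirror the example code; they are not a required schema.&lt;/p&gt;

```toml
# .streamlit/secrets.toml -- keep this file out of version control
dremio_username = "your-username"
dremio_password = "your-password"
dremio_arrow_endpoint = "grpc+tls://data.dremio.cloud:443"
dremio_login_endpoint = "https://your-dremio-server:9047/apiv2/login"
```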
&lt;h3&gt;Querying Data from Dremio&lt;/h3&gt;
&lt;p&gt;You can now query data from Dremio and retrieve it in various formats.&lt;/p&gt;
&lt;h4&gt;Retrieving Data as an Arrow Table&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Query data and get a FlightStreamReader object
stream = dremio.toArrow(&amp;quot;SELECT * FROM your_space.your_table LIMIT 100&amp;quot;)

# Convert the stream to an Arrow Table
arrow_table = stream.read_all()

# Optionally, display the data in Streamlit
df = arrow_table.to_pandas()
st.write(&amp;quot;Data from Dremio:&amp;quot;)
st.dataframe(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Retrieving Data as a Pandas DataFrame&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Directly get a Pandas DataFrame
df = dremio.toPandas(&amp;quot;SELECT * FROM your_space.your_table LIMIT 100&amp;quot;)
st.write(&amp;quot;Data from Dremio:&amp;quot;)
st.dataframe(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Retrieving Data as a Polars DataFrame&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Get a Polars DataFrame
df_polars = dremio.toPolars(&amp;quot;SELECT * FROM your_space.your_table LIMIT 100&amp;quot;)
st.write(&amp;quot;Data from Dremio (Polars DataFrame):&amp;quot;)
st.write(df_polars)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Querying with DuckDB&lt;/h3&gt;
&lt;p&gt;You can leverage DuckDB for in-memory analytics on the data retrieved from Dremio.&lt;/p&gt;
&lt;h4&gt;Using the DuckDB Relation API&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Retrieve data as a DuckDB relation
duck_rel = dremio.toDuckDB(&amp;quot;SELECT * FROM your_space.your_table LIMIT 100&amp;quot;)

# Perform queries on the DuckDB relation
result = duck_rel.filter(&amp;quot;column_name &amp;gt; 50&amp;quot;).df()

# Display the result
st.write(&amp;quot;Filtered Data using DuckDB:&amp;quot;)
st.dataframe(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Querying Arrow Objects with DuckDB&lt;/h4&gt;
&lt;p&gt;Alternatively, you can query Arrow Tables using DuckDB:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import duckdb

# Get data from Dremio as an Arrow Table
stream = dremio.toArrow(&amp;quot;SELECT * FROM your_space.your_table LIMIT 100&amp;quot;)
arrow_table = stream.read_all()

# Create a DuckDB connection
con = duckdb.connect()

# Register the Arrow Table with DuckDB
con.register(&amp;quot;dremio_table&amp;quot;, arrow_table)

# Perform SQL queries using DuckDB
query = &amp;quot;SELECT * FROM dremio_table WHERE column_name &amp;gt; 50&amp;quot;
result = con.execute(query).fetch_df()

# Display the result
st.write(&amp;quot;Filtered Data using DuckDB on Arrow Table:&amp;quot;)
st.dataframe(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Best Practices for Using Dremio with Streamlit&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Secure Credentials:&lt;/strong&gt; Always handle your Dremio credentials securely. Use Streamlit&apos;s secrets management or environment variables to avoid hardcoding sensitive information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient Data Retrieval:&lt;/strong&gt; Optimize your SQL queries to retrieve only the necessary data. Use LIMIT clauses and filters to reduce data transfer and improve performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error Handling:&lt;/strong&gt; Implement try-except blocks to manage exceptions and provide informative error messages to users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment Configuration:&lt;/strong&gt; Ensure that your arrow_endpoint and login_endpoint are correctly configured based on your Dremio deployment (Cloud or Software, with or without SSL).&lt;/li&gt;
&lt;/ul&gt;
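&lt;p&gt;The error-handling bullet can be made concrete with a small wrapper that any of the query calls in this section could pass through. &lt;code&gt;run_query&lt;/code&gt; is a hypothetical helper name, and returning an empty DataFrame on failure is just one possible policy; in the app you would surface the returned error message with &lt;code&gt;st.error()&lt;/code&gt;.&lt;/p&gt;

```python
import pandas as pd

def run_query(query_fn, query, fallback=None):
    """Run a data-retrieval callable, returning (result, error_message).

    query_fn is any callable that takes a SQL string and returns a DataFrame,
    e.g. dremio.toPandas from this section.
    """
    try:
        return query_fn(query), None
    except Exception as e:
        # Hand back a safe fallback plus the message the app can display
        return (fallback if fallback is not None else pd.DataFrame()), str(e)

# Usage with a stand-in query function (no live Dremio needed):
fake_source = lambda q: pd.DataFrame({"rows": [1, 2, 3]})
df, err = run_query(fake_source, "SELECT 1")
```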
&lt;h3&gt;Connecting to Databases Using sqlalchemy and psycopg2&lt;/h3&gt;
&lt;p&gt;In addition to Dremio, you might need to connect to other databases like PostgreSQL or MySQL. The Docker image comes with sqlalchemy, psycopg2-binary, and other database drivers pre-installed.&lt;/p&gt;
&lt;h4&gt;Setting Up a Connection to a PostgreSQL Database&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from sqlalchemy import create_engine
import pandas as pd

# Database connection parameters
DB_USER = st.secrets[&amp;quot;db_user&amp;quot;]
DB_PASSWORD = st.secrets[&amp;quot;db_password&amp;quot;]
DB_HOST = st.secrets[&amp;quot;db_host&amp;quot;]
DB_PORT = st.secrets[&amp;quot;db_port&amp;quot;]
DB_NAME = st.secrets[&amp;quot;db_name&amp;quot;]

# Create a database engine
engine = create_engine(f&apos;postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}&apos;)

# Test the connection
try:
    with engine.connect() as connection:
        st.success(&amp;quot;Successfully connected to the PostgreSQL database.&amp;quot;)
except Exception as e:
    st.error(f&amp;quot;Failed to connect to the database: {e}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Querying Data from the Database&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Sample query
query = &amp;quot;SELECT * FROM your_table LIMIT 10&amp;quot;

# Execute the query and load data into a DataFrame
df = pd.read_sql(query, engine)

# Display the data
st.write(&amp;quot;Data from PostgreSQL:&amp;quot;)
st.dataframe(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Handling Large Datasets with Dask&lt;/h3&gt;
&lt;p&gt;When dealing with large datasets, performance can become an issue. Dask is a parallel computing library that integrates with Pandas to handle larger-than-memory datasets efficiently.&lt;/p&gt;
&lt;h4&gt;Using Dask to Query Large Tables&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import dask.dataframe as dd

# Read data from SQL using Dask
df = dd.read_sql_table(
    table=&apos;large_table&apos;,
    uri=f&apos;postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}&apos;,
    index_col=&apos;id&apos;
)

# Perform computations with Dask DataFrame
filtered_df = df[df[&apos;value&apos;] &amp;gt; 100]

# Compute the result and convert to Pandas DataFrame
result = filtered_df.compute()

# Display the result
st.write(&amp;quot;Filtered Data:&amp;quot;)
st.dataframe(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Best Practices for Database Connectivity&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Secure Credentials:&lt;/strong&gt; Use Streamlit&apos;s secrets management or environment variables to store sensitive information.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parameterized Queries:&lt;/strong&gt; Always use parameterized queries to prevent SQL injection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection Management:&lt;/strong&gt; Use context managers (with statements) to ensure connections are properly closed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error Handling:&lt;/strong&gt; Implement try-except blocks to handle exceptions and provide user-friendly error messages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limit Data Fetching:&lt;/strong&gt; When displaying data in the app, limit the number of rows fetched to prevent performance issues.&lt;/li&gt;
&lt;/ul&gt;
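&lt;p&gt;The parameterized-query bullet deserves a concrete shape. Below is a minimal sketch using SQLAlchemy&apos;s &lt;code&gt;text()&lt;/code&gt; construct with bound parameters; an in-memory SQLite database stands in for the PostgreSQL engine from the examples above, and the table and column names are placeholders.&lt;/p&gt;

```python
import pandas as pd
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the PostgreSQL engine shown earlier
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE users (id INTEGER, name TEXT)"))
    conn.execute(text("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')"))

# Values travel as bound parameters (:min_id), so the driver escapes them --
# never interpolate user input into the SQL string with f-strings
query = text("SELECT * FROM users WHERE id >= :min_id")
df = pd.read_sql(query, engine, params={"min_id": 2})
```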
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;In this section, you&apos;ve learned how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Connect to Dremio using the dremio-simple-query library and retrieve data efficiently using Apache Arrow Flight.&lt;/li&gt;
&lt;li&gt;Query data from Dremio and convert it into various formats such as Arrow Tables, Pandas DataFrames, Polars DataFrames, and DuckDB relations.&lt;/li&gt;
&lt;li&gt;Utilize DuckDB for in-memory analytics on data retrieved from Dremio.&lt;/li&gt;
&lt;li&gt;Connect to other databases like PostgreSQL using sqlalchemy and psycopg2.&lt;/li&gt;
&lt;li&gt;Handle large datasets efficiently using Dask.&lt;/li&gt;
&lt;li&gt;Implement best practices for secure and efficient database connectivity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By integrating Dremio and other data systems into your Streamlit applications, you can create powerful data-driven apps that interact with live data sources, enabling real-time analysis and insights.&lt;/p&gt;
&lt;h1&gt;Deploying Streamlit Apps&lt;/h1&gt;
&lt;p&gt;With your Streamlit app developed and tested within the Docker environment, the next step is to deploy it so that others can access and use it. Deploying Streamlit apps can be done in several ways, including running the app locally, containerizing it with Docker, and deploying it to cloud platforms like Streamlit Community Cloud, Heroku, AWS, or other hosting services.&lt;/p&gt;
&lt;p&gt;In this section, we&apos;ll explore how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Run your Streamlit app outside of Jupyter Notebook&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Containerize your Streamlit app with Docker&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploy your app to cloud platforms&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Running Streamlit Apps Outside of Jupyter Notebook&lt;/h2&gt;
&lt;p&gt;While developing within Jupyter Notebook is convenient, deploying your app typically involves running it as a standalone script.&lt;/p&gt;
&lt;h3&gt;Steps to Run the App Locally&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ensure Streamlit is Installed&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you followed the previous sections, Streamlit should already be installed in your Docker container. If not, install it using:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;pip install streamlit
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Exit the Jupyter Notebook Environment&lt;/h4&gt;
&lt;p&gt;Stop the Jupyter Notebook server if it&apos;s still running.&lt;/p&gt;
&lt;h4&gt;Navigate to Your App Directory&lt;/h4&gt;
&lt;p&gt;Open a terminal and navigate to the directory containing your app.py file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cd /home/pydata/work
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Run the Streamlit App&lt;/h4&gt;
&lt;p&gt;Execute the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command starts the Streamlit server and serves your app at http://localhost:8501 by default.&lt;/p&gt;
&lt;h4&gt;Access the App in Your Browser&lt;/h4&gt;
&lt;p&gt;Open your web browser and navigate to http://localhost:8501 to interact with your app.&lt;/p&gt;
&lt;h3&gt;Containerizing Your Streamlit App with Docker&lt;/h3&gt;
&lt;p&gt;Containerizing your app ensures consistency across different environments and simplifies deployment.&lt;/p&gt;
&lt;h3&gt;Creating a Dockerfile for Your Streamlit App&lt;/h3&gt;
&lt;p&gt;In your app directory, create a file named &lt;code&gt;Dockerfile&lt;/code&gt; with the following content:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;# Use the official Python image as base
FROM python:3.9-slim

# Set the working directory
WORKDIR /app

# Copy the requirements file
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code
COPY . .

# Expose the port Streamlit uses
EXPOSE 8501

# Run the Streamlit app
CMD [&amp;quot;streamlit&amp;quot;, &amp;quot;run&amp;quot;, &amp;quot;app.py&amp;quot;, &amp;quot;--server.port=8501&amp;quot;, &amp;quot;--server.address=0.0.0.0&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Create a requirements.txt File&lt;/h4&gt;
&lt;p&gt;List all your Python dependencies in a file named requirements.txt:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;streamlit
pandas
numpy
# Add any other dependencies your app requires
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Build the Docker Image&lt;/h4&gt;
&lt;p&gt;In your terminal, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker build -t my-streamlit-app .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This builds the Docker image and tags it as my-streamlit-app.&lt;/p&gt;
&lt;h4&gt;Run the Docker Container&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker run -p 8501:8501 my-streamlit-app
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This maps port 8501 in the container to port 8501 on your host machine.&lt;/p&gt;
&lt;h4&gt;Access the App&lt;/h4&gt;
&lt;p&gt;Open your web browser and navigate to http://localhost:8501.&lt;/p&gt;
&lt;h4&gt;Pushing the Docker Image to a Registry (Optional)&lt;/h4&gt;
&lt;p&gt;If you plan to deploy your app using Docker images, you may need to push it to a Docker registry like Docker Hub or a private registry.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Tag the image for Docker Hub
docker tag my-streamlit-app your-dockerhub-username/my-streamlit-app


# Log in to Docker Hub
docker login

# Push the image
docker push your-dockerhub-username/my-streamlit-app
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Deploying to Cloud Platforms&lt;/h3&gt;
&lt;p&gt;There are several cloud platforms that support deploying Streamlit apps. Below, we&apos;ll cover deploying to Streamlit Community Cloud, Heroku, and AWS Elastic Beanstalk.&lt;/p&gt;
&lt;h4&gt;Deploying to Streamlit Community Cloud&lt;/h4&gt;
&lt;p&gt;Streamlit offers a free hosting service for public GitHub repositories.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Push Your App to GitHub:&lt;/strong&gt; Ensure your app code is in a GitHub repository.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sign Up for Streamlit Community Cloud:&lt;/strong&gt; Go to streamlit.io/cloud and sign up using your GitHub account.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploy Your App:&lt;/strong&gt; Click &amp;quot;New app&amp;quot;, select your GitHub repository and branch, specify the location of your app.py file, and click &amp;quot;Deploy&amp;quot;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Your App:&lt;/strong&gt; Once deployed, you&apos;ll receive a URL where your app is hosted.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Deploying to Heroku&lt;/h4&gt;
&lt;p&gt;Heroku is a cloud platform that supports deploying applications using Docker.&lt;/p&gt;
&lt;h5&gt;Create a Procfile&lt;/h5&gt;
&lt;p&gt;In your app directory, create a file named Procfile with the following content:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;web: streamlit run app.py --server.port=$PORT --server.address=0.0.0.0
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Create a requirements.txt File&lt;/h5&gt;
&lt;p&gt;Ensure you have a requirements.txt file listing your dependencies.&lt;/p&gt;
&lt;h5&gt;Initialize a Git Repository&lt;/h5&gt;
&lt;p&gt;If you haven&apos;t already:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git init
git add .
git commit -m &amp;quot;Initial commit&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Create a Heroku App&lt;/h5&gt;
&lt;p&gt;Install the Heroku CLI and log in:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;heroku login
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Create a new app:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;heroku create your-app-name
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Deploy Your App&lt;/h5&gt;
&lt;p&gt;Push your code to Heroku:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git push heroku main  # use &apos;master&apos; if your repository still uses that branch name
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Scale the Web Process&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;heroku ps:scale web=1
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Access Your App&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;heroku open
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Deploying to AWS Elastic Beanstalk&lt;/h4&gt;
&lt;p&gt;AWS Elastic Beanstalk supports deploying applications in Docker containers.&lt;/p&gt;
&lt;h5&gt;Install the AWS Elastic Beanstalk CLI&lt;/h5&gt;
&lt;p&gt;Follow the official AWS documentation to install the EB CLI.&lt;/p&gt;
&lt;h5&gt;Initialize Elastic Beanstalk&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;eb init -p docker my-streamlit-app
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Create an Environment&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;eb create my-streamlit-env
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Deploy Your App&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;eb deploy
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Access Your App&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;eb open
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Deploying with Other Services&lt;/h4&gt;
&lt;p&gt;You can deploy your Streamlit app using other platforms like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Google Cloud Run:&lt;/strong&gt; For serverless container deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Azure App Service:&lt;/strong&gt; For deploying web apps on Azure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kubernetes:&lt;/strong&gt; For scalable and managed deployments.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Docker Compose:&lt;/strong&gt; For multi-container applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Example: Deploying to Google Cloud Run&lt;/h4&gt;
&lt;h5&gt;Build and Push the Docker Image to Google Container Registry&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Build the Docker image
docker build -t gcr.io/your-project-id/my-streamlit-app .

# Push the image
docker push gcr.io/your-project-id/my-streamlit-app
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Deploy to Cloud Run&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;gcloud run deploy my-streamlit-app \
  --image gcr.io/your-project-id/my-streamlit-app \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Best Practices for Deployment&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Environment Variables:&lt;/strong&gt; Use environment variables to manage secrets and configuration settings.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Logging:&lt;/strong&gt; Implement logging to monitor your app&apos;s performance and errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security:&lt;/strong&gt; Ensure your app is secure by handling user input appropriately and securing API keys.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Choose deployment options that allow your app to scale with user demand.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Continuous Integration/Continuous Deployment (CI/CD):&lt;/strong&gt; Set up CI/CD pipelines to automate the deployment process.&lt;/li&gt;
&lt;/ul&gt;
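&lt;p&gt;To make the CI/CD bullet concrete, here is a minimal sketch of one possible GitHub Actions workflow that rebuilds and pushes the Docker image from the earlier section on every push. The file name, branch, and secret names are all placeholders for your own setup.&lt;/p&gt;

```yaml
# .github/workflows/deploy.yml -- illustrative only
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the image
        run: docker build -t my-streamlit-app .
      - name: Push to Docker Hub
        run: |
          echo "${{ secrets.DOCKERHUB_TOKEN }}" | docker login -u "${{ secrets.DOCKERHUB_USERNAME }}" --password-stdin
          docker tag my-streamlit-app "${{ secrets.DOCKERHUB_USERNAME }}/my-streamlit-app"
          docker push "${{ secrets.DOCKERHUB_USERNAME }}/my-streamlit-app"
```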
&lt;h4&gt;Managing Secrets and Configuration&lt;/h4&gt;
&lt;p&gt;Use environment variables to store sensitive information:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os

API_KEY = os.getenv(&amp;quot;API_KEY&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Set the environment variable in your deployment platform&apos;s settings or configuration.&lt;/p&gt;
&lt;h4&gt;Implementing Logging&lt;/h4&gt;
&lt;p&gt;Use Python&apos;s built-in logging library:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import logging

logging.basicConfig(level=logging.INFO)
logging.info(&amp;quot;This is an info message&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Handling User Input Securely&lt;/h4&gt;
&lt;p&gt;Validate and sanitize all user inputs to prevent security vulnerabilities like injection attacks.&lt;/p&gt;
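&lt;p&gt;As a sketch of that advice, input validation can live in a plain function that is easy to unit-test. The rules below (required, length cap, alphanumeric only) are illustrative choices rather than a standard, and &lt;code&gt;validate_username&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def validate_username(raw: str):
    """Return (is_valid, message) for a username field."""
    value = raw.strip()
    if not value:
        return False, "Username is required."
    if len(value) > 32:
        return False, "Username must be 32 characters or fewer."
    if not value.isalnum():
        return False, "Username may only contain letters and digits."
    return True, ""

# In the app: ok, msg = validate_username(st.text_input("Username"))
#             if not ok: st.error(msg)
ok, msg = validate_username("  alice1  ")
```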
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;In this section, you&apos;ve learned how to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Run your Streamlit app outside of the development environment&lt;/li&gt;
&lt;li&gt;Containerize your app using Docker for consistent deployments&lt;/li&gt;
&lt;li&gt;Deploy your app to cloud platforms like Streamlit Community Cloud, Heroku, and AWS Elastic Beanstalk&lt;/li&gt;
&lt;li&gt;Apply best practices for deploying and maintaining your Streamlit applications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By deploying your Streamlit app, you make it accessible to a wider audience, allowing others to benefit from your interactive data applications.&lt;/p&gt;
&lt;h1&gt;Best Practices and Tips&lt;/h1&gt;
&lt;p&gt;Developing Streamlit applications involves not just coding but also adhering to best practices that ensure your app is efficient, maintainable, and user-friendly. In this section, we&apos;ll cover some essential tips and best practices to help you optimize your Streamlit apps.&lt;/p&gt;
&lt;h2&gt;Organizing Your Streamlit Codebase&lt;/h2&gt;
&lt;p&gt;A well-organized codebase enhances readability and maintainability, especially as your application grows in complexity.&lt;/p&gt;
&lt;h3&gt;Use Modular Code Structure&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Separate Concerns&lt;/strong&gt;: Break down your code into modules or scripts based on functionality, such as data loading, preprocessing, visualization, and utility functions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Create a &lt;code&gt;components&lt;/code&gt; Module&lt;/strong&gt;: Encapsulate reusable UI components in a separate module to avoid code duplication.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# components.py
import streamlit as st

def sidebar_filters():
    st.sidebar.header(&amp;quot;Filters&amp;quot;)
    # Add filter widgets here
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# main app.py
import streamlit as st
from components import sidebar_filters

sidebar_filters()
# Rest of your app code
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Follow Naming Conventions&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Consistent Naming:&lt;/strong&gt; Use meaningful variable and function names that follow Python&apos;s naming conventions.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Folder Structure:&lt;/strong&gt; Organize files into folders such as data, models, utils, and pages if using Streamlit&apos;s multipage apps.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Use Virtual Environments&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Environment Isolation:&lt;/strong&gt; Use virtual environments (e.g., venv, conda, or pipenv) to manage dependencies and avoid conflicts.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Version Control&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Git:&lt;/strong&gt; Use Git for version control to track changes and collaborate with others.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;.gitignore:&lt;/strong&gt; Include a .gitignore file to exclude unnecessary files from your repository.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;__pycache__/
.DS_Store
venv/
.env
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Enhancing User Experience with Custom Themes and Layouts&lt;/h3&gt;
&lt;p&gt;A polished UI enhances the user experience and makes your app more engaging.&lt;/p&gt;
&lt;h4&gt;Custom Themes&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streamlit Themes:&lt;/strong&gt; Customize the appearance of your app using Streamlit&apos;s theming options.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Modify config.toml:&lt;/strong&gt; Create a &lt;code&gt;.streamlit/config.toml&lt;/code&gt; file to define your theme settings.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-toml&quot;&gt;[theme]
primaryColor=&amp;quot;#6eb52f&amp;quot;
backgroundColor=&amp;quot;#f0f0f5&amp;quot;
secondaryBackgroundColor=&amp;quot;#e0e0ef&amp;quot;
textColor=&amp;quot;#262730&amp;quot;
font=&amp;quot;sans serif&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Responsive Layouts&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Use Columns and Containers:&lt;/strong&gt; Organize content using &lt;code&gt;st.columns()&lt;/code&gt;, &lt;code&gt;st.container()&lt;/code&gt;, and &lt;code&gt;st.expander()&lt;/code&gt; for a clean layout.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;col1, col2 = st.columns(2)

with col1:
    st.header(&amp;quot;Section 1&amp;quot;)
    # Content for section 1

with col2:
    st.header(&amp;quot;Section 2&amp;quot;)
    # Content for section 2
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Interactive Elements&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Feedback:&lt;/strong&gt; Use &lt;code&gt;st.progress()&lt;/code&gt;, &lt;code&gt;st.spinner()&lt;/code&gt;, and &lt;code&gt;st.toast()&lt;/code&gt; to provide feedback during long computations.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;with st.spinner(&apos;Loading data...&apos;):
    df = load_data()
st.success(&apos;Data loaded successfully!&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Tooltips and Help Text:&lt;/strong&gt; Add tooltips or help text to widgets to guide users.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;st.text_input(&amp;quot;Username&amp;quot;, help=&amp;quot;Enter your user ID assigned by the administrator&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Accessibility&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Alt Text for Images:&lt;/strong&gt; Use the caption parameter in &lt;code&gt;st.image()&lt;/code&gt; to provide descriptions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;st.image(&apos;chart.png&apos;, caption=&apos;Sales over time&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Keyboard Navigation:&lt;/strong&gt; Ensure that all interactive elements can be navigated using the keyboard.&lt;/p&gt;
&lt;h3&gt;Debugging Common Issues in Streamlit Apps&lt;/h3&gt;
&lt;p&gt;Being able to identify and fix issues quickly is crucial for smooth app development.&lt;/p&gt;
&lt;h4&gt;Common Issues and Solutions&lt;/h4&gt;
&lt;h5&gt;App Crashes or Freezes&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Infinite Loops:&lt;/strong&gt; Ensure that your code doesn&apos;t have infinite loops that can block the app.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Large Data Loading:&lt;/strong&gt; Use caching with @st.cache_data to prevent reloading data on every interaction.&lt;/li&gt;
&lt;/ul&gt;
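&lt;p&gt;&lt;code&gt;@st.cache_data&lt;/code&gt; memoizes a function on its arguments and reuses the result across reruns. Here is the same idea in plain Python, using &lt;code&gt;functools.lru_cache&lt;/code&gt; as a stand-in so the behavior is visible outside Streamlit; in the app you would use the Streamlit decorator instead.&lt;/p&gt;

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=None)  # stand-in for @st.cache_data in this sketch
def load_data(path: str):
    global call_count
    call_count += 1            # counts how often the body actually executes
    return [path.upper()] * 3  # stand-in for an expensive file read

first = load_data("data.csv")
second = load_data("data.csv")  # served from the cache; the body does not rerun
```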
&lt;h5&gt;Slow Performance&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Heavy Computations:&lt;/strong&gt; Optimize code by using efficient algorithms or leveraging libraries like NumPy and Pandas.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Caching:&lt;/strong&gt; Use @st.cache_data and @st.cache_resource to cache expensive operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Widget State Not Preserved&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Session State:&lt;/strong&gt; Use st.session_state to maintain state across interactions.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;if &apos;counter&apos; not in st.session_state:
    st.session_state.counter = 0

increment = st.button(&apos;Increment&apos;)
if increment:
    st.session_state.counter += 1

st.write(f&amp;quot;Counter: {st.session_state.counter}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Errors When Deploying&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Dependency Mismatches:&lt;/strong&gt; Ensure that all dependencies are listed in requirements.txt and versions are compatible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Environment Variables:&lt;/strong&gt; Check that all required environment variables are set in the deployment environment.&lt;/p&gt;
&lt;h3&gt;Streamlit Version Issues&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;API Changes:&lt;/strong&gt; If you encounter deprecated functions, update your code to match the latest Streamlit API.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version Pinning:&lt;/strong&gt; Specify the Streamlit version in your requirements.txt to maintain consistency.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;streamlit==1.25.0
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Logging and Error Tracking&lt;/h3&gt;
&lt;h5&gt;Use Logging&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import logging

logging.basicConfig(level=logging.INFO)
logging.info(&amp;quot;This is an info message&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Display Errors&lt;/h5&gt;
&lt;p&gt;Use &lt;code&gt;st.error()&lt;/code&gt; to display error messages to the user.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;try:
    result = perform_calculation()
    st.write(result)
except Exception as e:
    st.error(f&amp;quot;An error occurred: {e}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Testing&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Unit Tests:&lt;/strong&gt; Write unit tests for your functions using &lt;code&gt;unittest&lt;/code&gt; or &lt;code&gt;pytest&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test Scripts:&lt;/strong&gt; Create test scripts to simulate user interactions and verify app behavior.&lt;/p&gt;
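&lt;p&gt;App logic factored into plain Python functions can be tested without running Streamlit at all. A minimal sketch with pytest (the function and file names here are hypothetical):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# utils.py -- plain helper used by the app (hypothetical)
def filter_positive(values):
    &amp;quot;&amp;quot;&amp;quot;Return only the positive numbers from a list.&amp;quot;&amp;quot;&amp;quot;
    return [v for v in values if v &amp;gt; 0]

# test_utils.py -- run with `pytest`
def test_filter_positive():
    assert filter_positive([-2, 0, 3, 5]) == [3, 5]
&lt;/code&gt;&lt;/pre&gt;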
&lt;h3&gt;Performance Optimization&lt;/h3&gt;
&lt;p&gt;Optimizing your app&apos;s performance ensures a better user experience.&lt;/p&gt;
&lt;h3&gt;Efficient Data Handling&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Lazy Loading:&lt;/strong&gt; Load data only when necessary, perhaps in response to user input.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Sampling:&lt;/strong&gt; For large datasets, consider using a sample for initial display and provide options to load more data.&lt;/p&gt;
&lt;h3&gt;Use of Caching&lt;/h3&gt;
&lt;h5&gt;Cache Data Loading&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;@st.cache_data
def load_data():
    # Load data from source
    return data
&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Cache Computations&lt;/h5&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;@st.cache_data
def compute_expensive_operation(params):
    # Perform computation
    return result
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Optimize Resource Usage&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Avoid Redundant Computations:&lt;/strong&gt; Structure code to prevent unnecessary re-execution of functions.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Clear Session State When Needed:&lt;/strong&gt; Manage st.session_state to free up memory if variables are no longer needed.&lt;/p&gt;
&lt;h3&gt;Security Considerations&lt;/h3&gt;
&lt;p&gt;Ensure your app is secure, especially when handling sensitive data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Input Validation:&lt;/strong&gt; Always validate and sanitize user inputs.&lt;/p&gt;
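&lt;p&gt;As a minimal sketch, validation logic can live in a plain function so it is easy to test in isolation (the username policy here is an assumption for illustration):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import re

def is_valid_username(name):
    # Example policy: 3-20 letters, digits, or underscores
    return re.fullmatch(r&amp;quot;[A-Za-z0-9_]{3,20}&amp;quot;, name) is not None

# In the app:
#   username = st.text_input(&amp;quot;Username&amp;quot;)
#   if username and not is_valid_username(username):
#       st.error(&amp;quot;Username must be 3-20 letters, digits, or underscores.&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;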
&lt;p&gt;&lt;strong&gt;Secrets Management:&lt;/strong&gt; Use Streamlit&apos;s secrets management to handle API keys and passwords.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;API_KEY = st.secrets[&amp;quot;api_key&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;HTTPS:&lt;/strong&gt; Deploy your app using HTTPS to encrypt data in transit.&lt;/p&gt;
&lt;h3&gt;Documentation and User Guides&lt;/h3&gt;
&lt;p&gt;Provide documentation to help users understand and navigate your app.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Inline Documentation:&lt;/strong&gt; Use &lt;code&gt;st.markdown()&lt;/code&gt; or &lt;code&gt;st.write()&lt;/code&gt; to include instructions and explanations within the app.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;User Manuals:&lt;/strong&gt; Provide a downloadable or linked user guide for complex applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tooltips:&lt;/strong&gt; Utilize the help parameter in widgets to give users quick hints.&lt;/p&gt;
&lt;h3&gt;Keep Up with Streamlit Updates&lt;/h3&gt;
&lt;p&gt;Streamlit is actively developed, and staying updated can help you leverage new features.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Changelog:&lt;/strong&gt; Regularly check the Streamlit changelog for updates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Community Forums:&lt;/strong&gt; Participate in the Streamlit community forums to learn from others and share your experiences.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update Dependencies:&lt;/strong&gt; Periodically update your dependencies to benefit from performance improvements and security patches.&lt;/p&gt;
&lt;h3&gt;Summary&lt;/h3&gt;
&lt;p&gt;By following these best practices and tips, you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Enhance the maintainability and readability of your code.&lt;/li&gt;
&lt;li&gt;Create a more engaging and user-friendly app interface.&lt;/li&gt;
&lt;li&gt;Quickly identify and resolve issues during development.&lt;/li&gt;
&lt;li&gt;Optimize your app&apos;s performance for a better user experience.&lt;/li&gt;
&lt;li&gt;Ensure the security and integrity of your application and data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Implementing these strategies will help you develop professional, robust, and efficient Streamlit applications that meet the needs of your users and stakeholders.&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In this comprehensive guide, we&apos;ve embarked on a journey to master Streamlit using the Python Data Science Notebook Docker Image. Throughout the chapters, we&apos;ve explored how to set up a robust environment, harness the power of Streamlit for building interactive data applications, and leverage advanced features to enhance functionality and user experience.&lt;/p&gt;
&lt;h2&gt;Recap of Key Learnings&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Environment Setup&lt;/strong&gt;: Established a consistent and portable development environment using Docker, ensuring all necessary libraries and tools are readily available.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Getting Started with Streamlit&lt;/strong&gt;: Created our first Streamlit app, understanding the basic structure and core components that make up a Streamlit application.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interactive Data Visualizations&lt;/strong&gt;: Leveraged built-in Streamlit functions and integrated libraries like Altair and Plotly to build dynamic and interactive visualizations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advanced Features&lt;/strong&gt;: Utilized state management with &lt;code&gt;st.session_state&lt;/code&gt;, dynamic content creation with layout elements, and performance optimization through caching mechanisms.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrating Machine Learning Models&lt;/strong&gt;: Loaded and interacted with machine learning models using TensorFlow, PyTorch, and scikit-learn, making predictions and visualizing outcomes within Streamlit apps.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database Connectivity&lt;/strong&gt;: Connected to various databases, including Dremio, PostgreSQL, and MySQL, using powerful libraries to query and manipulate data efficiently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploying Streamlit Apps&lt;/strong&gt;: Explored different deployment strategies, from running apps locally to containerizing with Docker and deploying on cloud platforms like Streamlit Community Cloud, Heroku, and AWS.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Best Practices and Tips&lt;/strong&gt;: Emphasized the importance of code organization, user experience enhancements, debugging techniques, performance optimization, and security considerations to build professional and robust applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Next Steps to Further Explore Streamlit&lt;/h2&gt;
&lt;p&gt;While we&apos;ve covered a significant amount of ground, there&apos;s always more to learn and explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dive Deeper into Streamlit Components&lt;/strong&gt;: Experiment with custom components and the Streamlit Component API to extend the functionality of your apps.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explore Streamlit&apos;s Multipage Apps&lt;/strong&gt;: Organize complex applications into multiple pages for better user navigation and structure.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Integrate Additional Libraries&lt;/strong&gt;: Incorporate other data science and machine learning libraries to expand the capabilities of your applications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Contribute to the Community&lt;/strong&gt;: Share your apps and components with the Streamlit community, contribute to open-source projects, and engage in discussions to learn from others.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Additional Resources and Communities&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Official Streamlit Documentation&lt;/strong&gt;: &lt;a href=&quot;https://docs.streamlit.io/&quot;&gt;docs.streamlit.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlit Forums&lt;/strong&gt;: Engage with the community on the &lt;a href=&quot;https://discuss.streamlit.io/&quot;&gt;Streamlit Discourse&lt;/a&gt; platform.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Streamlit on GitHub&lt;/strong&gt;: Explore the source code and contribute at &lt;a href=&quot;https://github.com/streamlit/streamlit&quot;&gt;github.com/streamlit/streamlit&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tutorials and Courses&lt;/strong&gt;: Look for online tutorials, courses, and webinars that cover advanced topics and real-world use cases.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Blogs and Articles&lt;/strong&gt;: Follow blogs and articles by data science professionals who share insights and best practices.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;Streamlit has revolutionized the way we create and share data applications, making it accessible for data scientists and developers to build interactive web apps with ease. By combining Streamlit with the Python Data Science Notebook Docker Image, we&apos;ve established a powerful workflow that simplifies environment setup and accelerates application development.&lt;/p&gt;
&lt;p&gt;As you continue your journey, remember that the key to mastery is consistent practice and exploration. Don&apos;t hesitate to experiment with new ideas, seek feedback, and iterate on your applications. The world of data science is ever-evolving, and tools like Streamlit are at the forefront of making data more accessible and engaging for everyone.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A Deep Dive into Docker Compose</title><link>https://iceberglakehouse.com/posts/2024-9-a-deep-dive-into-docker-compose/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-9-a-deep-dive-into-docker-compose/</guid><description>
## Understanding the Docker Compose File Structure

Docker Compose uses a YAML file (`docker-compose.yml`) to define services, networks, and volumes ...</description><pubDate>Sat, 21 Sep 2024 09:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Understanding the Docker Compose File Structure&lt;/h2&gt;
&lt;p&gt;Docker Compose uses a YAML file (&lt;code&gt;docker-compose.yml&lt;/code&gt;) to define services, networks, and volumes that make up your application. The structure is easy to understand and is highly configurable, allowing you to manage multiple containers with a single file.&lt;/p&gt;
&lt;p&gt;Here’s an overview of the basic components of a &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;/p&gt;
&lt;h3&gt;Version&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;version&lt;/code&gt; key defines which version of the Docker Compose file format is being used. Some features in Docker Compose may only be available in certain versions. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &apos;3&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Services&lt;/h3&gt;
&lt;p&gt;The services section is where you define each container that will be part of your application. Each service is essentially a container that you configure with parameters like image, build options, environment variables, ports, etc.&lt;/p&gt;
&lt;p&gt;Here’s an example of defining a basic web service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:latest
    ports:
      - &amp;quot;8080:80&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, we&apos;re using an existing image (nginx) from Docker Hub and exposing port 80 from the container to port 8080 on the host machine.&lt;/p&gt;
&lt;h3&gt;Networks&lt;/h3&gt;
&lt;p&gt;By default, Docker Compose creates a bridge network for all services to communicate. However, you can define custom networks to better control communication between services.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;networks:
  my_network:
    driver: bridge
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once a network is defined, you can assign services to this network for better isolation and control.&lt;/p&gt;
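&lt;p&gt;For example, a minimal sketch that attaches a service to the custom network:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:latest
    networks:
      - my_network

networks:
  my_network:
    driver: bridge
&lt;/code&gt;&lt;/pre&gt;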
&lt;h3&gt;Volumes&lt;/h3&gt;
&lt;p&gt;The volumes section allows you to create and manage persistent storage that is not tied to the container&apos;s lifecycle. This is useful when you need to persist data across container restarts.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;volumes:
  my_volume:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then attach this volume to a service to persist data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  db:
    image: postgres
    volumes:
      - my_volume:/var/lib/postgresql/data
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the Postgres database data is stored in the volume my_volume, ensuring that the data is not lost when the container stops or restarts.&lt;/p&gt;
&lt;p&gt;With these basic components, you can already start defining a multi-container application. The flexibility of Docker Compose makes it easy to scale and manage services as your project grows.&lt;/p&gt;
&lt;h2&gt;Configuring Services in Docker Compose&lt;/h2&gt;
&lt;p&gt;In Docker Compose, services represent individual containers that run different parts of your application. You can define as many services as needed, and Docker Compose will manage them, making it easier to orchestrate multi-container applications.&lt;/p&gt;
&lt;h3&gt;Defining Services&lt;/h3&gt;
&lt;p&gt;Each service is defined under the &lt;code&gt;services&lt;/code&gt; section in the &lt;code&gt;docker-compose.yml&lt;/code&gt; file. The most basic configuration for a service includes specifying an image or a build option, ports to expose, and any additional service-specific configurations like volumes, networks, or environment variables.&lt;/p&gt;
&lt;p&gt;Here’s an example of a simple setup with a web server and a database service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:latest
    ports:
      - &amp;quot;8080:80&amp;quot;
    networks:
      - app_network

  db:
    image: postgres:13
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
    volumes:
      - db_data:/var/lib/postgresql/data
    networks:
      - app_network
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Service Configuration Options&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Service Names:&lt;/strong&gt; The name you give a service (e.g., web, db) is important because Docker Compose uses these names for automatic DNS resolution between containers. Services can communicate with each other using their names as hostnames, without needing IP addresses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Images:&lt;/strong&gt; You can use pre-built images from Docker Hub or any other registry by specifying the image option. In the above example, the web service uses the nginx image, and the database uses the postgres image.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;image: nginx:latest
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ports:&lt;/strong&gt; The ports option exposes container ports to the host machine. This is useful for services like web servers or APIs that need to be accessible outside the Docker network.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;ports:
  - &amp;quot;8080:80&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, port 80 inside the container is mapped to port 8080 on the host machine.&lt;/p&gt;
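&lt;p&gt;You can also bind a published port to a specific host interface. For example, to make the service reachable only from the host machine itself:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;ports:
  - &amp;quot;127.0.0.1:8080:80&amp;quot;  # only reachable via localhost on the host
&lt;/code&gt;&lt;/pre&gt;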
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Environment Variables:&lt;/strong&gt; Many services require environment variables for configuration. In Docker Compose, you can easily define environment variables in the environment section. This is particularly useful for setting up databases or any other service that requires external configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;environment:
  POSTGRES_USER: admin
  POSTGRES_PASSWORD: secret
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Volumes:&lt;/strong&gt; Volumes are used to persist data between container restarts. In the example above, a volume is mounted for the Postgres database to ensure that data is not lost when the container stops or is removed.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;volumes:
  - db_data:/var/lib/postgresql/data
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Networks:&lt;/strong&gt; Services are assigned to networks to manage how they communicate with each other. By placing the web and db services on the same network, we enable them to communicate using the service names (web, db) as hostnames.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;networks:
  - app_network
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Service Dependencies&lt;/h3&gt;
&lt;p&gt;Sometimes, one service depends on another. For example, in a web application, the web server may depend on the database being available. Docker Compose allows you to define startup ordering between services using the &lt;code&gt;depends_on&lt;/code&gt; option.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:latest
    depends_on:
      - db
    ports:
      - &amp;quot;8080:80&amp;quot;

  db:
    image: postgres:13
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup, Docker Compose ensures that the db service starts before the web service.&lt;/p&gt;
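&lt;p&gt;Keep in mind that &lt;code&gt;depends_on&lt;/code&gt; by itself only waits for the dependency&apos;s container to start, not for the service inside it to be ready. If readiness matters, you can combine &lt;code&gt;depends_on&lt;/code&gt; with a healthcheck, sketched here for Postgres:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:latest
    depends_on:
      db:
        condition: service_healthy  # wait until the healthcheck passes

  db:
    image: postgres:13
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: [&amp;quot;CMD-SHELL&amp;quot;, &amp;quot;pg_isready -U admin&amp;quot;]
      interval: 5s
      retries: 5
&lt;/code&gt;&lt;/pre&gt;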
&lt;p&gt;With these service configuration options, you have the building blocks to define your application&apos;s architecture. Docker Compose makes it simple to manage the lifecycle of each service and its relationships to others in the stack.&lt;/p&gt;
&lt;h2&gt;Working with Environment Variables in Docker Compose&lt;/h2&gt;
&lt;p&gt;Environment variables are an essential part of configuring services in Docker Compose. They allow you to customize the behavior of each service without hardcoding values in your &lt;code&gt;docker-compose.yml&lt;/code&gt; file. This flexibility is especially useful when working with different environments, such as development, testing, and production.&lt;/p&gt;
&lt;h3&gt;Defining Environment Variables&lt;/h3&gt;
&lt;p&gt;There are several ways to define environment variables in Docker Compose:&lt;/p&gt;
&lt;h3&gt;1. Directly in the &lt;code&gt;docker-compose.yml&lt;/code&gt; File&lt;/h3&gt;
&lt;p&gt;You can define environment variables directly under the &lt;code&gt;environment&lt;/code&gt; key for each service. This approach is useful for simple configurations, but it might clutter the file if you have many variables.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    image: myapp:latest
    environment:
      - APP_ENV=production
      - APP_DEBUG=false
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, &lt;code&gt;APP_ENV&lt;/code&gt; is set to production, and &lt;code&gt;APP_DEBUG&lt;/code&gt; is disabled by setting it to false.&lt;/p&gt;
&lt;h3&gt;2. Using an &lt;code&gt;.env&lt;/code&gt; File&lt;/h3&gt;
&lt;p&gt;A more common practice is to separate environment variables from the docker-compose.yml file by using an &lt;code&gt;.env&lt;/code&gt; file. This file contains key-value pairs and allows you to manage your environment variables more cleanly. Docker Compose will automatically load the &lt;code&gt;.env&lt;/code&gt; file if it is in the same directory as the &lt;code&gt;docker-compose.yml&lt;/code&gt; file.&lt;/p&gt;
&lt;h4&gt;Example .env File:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;APP_ENV=production
APP_DEBUG=false
DATABASE_URL=postgres://admin:secret@db:5432/mydb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your docker-compose.yml file, reference these variables like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    image: myapp:latest
    environment:
      - APP_ENV
      - APP_DEBUG
      - DATABASE_URL
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Docker Compose will substitute the values from the .env file automatically when it starts the services.&lt;/p&gt;
&lt;h3&gt;3. Using Environment Files with the &lt;code&gt;env_file&lt;/code&gt; Option&lt;/h3&gt;
&lt;p&gt;Alternatively, you can load environment variables from a file explicitly by using the &lt;code&gt;env_file&lt;/code&gt; option in the &lt;code&gt;docker-compose.yml&lt;/code&gt; file. This is useful when you want to load variables from multiple files or have separate files for different environments (e.g., .env.development, .env.production).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    image: myapp:latest
    env_file:
      - .env
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can also specify multiple environment files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    image: myapp:latest
    env_file:
      - .env
      - .env.custom
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;4. Overriding Environment Variables&lt;/h3&gt;
&lt;p&gt;If you define environment variables in both the &lt;code&gt;docker-compose.yml&lt;/code&gt; file and the .env file, the variables in &lt;code&gt;docker-compose.yml&lt;/code&gt; will take precedence. This allows you to have default values in your &lt;code&gt;.env&lt;/code&gt; file while overriding them on a per-service basis.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    image: myapp:latest
    environment:
      - APP_ENV=development  # Overrides the value from .env
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Example: Configuring a Database Service with Environment Variables&lt;/h3&gt;
&lt;p&gt;Here’s an example of a database service configuration using environment variables:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  db:
    image: postgres:13
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=${POSTGRES_DB}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the .env file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;POSTGRES_USER=admin
POSTGRES_PASSWORD=secret
POSTGRES_DB=my_database
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup, the Postgres service will use the credentials defined in the .env file. This setup ensures that sensitive information like passwords is not hardcoded in the docker-compose.yml file.&lt;/p&gt;
&lt;h3&gt;Security Considerations&lt;/h3&gt;
&lt;p&gt;While using environment variables makes it easier to configure services, it’s important to be mindful of security:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Never commit .env files containing sensitive data (e.g., API keys, passwords) to version control systems. Use .gitignore to exclude .env files.&lt;/li&gt;
&lt;li&gt;Use Docker secrets for sensitive information in production environments. Docker Compose has native support for managing secrets more securely, especially in a Swarm setup.&lt;/li&gt;
&lt;/ul&gt;
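&lt;p&gt;As a sketch, file-based secrets in Docker Compose look like this (the file path and secret name are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  db:
    image: postgres:13
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password  # read the password from the mounted secret
    secrets:
      - db_password

secrets:
  db_password:
    file: ./db_password.txt  # keep this file out of version control
&lt;/code&gt;&lt;/pre&gt;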
&lt;p&gt;Environment variables are a powerful feature in Docker Compose, providing flexibility and control over your services&apos; configurations. Whether you’re working in development or deploying to production, properly managing environment variables will help keep your configurations clean, secure, and scalable.&lt;/p&gt;
&lt;h2&gt;Networking in Docker Compose&lt;/h2&gt;
&lt;p&gt;One of the key strengths of Docker Compose is its ability to create networks for services to communicate with each other. Docker automatically sets up a network for your application, allowing services to communicate internally using their service names as hostnames. Understanding how networking works in Docker Compose will help you design more efficient, secure, and scalable applications.&lt;/p&gt;
&lt;h3&gt;1. Default Network Behavior&lt;/h3&gt;
&lt;p&gt;When you run &lt;code&gt;docker-compose up&lt;/code&gt; for the first time, Docker automatically creates a default network for your services. All services in the &lt;code&gt;docker-compose.yml&lt;/code&gt; file are attached to this network unless you define custom networks. Services can communicate with each other using their service names as DNS hostnames.&lt;/p&gt;
&lt;p&gt;For example, if you have the following configuration:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    ports:
      - &amp;quot;8080:80&amp;quot;

  db:
    image: postgres
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup, the web service can access the db service using the hostname db. There’s no need to define IP addresses; Docker manages the DNS resolution internally.&lt;/p&gt;
&lt;h3&gt;2. Defining Custom Networks&lt;/h3&gt;
&lt;p&gt;Although the default network works in most cases, you may want more control over how your services communicate, especially in larger or more complex setups. You can create custom networks to organize service communication or to isolate certain services from others.&lt;/p&gt;
&lt;p&gt;To define a custom network, use the networks key in your &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;networks:
  frontend_network:
  backend_network:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then assign services to these networks:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    networks:
      - frontend_network
      - backend_network

  db:
    image: postgres
    networks:
      - backend_network
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the web service is attached to both frontend_network and backend_network, allowing it to communicate with both the front-end and back-end services. The db service is only attached to backend_network, which limits its exposure to the internal services.&lt;/p&gt;
&lt;h3&gt;3. Bridge Network Mode&lt;/h3&gt;
&lt;p&gt;The most common network mode in Docker Compose is the bridge network, which allows containers on the same network to communicate with each other. This is the default mode for networks unless you explicitly specify another network driver.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;networks:
  my_bridge_network:
    driver: bridge
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can attach services to this network by specifying it in the services section:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    networks:
      - my_bridge_network

  db:
    image: postgres
    networks:
      - my_bridge_network
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, both services are connected via the my_bridge_network and can communicate freely using their service names (web and db).&lt;/p&gt;
&lt;h3&gt;4. Host Network Mode&lt;/h3&gt;
&lt;p&gt;In some cases, you may need your containers to share the network stack of the host. This is called the host network mode. In this mode, containers bypass Docker&apos;s network isolation and bind directly to the host’s network interface. This mode is useful when low-latency communication or direct access to the host’s network is required, but it reduces network isolation between containers and the host.&lt;/p&gt;
&lt;p&gt;To use the host network mode:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    network_mode: &amp;quot;host&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, be cautious when using the host network mode because it can introduce security risks by exposing your containers directly to the host network. Note also that host networking is only fully supported on Linux hosts.&lt;/p&gt;
&lt;h3&gt;5. External Networks&lt;/h3&gt;
&lt;p&gt;In some cases, you might want to connect your services to networks created outside of Docker Compose. This is particularly useful when you have services running in separate Compose projects or standalone Docker containers that need to communicate with each other.&lt;/p&gt;
&lt;p&gt;To use an external network, you first create the network using the Docker CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker network create my_external_network
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, in your docker-compose.yml file, define the network as external:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;networks:
  my_external_network:
    external: true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now, you can assign services to this external network:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    networks:
      - my_external_network
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows your web service to communicate with containers that are also connected to &lt;code&gt;my_external_network&lt;/code&gt;, even if they are not defined in the same Docker Compose project.&lt;/p&gt;
&lt;h3&gt;6. Exposing Ports&lt;/h3&gt;
&lt;p&gt;Docker Compose allows you to expose container ports to the host machine, making services accessible from outside the Docker network. This is typically done using the ports option:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    ports:
      - &amp;quot;8080:80&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, port &lt;code&gt;80&lt;/code&gt; on the web container is mapped to port &lt;code&gt;8080&lt;/code&gt; on the host. This makes the web service accessible via &lt;code&gt;http://localhost:8080&lt;/code&gt; on your machine.&lt;/p&gt;
&lt;h3&gt;7. Connecting Services Across Multiple Docker Compose Files&lt;/h3&gt;
&lt;p&gt;When dealing with multiple Compose files or projects, you might need to connect services across different networks. By using external networks, you can link services from different Docker Compose configurations together.&lt;/p&gt;
&lt;p&gt;For example, suppose you have two separate projects, each with its own docker-compose.yml file. You can create an external network and add both services to this network to allow cross-project communication.&lt;/p&gt;
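&lt;p&gt;As a sketch, assuming a network created beforehand with &lt;code&gt;docker network create shared_network&lt;/code&gt;, each project declares it as external (the service and image names here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;# docker-compose.yml in project A (project B declares the same network)
services:
  api:
    image: myapi:latest
    networks:
      - shared_network

networks:
  shared_network:
    external: true
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A service in project B attached to &lt;code&gt;shared_network&lt;/code&gt; the same way can then reach the api container by its service name.&lt;/p&gt;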
&lt;p&gt;Understanding Docker Compose networking is key to building scalable, secure, and efficient applications. Whether you are working with simple applications or complex microservices architectures, Compose makes networking straightforward and customizable, allowing you to tailor communication to your application&apos;s needs.&lt;/p&gt;
&lt;h2&gt;Using Pre-Built Images in Docker Compose&lt;/h2&gt;
&lt;p&gt;One of the major benefits of Docker is the availability of pre-built images for popular software, which you can easily pull from Docker Hub or other container registries. With Docker Compose, you can integrate these images into your &lt;code&gt;docker-compose.yml&lt;/code&gt; file, saving time and effort when setting up common services like databases, message brokers, or web servers.&lt;/p&gt;
&lt;h3&gt;1. Pulling and Using Existing Images&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;image&lt;/code&gt; option in Docker Compose allows you to specify a pre-built image. Docker will automatically pull the image if it&apos;s not available locally when you run &lt;code&gt;docker-compose up&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here’s an example of using a pre-built Nginx image:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:latest
    ports:
      - &amp;quot;8080:80&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, Docker Compose pulls the &lt;code&gt;nginx:latest&lt;/code&gt; image from Docker Hub and runs the container, exposing port &lt;code&gt;80&lt;/code&gt; on the container to port &lt;code&gt;8080&lt;/code&gt; on the host machine.&lt;/p&gt;
&lt;h3&gt;2. Specifying Image Versions&lt;/h3&gt;
&lt;p&gt;It’s important to specify a version tag when using pre-built images to avoid potential issues caused by changes in the latest version. For example, you might want to use a specific version of PostgreSQL:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  db:
    image: postgres:13
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, the image &lt;code&gt;postgres:13&lt;/code&gt; is pulled from Docker Hub, ensuring that version 13 of PostgreSQL is used, rather than the latest version which might introduce breaking changes.&lt;/p&gt;
&lt;h3&gt;3. Working with Private Images&lt;/h3&gt;
&lt;p&gt;Sometimes, you’ll need to pull images from private registries that require authentication. Docker Compose can handle private images by leveraging Docker&apos;s login mechanism. First, you need to log in to your private registry using the Docker CLI:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker login myprivateregistry.com
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, you can reference the private image in your &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    image: myprivateregistry.com/myapp:latest
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Docker Compose will automatically use your credentials from the Docker CLI to pull the image.&lt;/p&gt;
&lt;h3&gt;4. Using Image Variants&lt;/h3&gt;
&lt;p&gt;Some images come with different variants (e.g., alpine, slim, or buster), optimized for different use cases. For example, the nginx image has an alpine variant that is smaller in size, making it ideal for minimal setups.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:alpine
    ports:
      - &amp;quot;8080:80&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the &lt;code&gt;nginx:alpine&lt;/code&gt; image is pulled, which is a lightweight version of Nginx, reducing the overall size of the container and startup time.&lt;/p&gt;
&lt;h3&gt;5. Customizing Pre-Built Images&lt;/h3&gt;
&lt;p&gt;You may need to customize a pre-built image for your use case by adding configuration files, installing extra packages, or modifying the environment. You can still use a pre-built image as a base and customize it with a Dockerfile.&lt;/p&gt;
&lt;p&gt;For example, if you want to extend the official Node.js image to include additional packages:&lt;/p&gt;
&lt;p&gt;Dockerfile:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;FROM node:14

# Install additional packages
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y \
    python \
    build-essential

# Set the working directory
WORKDIR /app

# Copy your application files
COPY . /app

# Install dependencies
RUN npm install

# Start the application
CMD [&amp;quot;npm&amp;quot;, &amp;quot;start&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In your docker-compose.yml, use the build option to build this customized image:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    build: .
    ports:
      - &amp;quot;3000:3000&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Docker Compose will now build the custom image using the Dockerfile, while still benefiting from the official Node.js base image.&lt;/p&gt;
&lt;h3&gt;6. Combining Pre-Built Images with Custom Services&lt;/h3&gt;
&lt;p&gt;A typical setup might involve combining multiple pre-built images, such as a database or cache, alongside custom services that you build yourself.&lt;/p&gt;
&lt;p&gt;Here’s an example of a web service that uses a custom Dockerfile and a PostgreSQL database that uses a pre-built image:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    build: .
    ports:
      - &amp;quot;3000:3000&amp;quot;
    environment:
      DATABASE_URL: postgres://admin:secret@db:5432/mydb
    depends_on:
      - db

  db:
    image: postgres:13
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: secret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup, the db service uses a pre-built PostgreSQL image, while the app service is built using a custom Dockerfile.&lt;/p&gt;
&lt;h2&gt;Building Custom Images with Dockerfiles&lt;/h2&gt;
&lt;p&gt;While Docker Compose makes it easy to use pre-built images, sometimes you need more control over how your containers are built. This is where Dockerfiles come in. A Dockerfile is a script that contains instructions on how to build a Docker image from scratch or from a base image. By specifying a &lt;code&gt;Dockerfile&lt;/code&gt; in your Docker Compose setup, you can create custom images tailored to your application’s needs.&lt;/p&gt;
&lt;h3&gt;1. Overview of a Dockerfile&lt;/h3&gt;
&lt;p&gt;A Dockerfile consists of a series of commands and instructions that define what goes into your container image. The most common instructions include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;FROM&lt;/code&gt;: Specifies the base image you want to build from.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;COPY&lt;/code&gt; or &lt;code&gt;ADD&lt;/code&gt;: Copies files from your host machine into the container.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RUN&lt;/code&gt;: Executes commands inside the container to install software, set up the environment, etc.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CMD&lt;/code&gt; or &lt;code&gt;ENTRYPOINT&lt;/code&gt;: Defines the default command or executable that runs when the container starts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here’s an example of a basic &lt;code&gt;Dockerfile&lt;/code&gt; for a Node.js application:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;# Use Node.js official image as the base image
FROM node:14

# Set the working directory in the container
WORKDIR /app

# Copy package.json and install dependencies
COPY package.json ./
RUN npm install

# Copy the rest of the application code
COPY . .

# Expose the application port
EXPOSE 3000

# Start the application
CMD [&amp;quot;npm&amp;quot;, &amp;quot;start&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Using Dockerfiles in Docker Compose&lt;/h3&gt;
&lt;p&gt;In your docker-compose.yml, you can reference the Dockerfile using the build key. Docker Compose will build the custom image before running the services.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    build: .
    ports:
      - &amp;quot;3000:3000&amp;quot;
    volumes:
      - .:/app
    environment:
      NODE_ENV: development
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, Docker Compose will look for a Dockerfile in the same directory as the docker-compose.yml file, build the image, and then run the container. The . under build specifies the current directory as the build context, which includes the Dockerfile and the application files.&lt;/p&gt;
&lt;h3&gt;3. Specifying a Dockerfile Location&lt;/h3&gt;
&lt;p&gt;If your Dockerfile is located in a different directory, you can specify its path using the dockerfile option inside the build section.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    build:
      context: .
      dockerfile: ./docker/Dockerfile
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tells Docker Compose to use the Dockerfile located in the docker/ directory.&lt;/p&gt;
&lt;h3&gt;4. Customizing Images with Dockerfile Instructions&lt;/h3&gt;
&lt;p&gt;You can extend the functionality of your custom image by adding more commands to the Dockerfile. Here are some commonly used instructions:&lt;/p&gt;
&lt;p&gt;Installing Dependencies: Use RUN to install dependencies or run setup commands inside the container.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;RUN apt-get update &amp;amp;&amp;amp; apt-get install -y python3
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Environment Variables:&lt;/strong&gt; You can set environment variables inside the Dockerfile using ENV.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;ENV NODE_ENV production
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Exposing Ports:&lt;/strong&gt; Use EXPOSE to specify which port the application will listen on inside the container.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;EXPOSE 3000
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;5. Managing Build Caches with Multi-Stage Builds&lt;/h3&gt;
&lt;p&gt;Docker supports multi-stage builds, which allow you to optimize the size of your final image by including only the necessary components for production. This is especially useful for build-heavy applications like Java or Node.js, where you may need extra dependencies for development but not for the final production container.&lt;/p&gt;
&lt;p&gt;Here’s an example of a multi-stage build for a Go application:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;# Build stage
FROM golang:1.17 AS builder
WORKDIR /app
COPY . .
RUN go build -o myapp .

# Production stage
FROM alpine:3.15
WORKDIR /app
COPY --from=builder /app/myapp .
CMD [&amp;quot;./myapp&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the golang image is used for compiling the Go application, but the final container is based on the lightweight alpine image, making the production image much smaller.&lt;/p&gt;
&lt;h3&gt;6. Overriding the Default CMD or ENTRYPOINT&lt;/h3&gt;
&lt;p&gt;In some cases, you may want to override the default command or entrypoint defined in the Dockerfile. Docker Compose allows you to specify a custom command for a service using the command option:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    build: .
    ports:
      - &amp;quot;3000:3000&amp;quot;
    command: [&amp;quot;npm&amp;quot;, &amp;quot;run&amp;quot;, &amp;quot;custom-script&amp;quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This overrides the CMD defined in the Dockerfile and runs the custom script instead.&lt;/p&gt;
&lt;h3&gt;7. Rebuilding Images&lt;/h3&gt;
&lt;p&gt;When you make changes to the Dockerfile, you need to rebuild the image for those changes to take effect. You can force a rebuild by running:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up --build
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will recreate the images based on the updated Dockerfile and redeploy the services.&lt;/p&gt;
&lt;h3&gt;8. Using Build Arguments&lt;/h3&gt;
&lt;p&gt;You can pass build-time variables to your Dockerfile using build arguments (ARG). This is useful for passing values that are only needed during the build process (e.g., different configurations for development and production).&lt;/p&gt;
&lt;p&gt;Here’s how you define a build argument in a Dockerfile:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;ARG APP_ENV=development
RUN echo &amp;quot;Building for $APP_ENV&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And in your docker-compose.yml, you can pass the argument during the build process:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    build:
      context: .
      args:
        APP_ENV: production
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows you to customize the build process based on different environments.&lt;/p&gt;
&lt;h2&gt;Volumes and Persistent Storage in Docker Compose&lt;/h2&gt;
&lt;p&gt;When running containers, any data stored inside them is ephemeral, meaning it will be lost when the container is stopped or removed. To ensure data persistence, Docker Compose allows you to use volumes, which enable you to store data outside of the container’s lifecycle. Volumes are the preferred way to persist data, as they are managed by Docker and can be shared between containers.&lt;/p&gt;
&lt;h3&gt;1. What Are Docker Volumes?&lt;/h3&gt;
&lt;p&gt;Docker volumes provide a mechanism for storing data outside the container’s filesystem. This allows you to persist data even if a container is stopped, removed, or recreated. Volumes can be shared between multiple containers, making them useful for scenarios where multiple services need access to the same data.&lt;/p&gt;
&lt;h3&gt;2. Defining Volumes in Docker Compose&lt;/h3&gt;
&lt;p&gt;You can define volumes in the &lt;code&gt;docker-compose.yml&lt;/code&gt; file under the &lt;code&gt;volumes&lt;/code&gt; section. Volumes can be either named or anonymous. Named volumes have explicit names and can be reused across multiple services, while anonymous volumes are automatically generated by Docker and have no specific name.&lt;/p&gt;
&lt;p&gt;Here’s how to define a named volume:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;volumes:
  db_data:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we’ve defined a volume named db_data that can be shared between services.&lt;/p&gt;
&lt;h3&gt;3. Attaching Volumes to Services&lt;/h3&gt;
&lt;p&gt;Once you’ve defined a volume, you can attach it to a service using the volumes option under that service. This mounts the Docker-managed volume at a path inside the container.&lt;/p&gt;
&lt;p&gt;Here’s an example of attaching the &lt;code&gt;db_data&lt;/code&gt; volume to a PostgreSQL service to persist the database data:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  db:
    image: postgres:13
    volumes:
      - db_data:/var/lib/postgresql/data

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, the &lt;code&gt;db_data&lt;/code&gt; volume is mapped to &lt;code&gt;/var/lib/postgresql/data&lt;/code&gt; inside the container, ensuring that any data stored by PostgreSQL is saved outside the container.&lt;/p&gt;
&lt;h3&gt;4. Bind Mounts vs. Volumes&lt;/h3&gt;
&lt;p&gt;There are two ways to persist data in Docker: volumes and bind mounts. While volumes are managed by Docker and are the recommended approach, bind mounts allow you to directly map directories on your host machine to directories inside the container.&lt;/p&gt;
&lt;p&gt;Here’s how to use a bind mount in Docker Compose:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    image: myapp:latest
    volumes:
      - ./app:/usr/src/app
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the local directory ./app on the host is mounted to /usr/src/app inside the container. This is especially useful in development environments where you want to reflect changes in real-time.&lt;/p&gt;
&lt;h3&gt;5. Sharing Volumes Between Services&lt;/h3&gt;
&lt;p&gt;Sometimes, multiple services need access to the same data. Docker Compose allows you to share volumes between services, enabling them to collaborate on the same files or datasets.&lt;/p&gt;
&lt;p&gt;For example, here’s how you could share a volume between a web service and a background worker service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:latest
    volumes:
      - shared_data:/usr/share/nginx/html

  worker:
    image: myworker:latest
    volumes:
      - shared_data:/usr/src/app/data

volumes:
  shared_data:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup, both the web and worker services have access to the shared_data volume. The web service stores its static content in the volume, while the worker service reads or processes the same data.&lt;/p&gt;
&lt;h3&gt;6. Data Persistence for Databases&lt;/h3&gt;
&lt;p&gt;For databases, using volumes is crucial to ensure data is not lost when containers are stopped or removed. Here’s an example of a Docker Compose configuration for a MySQL service that persists data using a named volume:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  mysql:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: rootpass
      MYSQL_DATABASE: mydatabase
    volumes:
      - mysql_data:/var/lib/mysql

volumes:
  mysql_data:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup, the mysql_data volume is used to persist MySQL data in &lt;code&gt;/var/lib/mysql&lt;/code&gt;. This ensures that even if the mysql container is stopped or recreated, the database data remains intact.&lt;/p&gt;
&lt;h3&gt;7. Backing Up and Restoring Volumes&lt;/h3&gt;
&lt;p&gt;Since volumes are managed by Docker, you can easily back them up and restore them using Docker CLI commands. To back up a volume, you can create a new container that mounts the volume and copies its contents to a file on your host:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker run --rm -v db_data:/volume -v $(pwd):/backup busybox tar cvf /backup/db_data.tar /volume
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To restore the volume, simply reverse the process:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker run --rm -v db_data:/volume -v $(pwd):/backup busybox tar xvf /backup/db_data.tar -C /
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the archive stores its contents under a &lt;code&gt;volume/&lt;/code&gt; prefix, so extracting at &lt;code&gt;/&lt;/code&gt; restores the files back into &lt;code&gt;/volume&lt;/code&gt;.&lt;/p&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;8. Volume Drivers and Options&lt;/h3&gt;
&lt;p&gt;Docker allows you to use custom volume drivers for more advanced use cases. These drivers let you store data on remote storage systems, such as AWS, Google Cloud, or NFS. To specify a volume driver, you can use the driver option in the volumes section:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;volumes:
  my_custom_volume:
    driver: local
    driver_opts:
      type: nfs
      o: &amp;quot;addr=192.168.1.100,rw&amp;quot;
      device: &amp;quot;:/path/to/share&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example sets up an NFS volume, allowing your service to persist data on a remote NFS server.&lt;/p&gt;
&lt;h3&gt;9. Removing Volumes&lt;/h3&gt;
&lt;p&gt;Volumes are not automatically removed when you stop or remove a container. You need to explicitly remove volumes when they are no longer needed. To remove all unused volumes, you can run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker volume prune
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To remove a specific volume, use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker volume rm volume_name
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Volumes are a powerful feature in Docker Compose that allow you to persist and share data between containers. Whether you’re storing database data, sharing files across services, or mounting host directories, volumes provide a flexible and reliable way to manage data within your containerized applications.&lt;/p&gt;
&lt;h2&gt;Advanced Docker Compose Features&lt;/h2&gt;
&lt;p&gt;While Docker Compose simplifies multi-container setups, it also provides several advanced features that enhance control and efficiency in managing your application’s lifecycle. These features allow you to scale services, manage dependencies, ensure service health, and more. Let’s dive into some of these powerful tools that you can leverage in Docker Compose.&lt;/p&gt;
&lt;h3&gt;1. Scaling Services with Docker Compose&lt;/h3&gt;
&lt;p&gt;One of the most useful features of Docker Compose is the ability to scale your services horizontally. This means you can run multiple instances of a service to handle more load or ensure redundancy. Scaling is especially beneficial for stateless services, like web servers or worker processes.&lt;/p&gt;
&lt;p&gt;You can scale services by using the &lt;code&gt;--scale&lt;/code&gt; option with &lt;code&gt;docker-compose up&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up --scale web=3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will start 3 instances of the web service. To ensure proper load balancing between the scaled services, you may need to configure a load balancer (like NGINX) or rely on Docker&apos;s internal round-robin DNS resolution.&lt;/p&gt;
&lt;p&gt;Alternatively, you can define service replicas in your &lt;code&gt;docker-compose.yml&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:latest
    deploy:
      replicas: 3
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Service Dependencies with &lt;code&gt;depends_on&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;In many applications, certain services depend on others to be available before they can start. Docker Compose provides the depends_on option to express this relationship. This ensures that Docker starts the dependent services in the correct order.&lt;/p&gt;
&lt;p&gt;Here’s an example of a web service that depends on a database service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    depends_on:
      - db

  db:
    image: postgres
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, note that depends_on only controls the startup order; it does not wait for the dependent service to be &amp;quot;ready&amp;quot; (e.g., wait for the database to be accepting connections). For more robust dependency management, consider using health checks (covered below) or custom retry logic in your application.&lt;/p&gt;
&lt;h3&gt;3. Health Checks to Ensure Service Availability&lt;/h3&gt;
&lt;p&gt;Docker Compose allows you to define health checks to monitor the state of a service. A health check regularly runs a command inside the container, and Docker uses the result to determine if the container is healthy or not. You can configure health checks for your services using the healthcheck option.&lt;/p&gt;
&lt;p&gt;Here’s an example of adding a health check to a database service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  db:
    image: postgres
    healthcheck:
      test: [&amp;quot;CMD-SHELL&amp;quot;, &amp;quot;pg_isready -U postgres&amp;quot;]
      interval: 30s
      timeout: 10s
      retries: 5
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, Docker will check every 30 seconds whether the Postgres database is ready to accept connections. If the service fails the check 5 times, Docker marks the service as unhealthy.&lt;/p&gt;
&lt;p&gt;You can use this health status in combination with other services, ensuring that dependent services only start once the service they rely on is healthy.&lt;/p&gt;
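&lt;p&gt;Compose supports this through the long form of &lt;code&gt;depends_on&lt;/code&gt; with &lt;code&gt;condition: service_healthy&lt;/code&gt;, which delays starting a service until its dependency passes its health check. Combining it with the health check above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    depends_on:
      db:
        condition: service_healthy

  db:
    image: postgres
    healthcheck:
      test: [&amp;quot;CMD-SHELL&amp;quot;, &amp;quot;pg_isready -U postgres&amp;quot;]
      interval: 30s
      timeout: 10s
      retries: 5
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With this configuration, the web container is not started until Postgres is actually accepting connections, not merely until its container has been created.&lt;/p&gt;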
&lt;h3&gt;4. Managing Resource Constraints&lt;/h3&gt;
&lt;p&gt;Docker Compose allows you to control the resources (CPU and memory) allocated to each service. This is especially important when running multiple containers on the same host, as it helps prevent resource contention.&lt;/p&gt;
&lt;p&gt;Here’s how to define resource limits for a service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    deploy:
      resources:
        limits:
          cpus: &amp;quot;0.5&amp;quot;
          memory: &amp;quot;512M&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the web service is limited to using 50% of the CPU and 512MB of memory. You can also set reservation values to guarantee a certain amount of resources for a container.&lt;/p&gt;
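&lt;p&gt;For example, adding a &lt;code&gt;reservations&lt;/code&gt; block alongside &lt;code&gt;limits&lt;/code&gt; guarantees a baseline allocation for the service (the specific values here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    deploy:
      resources:
        limits:
          cpus: &amp;quot;0.5&amp;quot;
          memory: &amp;quot;512M&amp;quot;
        reservations:
          cpus: &amp;quot;0.25&amp;quot;
          memory: &amp;quot;256M&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here the container is guaranteed a quarter of a CPU core and 256MB of memory, while never being allowed to exceed half a core and 512MB.&lt;/p&gt;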
&lt;h3&gt;5. Using restart Policies for Service Resilience&lt;/h3&gt;
&lt;p&gt;To ensure that your services are automatically restarted in case of failure, you can define a restart policy. Docker Compose provides several options for managing how and when containers should be restarted:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;no:&lt;/strong&gt; Do not automatically restart the container (default).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;always:&lt;/strong&gt; Always restart the container if it stops.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;on-failure:&lt;/strong&gt; Only restart if the container exits with a non-zero code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;unless-stopped:&lt;/strong&gt; Restart unless the container is explicitly stopped.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here’s an example of using a restart policy for a web service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    restart: always
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, the web service will always be restarted if it crashes or is stopped unintentionally.&lt;/p&gt;
&lt;h3&gt;6. Using External Configuration Files&lt;/h3&gt;
&lt;p&gt;For complex environments, it&apos;s often necessary to manage multiple configurations for different deployment stages (e.g., development, testing, production). Docker Compose allows you to extend or override base configurations using multiple Compose files.&lt;/p&gt;
&lt;p&gt;Here’s how you can use multiple Compose files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose -f docker-compose.yml -f docker-compose.prod.yml up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the docker-compose.prod.yml file extends or overrides configurations from the base docker-compose.yml file, allowing you to customize settings for production.&lt;/p&gt;
&lt;h3&gt;7. Environment-Specific Overrides with Profiles&lt;/h3&gt;
&lt;p&gt;Docker Compose introduced the concept of profiles, allowing you to selectively enable or disable services depending on the environment. Profiles allow you to define which services should run in specific environments (e.g., production vs. development).&lt;/p&gt;
&lt;p&gt;Here’s how to define a profile in your docker-compose.yml:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx
    profiles:
      - production

  debug:
    image: busybox
    profiles:
      - debug
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can specify the profile to use when running Docker Compose:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose --profile production up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, only the web service will be started, as it belongs to the production profile.&lt;/p&gt;
&lt;h3&gt;8. Using Docker Secrets for Secure Data Management&lt;/h3&gt;
&lt;p&gt;For handling sensitive data like passwords, API keys, or certificates, Docker provides a secure way to manage secrets. In Docker Compose, secrets are securely stored outside of the container and injected at runtime.&lt;/p&gt;
&lt;p&gt;Here’s an example of using Docker secrets in a Compose file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    image: myapp:latest
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the secret &lt;code&gt;db_password&lt;/code&gt; is stored in an external file and made available to the app service. Docker Compose automatically ensures that the secret is only accessible to the service that needs it.&lt;/p&gt;
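&lt;p&gt;At runtime, Docker mounts each secret as a read-only file under &lt;code&gt;/run/secrets/&lt;/code&gt; inside the container, so the application reads the value from that file rather than from an environment variable:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;cat /run/secrets/db_password
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This keeps the value out of &lt;code&gt;docker inspect&lt;/code&gt; output and process environments, which is the main advantage over plain environment variables.&lt;/p&gt;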
&lt;h2&gt;Best Practices for Docker Compose Files&lt;/h2&gt;
&lt;p&gt;As your application grows, your &lt;code&gt;docker-compose.yml&lt;/code&gt; file can become more complex. Following best practices can help you maintain clean, readable, and scalable configurations, making it easier to manage and deploy your applications. Below are some key best practices to keep in mind when working with Docker Compose files.&lt;/p&gt;
&lt;h3&gt;1. Use Environment Variables for Configuration&lt;/h3&gt;
&lt;p&gt;Hardcoding values like database passwords, API keys, and service configuration in your &lt;code&gt;docker-compose.yml&lt;/code&gt; file can lead to security risks and reduced flexibility. Instead, use environment variables to manage configuration, particularly when deploying to different environments (e.g., development, testing, production).&lt;/p&gt;
&lt;p&gt;Here’s an example of using environment variables in your Compose file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  app:
    image: myapp:latest
    environment:
      - DATABASE_URL=${DATABASE_URL}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And in your &lt;code&gt;.env&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;DATABASE_URL=postgres://admin:secret@db:5432/mydb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This setup ensures that sensitive information is not hardcoded, and you can easily switch configurations by modifying the .env file.&lt;/p&gt;
&lt;h3&gt;2. Split Configuration into Multiple Files&lt;/h3&gt;
&lt;p&gt;For larger applications, managing everything in a single docker-compose.yml file can become cumbersome. A good practice is to split configurations into multiple files, each targeting a specific environment or use case. You can then combine these files when running your services.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;docker-compose.yml:&lt;/strong&gt; Base configuration.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;docker-compose.override.yml:&lt;/strong&gt; Development-specific overrides.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;docker-compose.prod.yml:&lt;/strong&gt; Production-specific settings.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can run them together using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose -f docker-compose.yml -f docker-compose.prod.yml up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This approach keeps your configurations organized and easier to manage.&lt;/p&gt;
&lt;h3&gt;3. Use Named Volumes for Data Persistence&lt;/h3&gt;
&lt;p&gt;Always use named volumes instead of anonymous volumes to persist data across container restarts and ensure proper management. Named volumes are easier to reference and maintain throughout the lifecycle of your application.&lt;/p&gt;
&lt;p&gt;Here’s how to define a named volume:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  db:
    image: postgres
    volumes:
      - db_data:/var/lib/postgresql/data

volumes:
  db_data:
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Named volumes also make it simpler to perform backups or migrate data between environments.&lt;/p&gt;
&lt;h3&gt;4. Limit Container Resource Usage&lt;/h3&gt;
&lt;p&gt;To prevent containers from consuming excessive resources, it’s important to set resource limits for CPU and memory. This is particularly important when running multiple services on a single machine or when deploying in a production environment.&lt;/p&gt;
&lt;p&gt;Here’s how to define resource limits for a service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  web:
    image: nginx:latest
    deploy:
      resources:
        limits:
          cpus: &amp;quot;0.5&amp;quot;
          memory: &amp;quot;512M&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that the web service only consumes half a CPU core and 512MB of memory, avoiding resource contention.&lt;/p&gt;
&lt;h3&gt;5. Avoid Running Unnecessary Services in Production&lt;/h3&gt;
&lt;p&gt;During development, you might have services that are only necessary for debugging or testing purposes (e.g., admin panels, mock services). In production, these services can introduce security risks and unnecessary overhead. Use Docker Compose profiles or multiple Compose files to control which services are included in specific environments.&lt;/p&gt;
&lt;p&gt;For example, you can define a development-only service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  debug:
    image: busybox
    command: sleep 1000
    profiles:
      - debug
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A service assigned to a profile only starts when that profile is explicitly activated, so in production you simply run Compose without it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Development: include the debug service
docker-compose --profile debug up

# Production: the debug service is omitted automatically
docker-compose up
&lt;/code&gt;&lt;/pre&gt;
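&lt;p&gt;The multiple-Compose-file approach works similarly: later files override or extend earlier ones. The filenames below are just a common convention:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Development: base configuration plus dev-only overrides
docker-compose -f docker-compose.yml -f docker-compose.dev.yml up

# Production: base configuration plus production overrides
docker-compose -f docker-compose.yml -f docker-compose.prod.yml up -d
&lt;/code&gt;&lt;/pre&gt;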
&lt;h3&gt;6. Use Build Caching to Speed Up Development&lt;/h3&gt;
&lt;p&gt;When building custom Docker images, take advantage of Docker’s layer caching by ordering the steps in your Dockerfile efficiently. For example, install dependencies that don’t change frequently first, and copy application code afterward. This minimizes the number of rebuilds needed during development.&lt;/p&gt;
&lt;p&gt;Here’s an example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-dockerfile&quot;&gt;FROM node:20

WORKDIR /app

# Install dependencies first (this layer is cached unless the package files change)
COPY package.json package-lock.json ./
RUN npm install

# Then copy the rest of the application code
COPY . .
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This ensures that if you only modify your application code, Docker can reuse the cached layers for dependency installation, speeding up the build process.&lt;/p&gt;
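&lt;p&gt;Because copying the full build context invalidates the cache whenever any file changes, it also helps to exclude files that do not belong in the image with a &lt;code&gt;.dockerignore&lt;/code&gt; file. A minimal example for a Node.js project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;node_modules
npm-debug.log
.git
&lt;/code&gt;&lt;/pre&gt;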
&lt;h3&gt;7. Use Version Control for Docker Compose Files&lt;/h3&gt;
&lt;p&gt;Treat your docker-compose.yml file as part of your source code. Use version control systems like Git to track changes, collaborate with others, and roll back configurations if necessary. This is especially important in team environments where multiple people are working on the same project.&lt;/p&gt;
&lt;p&gt;Additionally, use clear commit messages and meaningful branch names when modifying your Docker Compose configurations, so it’s easier to track changes over time.&lt;/p&gt;
&lt;h3&gt;8. Keep Secrets Secure&lt;/h3&gt;
&lt;p&gt;Avoid storing sensitive information, such as database passwords or API keys, directly in your &lt;code&gt;docker-compose.yml&lt;/code&gt; file or environment variables. Instead, use Docker secrets for sensitive data in production environments.&lt;/p&gt;
&lt;p&gt;Here’s an example of using Docker secrets in your Compose file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  db:
    image: postgres
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This way, secrets are managed securely and are only accessible to the service that needs them.&lt;/p&gt;
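&lt;p&gt;Inside the container, each secret is mounted as a file under &lt;code&gt;/run/secrets/&lt;/code&gt;. The official postgres image can read its password from such a file through its &lt;code&gt;*_FILE&lt;/code&gt; environment variable convention, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;services:
  db:
    image: postgres
    environment:
      # Read the password from the mounted secret file instead of a plain variable
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password

secrets:
  db_password:
    file: ./secrets/db_password.txt
&lt;/code&gt;&lt;/pre&gt;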
&lt;h3&gt;9. Use docker-compose config to Validate Files&lt;/h3&gt;
&lt;p&gt;Before running your Docker Compose setup, it’s a good idea to validate your &lt;code&gt;docker-compose.yml&lt;/code&gt; file. The &lt;code&gt;docker-compose config&lt;/code&gt; command helps you ensure that your configuration is correct and free of syntax errors.&lt;/p&gt;
&lt;p&gt;Run the following command to validate your Compose file:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose config
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will print the merged configuration, showing any syntax errors or misconfigurations.&lt;/p&gt;
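&lt;p&gt;For CI pipelines, the &lt;code&gt;--quiet&lt;/code&gt; flag suppresses the merged output and signals validity through the exit code alone:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Validate silently; a non-zero exit code means the file is invalid
docker-compose config --quiet &amp;amp;&amp;amp; echo OK
&lt;/code&gt;&lt;/pre&gt;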
&lt;h3&gt;10. Clean Up Unused Resources Regularly&lt;/h3&gt;
&lt;p&gt;Over time, unused containers, images, and volumes can accumulate, consuming disk space and memory. Make it a habit to clean up these unused resources regularly to keep your system lean.&lt;/p&gt;
&lt;p&gt;To remove unused containers, images, and volumes, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker system prune
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To remove unused volumes specifically, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker volume prune
&lt;/code&gt;&lt;/pre&gt;
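&lt;p&gt;By default, &lt;code&gt;docker system prune&lt;/code&gt; leaves volumes and tagged images alone. Broader flags exist but should be used with care, since they delete anything not referenced by a running container:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Also remove unused (not just dangling) images and unused volumes
docker system prune -a --volumes
&lt;/code&gt;&lt;/pre&gt;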
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Docker Compose is an incredibly powerful tool for managing multi-container applications, offering a simplified way to configure, deploy, and scale your services. Through this deep dive, we’ve explored the core aspects of Docker Compose, from basic file structures and service configurations to advanced features like networking, environment variables, volumes, and custom Dockerfiles. Along the way, we’ve also covered best practices to help ensure your Compose setup is scalable, maintainable, and secure.&lt;/p&gt;
&lt;h3&gt;Key Takeaways:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Compose File Structure&lt;/strong&gt;: Understand the basic components of a &lt;code&gt;docker-compose.yml&lt;/code&gt; file, including services, networks, and volumes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Service Configuration&lt;/strong&gt;: Learn how to define services using pre-built images or custom Dockerfiles, manage environment variables, and control service dependencies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Networking&lt;/strong&gt;: Docker Compose simplifies internal service communication through default and custom networks, making service discovery and network isolation easier.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Persistent Data&lt;/strong&gt;: Use volumes to persist and share data across containers, ensuring critical data is not lost between restarts.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Advanced Features&lt;/strong&gt;: Leverage advanced features like scaling, health checks, resource constraints, and restart policies to ensure your application is resilient and efficient.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Best Practices&lt;/strong&gt;: Keep your Docker Compose files clean, modular, and secure by using environment variables, named volumes, resource limits, and proper version control.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Whether you&apos;re managing a small development environment or deploying a complex production system, Docker Compose provides the flexibility and control needed to efficiently run and scale containerized applications. With this guide, you now have the knowledge and tools to fully leverage Docker Compose in your projects, ensuring a smoother, more organized workflow.&lt;/p&gt;
&lt;h3&gt;Next Steps:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Experiment with Docker Compose in different environments (development, testing, production).&lt;/li&gt;
&lt;li&gt;Explore more advanced Docker Compose features, such as integration with Docker Swarm for orchestration.&lt;/li&gt;
&lt;li&gt;Continuously refine your Compose files by following best practices and adopting new Docker features as they are released.&lt;/li&gt;
&lt;li&gt;Consider diving deeper into container orchestration systems like Kubernetes for larger-scale deployments.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Why Data Analysts, Engineers, Architects and Scientists Should Care about Dremio and Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2024-9-Why-Dremio-Iceberg-Matters/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-9-Why-Dremio-Iceberg-Matters/</guid><description>
- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external...</description><pubDate>Tue, 10 Sep 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=whypros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=whypros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Data architecture is an ever-evolving landscape. Over the years, we&apos;ve witnessed the shift from on-premises data warehouses to on-premises data lakes, then to cloud-based data warehouses and lakes. Now we&apos;re seeing a growing trend toward hybrid infrastructure. One thing is clear: change is inevitable. That&apos;s why it&apos;s crucial to have a flexible architecture, allowing you to embrace future innovations without overhauling your entire data ecosystem.&lt;/p&gt;
&lt;p&gt;In this article, I’ll explore why data professionals—whether you&apos;re a data analyst, engineer, architect, or scientist—should care about technologies like Apache Iceberg and Dremio. I&apos;ll explain how these tools can simplify your workflow while maintaining the flexibility you need.&lt;/p&gt;
&lt;h2&gt;What is Dremio?&lt;/h2&gt;
&lt;p&gt;Dremio is a Lakehouse Platform designed to help you unlock the full potential of your existing data lake by embracing three key architectural trends: the data lakehouse, data mesh, and data virtualization.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Virtualization&lt;/strong&gt;: Dremio’s SQL query engine, built on Apache Arrow and other performance innovations, enables you to seamlessly federate queries across databases, data warehouses, data lakes, and data lakehouse catalogs, including Iceberg and Delta Lake tables. With industry-leading performance, Dremio provides a practical and highly effective tool for data virtualization.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Mesh&lt;/strong&gt;: Dremio’s Semantic Layer empowers you to model, collaborate, and govern your data across all sources from a single location. This robust feature allows you to create virtual data marts or data products that adhere to data mesh principles, facilitating better collaboration and governance across teams.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lakehouse Architecture&lt;/strong&gt;: Dremio’s Lakehouse capabilities fully support reading and writing to Apache Iceberg tables. With Dremio&apos;s Enterprise Lakehouse Catalog, you can enable Git-like isolation for workloads, create zero-copy environments for experimentation and development, and automate the optimization of your Iceberg tables. This ensures they are both performant and storage-efficient, transforming your data lake into a fully functional data warehouse—essentially, a data lakehouse.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Beyond enabling you to maximize the value and accessibility of your data, Dremio offers flexibility in deployment, whether on-premises or in the cloud. It can also access data from both environments, delivering unmatched flexibility and data unification.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=whypros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Get Hands-on with Dremio For Free From Your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What is Apache Iceberg?&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is a table format that brings data warehouse-like functionality to your data lake by utilizing Apache Parquet files. Iceberg acts as a metadata layer around groups of Parquet files, offering three key capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consistent Table Definition&lt;/strong&gt;: Iceberg ensures a consistent definition of what files are part of the table, providing stability and reliability in managing large datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficient Data Scanning&lt;/strong&gt;: It provides statistics on the table that function as an index, enabling efficient and optimized scans of the table for faster query performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Advanced Data Warehouse Features&lt;/strong&gt;: Iceberg supports essential data warehouse features like ACID guarantees and schema evolution, along with unique capabilities like partition evolution and hidden partitioning. These features make partitioning easier to use for both data engineers and data analysts.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By enabling your data lake to function as a data warehouse, Apache Iceberg, when paired with a Lakehouse platform like Dremio, allows you to efficiently manage your Iceberg tables while unifying them with other data sources across databases, data lakes, and data warehouses.&lt;/p&gt;
&lt;h2&gt;Why Data Engineers Should Care&lt;/h2&gt;
&lt;p&gt;Data engineers face various daily challenges when dealing with complex data ecosystems. These challenges often stem from data silos, governance issues, and managing long chains of pipelines. Here are some of the most common pain points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Silos&lt;/strong&gt;: Different teams or departments often store their data in separate systems, such as databases, cloud storage, or on-prem data lakes. This fragmentation creates data silos, making it difficult to integrate and unify data across the organization. Data engineers spend significant time building and maintaining pipelines to access, transform, and consolidate this data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Governance&lt;/strong&gt;: Ensuring proper data governance is another ongoing challenge. Data engineers must ensure compliance with data access, security, lineage, and privacy policies across a diverse set of data sources. Without a unified approach, enforcing consistent data governance can be a cumbersome and error-prone process.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Complex Pipelines&lt;/strong&gt;: Managing long and complex data pipelines often involves many interdependent steps, from data extraction to transformation and loading (ETL). These pipelines are fragile, difficult to maintain, and prone to errors when upstream changes occur, causing bottlenecks in data delivery and forcing engineers to spend time on troubleshooting rather than innovation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How Apache Iceberg and Dremio Alleviate These Challenges&lt;/h3&gt;
&lt;p&gt;Apache Iceberg and Dremio provide a powerful combination that addresses these challenges with modern, scalable solutions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unifying Data Silos&lt;/strong&gt;: Dremio&apos;s high-performance data virtualization capabilities enable seamless querying across multiple data sources—whether it&apos;s a data lake, database, or cloud storage—without the need for complex pipelines. This allows data engineers to access and integrate data more efficiently, reducing the friction of working across data silos.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Governance Simplified&lt;/strong&gt;: With Dremio&apos;s Semantic Layer, data engineers can model, secure, and govern data from a single interface, ensuring consistent governance across all sources. Iceberg&apos;s metadata layer also tracks schema changes, partitions, and file statistics, making managing and auditing data lineage and compliance easier.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streamlining Pipeline Complexity&lt;/strong&gt;: Dremio&apos;s reflections feature reduces the need for many steps in traditional data pipelines by enabling automatic optimization and caching of views. This eliminates the need for complex ETL processes and materialized views, allowing data engineers to focus on delivering insights faster. Meanwhile, Apache Iceberg allows pipelines to end directly in your data lake, removing the need to move data into a separate data warehouse. This simplifies data architecture and cuts down on unnecessary data movement, while still providing powerful data warehouse-like features directly in the lake.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improved Performance&lt;/strong&gt;: Both Dremio and Iceberg are optimized for performance. Dremio&apos;s SQL engine, built on Apache Arrow, allows for fast queries across large datasets, while Iceberg&apos;s advanced indexing and partitioning features reduce the time spent scanning tables, making querying more efficient.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging Dremio and Apache Iceberg, data engineers can spend less time troubleshooting and managing infrastructure, and more time driving innovation and delivering value to the business.&lt;/p&gt;
&lt;h2&gt;Why Data Architects Should Care&lt;/h2&gt;
&lt;h3&gt;Streamlining Data Architect Challenges with Dremio and Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Data Architects are critical in designing and maintaining scalable, efficient, and future-proof data platforms. Their primary challenges often include managing the complexity of data infrastructure, controlling costs, and ensuring that the platform is easy for teams across the organization to adopt. Here’s how Dremio and Apache Iceberg help overcome these challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reducing Complexity and Maintenance&lt;/strong&gt;: Designing and maintaining a data platform often involves integrating multiple systems—data lakes, data warehouses, and ETL pipelines—which increases complexity and operational overhead. Dremio simplifies this by providing a unified platform that can query data from various sources without needing to move it. Coupled with Apache Iceberg’s ability to serve as the foundation of a data lakehouse, architects can significantly reduce the need for costly and time-consuming data migrations. Iceberg’s ACID guarantees and schema evolution make it easy to manage and govern data, keeping the platform adaptable without adding maintenance burdens.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lowering Costs&lt;/strong&gt;: Traditional data architectures require significant resources to store, move, and process data across different systems. By leveraging Apache Iceberg’s table format directly in the data lake and combining it with Dremio’s query acceleration features like reflections, you can minimize data duplication and avoid runaway data warehousing bills. This leads to lower storage and compute costs while still delivering fast, efficient queries across large datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Maximizing Adoption and Value&lt;/strong&gt;: A well-designed data platform must be user-friendly to maximize adoption by analysts, data scientists, and other teams. Dremio’s easy-to-use SQL-based interface and semantic layer make it simple for teams to access and explore data without needing deep technical expertise. By providing a self-service experience, Dremio empowers teams to derive value from the platform quickly, reducing the reliance on IT or engineering teams and driving greater overall usage.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With Dremio and Apache Iceberg, data architects can build a scalable, low-maintenance platform that delivers high performance at a lower cost, while ensuring that it’s accessible and valuable to the entire organization.&lt;/p&gt;
&lt;h2&gt;Why Does It Matter for Data Analysts?&lt;/h2&gt;
&lt;p&gt;Data analysts often face several challenges in their day-to-day work, including navigating access to various data systems, waiting on data engineering teams for minor modeling updates, and dealing with the redundancy of different teams redefining the same metrics across multiple BI tools. These inefficiencies slow down analysis and limit the ability to deliver timely insights. Here&apos;s how Dremio and Apache Iceberg can help overcome these hurdles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Seamless Data Access&lt;/strong&gt;: Analysts frequently struggle with accessing data spread across different databases, data lakes, and warehouses, often relying on data engineers to provide access or create custom queries. Dremio simplifies this process by enabling direct, self-service access to data from multiple sources through a single, easy-to-use SQL interface. Analysts can query data in real time without waiting on access requests or dealing with various systems, whether the data is stored in a data lake, database, or cloud storage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Faster Modeling Updates&lt;/strong&gt;: Making even minor changes to data models often involves opening tickets and waiting on data engineers to update pipelines or reformat datasets. With Dremio’s semantic layer, analysts can model and define relationships across datasets directly within the platform. This eliminates the need to wait on engineering for minor changes, allowing analysts to iterate faster and stay agile when business requirements evolve.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consistency Across Metrics&lt;/strong&gt;: A common pain point for analysts is the inconsistency of metrics definitions across different BI tools and teams. This redundancy leads to conflicting reports and wasted time reconciling metrics. Dremio centralizes metric definitions through its semantic layer, ensuring that all teams access a single source of truth. This reduces the need for redefining metrics across different tools and ensures consistency in analysis and reporting across the organization.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging Dremio’s self-service capabilities and Apache Iceberg’s ability to manage large datasets directly in the data lake, analysts gain faster access to data, more control over data modeling, and a unified platform that ensures consistent metrics, leading to quicker, more reliable insights.&lt;/p&gt;
&lt;h2&gt;Why Does It Matter for Data Scientists?&lt;/h2&gt;
&lt;h3&gt;Enhancing Data Science Workflows with Dremio and Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Data scientists face unique challenges when working with large, complex datasets across various platforms. They often struggle with data accessibility, managing ever-growing data volumes, and ensuring reproducibility and version control in their workflows. Lakehouse platforms like Dremio, combined with table formats like Apache Iceberg, offer powerful solutions to these challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Simplified Data Access and Exploration&lt;/strong&gt;: One of the biggest pain points for data scientists is gaining access to diverse data sources, often stored across different silos such as databases, data lakes, and cloud platforms. This makes data discovery and exploration cumbersome and time-consuming. Dremio’s Lakehouse Platform provides unified, self-service access to all your data, regardless of where it’s stored, through a single interface. With Dremio, data scientists can seamlessly query, analyze, and experiment with large datasets without navigating multiple systems or relying on engineering teams for access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Scalable Data Management&lt;/strong&gt;: As datasets grow larger and more complex, managing them in a traditional data warehouse setup becomes costly and inefficient. Apache Iceberg allows data scientists to work directly with large datasets in the data lake, eliminating the need to move data into a separate warehouse for analysis. Iceberg’s scalable table format enables efficient handling of large volumes of data while providing advanced features like hidden partitioning and ACID guarantees, ensuring that data scientists can focus on building models without worrying about performance bottlenecks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reproducibility and Experimentation&lt;/strong&gt;: Ensuring reproducibility of experiments and models is critical in data science but can be challenging when data is constantly changing. Apache Iceberg’s versioning and time-travel capabilities allow data scientists to access and work with specific snapshots of the data, ensuring that experiments can be reproduced at any point in time. Dremio’s zero-copy cloning and Git-like data management features enable data scientists to create isolated, experimental environments without duplicating data, streamlining the workflow for testing models and performing “what-if” analyses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Collaboration and Consistency&lt;/strong&gt;: Data scientists often work closely with data engineers and analysts, and inconsistent access to data or version control can hinder collaboration. Dremio’s semantic layer provides a consistent and shared view of the data, allowing all teams to work from the same definitions and datasets, reducing inconsistencies in models and analysis. This leads to better collaboration across the organization and more reliable insights from models.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By leveraging Dremio’s Lakehouse Platform and Apache Iceberg tables, data scientists can streamline their workflows, gain faster access to critical data, ensure reproducibility, and scale their experiments more effectively, all while minimizing the complexity and overhead typically associated with large-scale data science projects.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Data professionals across the board—whether you&apos;re a data engineer, architect, analyst, or scientist—face the common challenges of navigating complex data systems, maintaining performance, and ensuring scalability. As the data landscape evolves, adopting technologies that provide flexibility, reduce overhead, and improve accessibility is crucial.&lt;/p&gt;
&lt;p&gt;Dremio and Apache Iceberg offer powerful solutions that enable you to manage your data with greater efficiency, scalability, and performance. With Dremio&apos;s Lakehouse Platform and Iceberg&apos;s table format, you can unify your data silos, streamline pipelines, and access real-time insights—all while lowering costs and minimizing maintenance.&lt;/p&gt;
&lt;p&gt;If you&apos;re looking to build a future-proof data architecture that meets the needs of your entire organization, embracing a Lakehouse approach with Dremio and Apache Iceberg will empower your teams to make better, faster decisions while keeping data governance and management simple.&lt;/p&gt;
&lt;h2&gt;Resources to Learn More about Iceberg&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=whypros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=whypros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro to Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=whypros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=whypros&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Hands-on with Apache Iceberg on Your Laptop - Deep Dive with Apache Spark, Nessie, Minio, Dremio, Polars and Seaborn</title><link>https://iceberglakehouse.com/posts/2024-9-hands-on-iceberg-dremio-minio-nessie/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-9-hands-on-iceberg-dremio-minio-nessie/</guid><description>
- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external...</description><pubDate>Tue, 10 Sep 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introiceberg&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introiceberg&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Table of Contents&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;#what-is-a-data-lakehouse&quot;&gt;What is a Data Lakehouse?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#data-lakehouse-technologies&quot;&gt;Data Lakehouse Technologies&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#setting-up-the-environment-with-docker-compose&quot;&gt;Setting Up the Environment with Docker Compose&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#verifying-that-the-services-are-running&quot;&gt;Verifying Services are Running&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#ingesting-data-into-iceberg-with-apache-spark&quot;&gt;Ingesting Data Into Iceberg with Apache Spark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#verifying-iceberg-data-and-metadata-in-minio&quot;&gt;Verifying Iceberg Metadata Stored in Minio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#confirming-nessie-is-tracking-the-iceberg-table-with-curl-commands&quot;&gt;Confirming Nessie is Tracking Tables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#connecting-nessie-and-minio-as-sources-in-dremio&quot;&gt;Connecting Dremio to Minio and Nessie&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#accessing-data-in-dremio-bi-tool-integrations-rest-api-jdbcodbc-and-apache-arrow-flight&quot;&gt;Access Data From Dremio in BI Tools and Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;#conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg and the Data Lakehouse architecture have garnered significant attention in the data landscape. Technologies such as Dremio, Nessie, and Minio play a vital role in enabling the Lakehouse paradigm, offering powerful tools for data management and analytics. In this blog, we&apos;ll explore the concept of the Lakehouse, introduce the key technologies that make it possible, and provide a hands-on guide to building your own Data Lakehouse environment right on your laptop. This will allow you to experience firsthand how these tools work together to revolutionize data storage and processing.&lt;/p&gt;
&lt;h2&gt;What is a Data Lakehouse?&lt;/h2&gt;
&lt;p&gt;Traditional database and data warehouse systems typically bundle together several core components, such as storage, table format, cataloging, and query processing, into a single tool. While this approach is convenient, it comes with challenges. Each system implements these features differently, which can lead to issues when scaling, transferring data across platforms, or achieving seamless interoperability. As organizations grow and evolve, these limitations become more apparent, especially in terms of flexibility and performance.&lt;/p&gt;
&lt;p&gt;Data Lakes, on the other hand, serve as a centralized repository where all types of data land in various forms, from structured to unstructured. Given this role, it makes sense to leverage the data lake as the storage foundation while decoupling the other functions—table metadata, cataloging, and query processing—into separate, specialized tools. This decoupled architecture forms the essence of the Data Lakehouse. It combines the flexibility and scalability of a Data Lake with the management features of a Data Warehouse, providing a unified solution for handling large-scale data storage, organization, and analytics.&lt;/p&gt;
&lt;h2&gt;Data Lakehouse Technologies&lt;/h2&gt;
&lt;p&gt;A Data Lakehouse is powered by a combination of tools and technologies that work together to enable efficient storage, cataloging, and analytics. Below are key technologies that make up a modern data lakehouse:&lt;/p&gt;
&lt;h3&gt;Minio&lt;/h3&gt;
&lt;p&gt;Minio is a high-performance object storage solution that can act as your data lake, whether in the cloud or on-premises. Object storage stores data in flexible, scalable units called objects, which are well-suited for unstructured and structured data alike. Minio offers several notable features, including S3-compatible APIs, strong security, and efficient performance across different environments. Its ability to seamlessly store large amounts of data makes it ideal for serving as the foundation of a data lake in both hybrid cloud and on-prem environments.&lt;/p&gt;
&lt;h3&gt;Apache Parquet&lt;/h3&gt;
&lt;p&gt;Apache Parquet is a specialized file format designed to store structured data for analytics at scale. What makes Parquet stand out is its columnar storage format, which is optimized for read-heavy analytical workloads. By organizing data by columns instead of rows, Parquet enables more efficient queries, especially when only a subset of columns is required. Additionally, its ability to compress data effectively reduces storage costs while speeding up query performance, making it a go-to format for modern data lakehouses.&lt;/p&gt;
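&lt;p&gt;To make the columnar idea concrete, here is a small, dependency-free Python sketch. It is illustrative only: it mimics the layout principle behind Parquet, not the Parquet format itself, and the product data is made up for the example.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Illustrative only: contrasts row-oriented vs. columnar layout,
# the core idea behind Parquet (this is not the Parquet format itself).
rows = [
    {&apos;id&apos;: 1, &apos;product&apos;: &apos;widget&apos;, &apos;price&apos;: 9.99},
    {&apos;id&apos;: 2, &apos;product&apos;: &apos;gadget&apos;, &apos;price&apos;: 19.99},
    {&apos;id&apos;: 3, &apos;product&apos;: &apos;widget&apos;, &apos;price&apos;: 9.99},
]

# Row-oriented: fetching one column still scans every full record.
prices_row_scan = [r[&apos;price&apos;] for r in rows]

# Column-oriented: each column is stored contiguously, so a query that
# needs only &apos;price&apos; reads just that list, and runs of repeated values
# compress well -- one reason Parquet files stay small.
columns = {
    &apos;id&apos;: [1, 2, 3],
    &apos;product&apos;: [&apos;widget&apos;, &apos;gadget&apos;, &apos;widget&apos;],
    &apos;price&apos;: [9.99, 19.99, 9.99],
}
prices_column_scan = columns[&apos;price&apos;]

print(prices_row_scan == prices_column_scan)  # True
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both layouts hold the same data; the difference is how much of it you must touch to answer a column-only query.&lt;/p&gt;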
&lt;h3&gt;Apache Iceberg&lt;/h3&gt;
&lt;p&gt;Apache Iceberg introduces a new standard for handling large datasets in data lakes by offering a structured table format with built-in ACID guarantees. Iceberg organizes Parquet files into tables, allowing data lakes to be managed like traditional data warehouses with features such as schema evolution, time travel, and partitioning. Iceberg solves many of the challenges associated with using data lakes for complex analytics, enabling tables to evolve without downtime and providing reliable query performance across massive datasets.&lt;/p&gt;
&lt;h3&gt;Nessie&lt;/h3&gt;
&lt;p&gt;Nessie is an open-source catalog designed to track Apache Iceberg tables, making it easy for different tools to interact with and understand the structure of your data lake. What sets Nessie apart is its catalog versioning features, which allow users to manage different versions of their table schemas over time, supporting experimentation and rollback capabilities. Nessie ensures that the state of your data lake is well-documented and consistently accessible across tools like Dremio and Apache Spark.&lt;/p&gt;
&lt;h3&gt;Apache Spark&lt;/h3&gt;
&lt;p&gt;Apache Spark is a powerful data processing framework that plays a crucial role in moving data between different sources and performing transformations at scale. Its distributed nature allows it to process large datasets quickly, and it integrates well with other data lakehouse technologies like Iceberg and Parquet. Whether you&apos;re loading data into your lakehouse or transforming it for analysis, Spark provides the muscle to handle these operations efficiently.&lt;/p&gt;
&lt;h3&gt;Dremio&lt;/h3&gt;
&lt;p&gt;Dremio is a lakehouse platform designed to connect all your data in one place, enabling SQL-based analytics directly on your data lake. Dremio provides native support for Apache Iceberg tables, allowing you to work with them as if they were traditional database tables, but without the constraints of a data warehouse. It also offers features such as reflections for query acceleration and a user-friendly interface for running complex queries across diverse datasets. With Dremio, data analysts can easily query large datasets stored in object storage, leveraging Iceberg&apos;s powerful table features without needing to move data into a warehouse.&lt;/p&gt;
&lt;h2&gt;Setting Up the Environment with Docker Compose&lt;/h2&gt;
&lt;p&gt;Docker is a platform that allows you to package and run applications in isolated containers, ensuring that they run consistently across different environments. Containers bundle an application with all of its dependencies, making it easy to manage and deploy complex setups. Docker Compose is a tool specifically designed to handle multi-container applications, enabling you to define and run multiple services using a simple configuration file. By using Docker Compose, we can quickly set up an environment with all the necessary components for our data lakehouse—such as Spark, Dremio, Minio, and Nessie—without worrying about manual installation or configuration. This approach not only saves time but ensures that our setup is portable and easy to replicate across different systems.&lt;/p&gt;
&lt;h3&gt;Understanding the Docker Compose File&lt;/h3&gt;
&lt;p&gt;A Docker Compose file is a YAML configuration file that defines how to run multiple containers in a single environment. The standard name for this file is &lt;code&gt;docker-compose.yml&lt;/code&gt;, and it allows you to describe the services, networks, and volumes required for your setup in a straightforward and structured way. This file centralizes the configuration, so you can manage all the different parts of your environment in one place.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Services&lt;/strong&gt;: Services are the individual containers that will be running. Each service represents a specific component of your environment, such as a database, object storage, or an analytics engine. In the Compose file, you define each service’s configuration, including the image it will run, ports to expose, and any environment variables needed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Networks&lt;/strong&gt;: Networks allow the different services to communicate with each other. By default, Docker Compose creates a network so all the containers can connect seamlessly. You can also define custom networks to further control how services interact with each other, which is especially useful in complex setups where you want to limit certain containers&apos; access to one another.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Volumes&lt;/strong&gt;: Volumes are used for persisting data beyond the lifecycle of a container. When a container is removed, its data is lost unless it&apos;s saved to a volume. In a Docker Compose file, volumes are defined to store and share data between containers, making sure that critical information like database contents or file storage remains intact even when containers are restarted or recreated.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
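&lt;p&gt;Putting these three concepts together, a minimal skeleton of a Compose file looks like the following (the service name, image, and paths here are placeholders for illustration; the full file we use in this guide is shown further below):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &amp;quot;3&amp;quot;

services:                    # the containers to run
  example-service:
    image: some/image:latest # placeholder image name
    ports:
      - &amp;quot;8080:8080&amp;quot;          # host:container port mapping
    volumes:
      - ./local-folder:/data # persist data on the host machine
    networks:
      demo-network:

networks:                    # shared network for service-to-service traffic
  demo-network:
&lt;/code&gt;&lt;/pre&gt;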
&lt;h3&gt;Cloning the Demo Environment Repository&lt;/h3&gt;
&lt;p&gt;For this exercise, we will be working with a pre-built environment template hosted on GitHub: &lt;a href=&quot;https://github.com/AlexMercedCoder/dremio-spark-nessie-demo-environment-template&quot;&gt;AlexMercedCoder/dremio-spark-nessie-demo-environment-template&lt;/a&gt;. This is a template repository, meaning you can generate your own copy of the repository to modify and use for your needs. To get started, go to the repository page, and click the &amp;quot;Use this template&amp;quot; button at the top to create a new repository under your own GitHub account. Once you&apos;ve created your template copy, you can clone it to your local machine using the &lt;code&gt;git clone&lt;/code&gt; command. This will allow you to have the full environment set up locally for hands-on experimentation.&lt;/p&gt;
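&lt;p&gt;Concretely, once you have generated your own repository from the template, the clone step looks like this (replace &lt;code&gt;&amp;lt;your-username&amp;gt;&lt;/code&gt; with your GitHub account name):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# Clone your copy of the template repository and move into it
git clone https://github.com/&amp;lt;your-username&amp;gt;/dremio-spark-nessie-demo-environment-template.git
cd dremio-spark-nessie-demo-environment-template
&lt;/code&gt;&lt;/pre&gt;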
&lt;h3&gt;Our Docker Compose File&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &amp;quot;3&amp;quot;

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    environment:
      - QUARKUS_PROFILE=prod
      - QUARKUS_HTTP_PORT=19120
      - QUARKUS_LOG_CONSOLE_FORMAT=%d{yyyy-MM-dd HH:mm:ss} %-5p [%c{1.}] (%t) %s%e%n
      - QUARKUS_LOG_LEVEL=INFO
      - QUARKUS_DATASOURCE_DB_KIND=rocksdb
      - QUARKUS_DATASOURCE_JDBC_URL=jdbc:rocksdb:file:///nessie/data
      - QUARKUS_DATASOURCE_USERNAME=nessie
      - QUARKUS_DATASOURCE_PASSWORD=nessie
    volumes:
      - ./nessie-data:/nessie/data  # Mount local directory to persist RocksDB data
    ports:
      - &amp;quot;19120:19120&amp;quot;  # Expose Nessie API port
    networks:
      intro-network:
  # Minio Storage Server
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
      - MINIO_REGION_NAME=us-east-1
      - MINIO_REGION=us-east-1
    ports:
      - &amp;quot;9000:9000&amp;quot;
      - &amp;quot;9001:9001&amp;quot;
    healthcheck:
      test: [&amp;quot;CMD&amp;quot;, &amp;quot;curl&amp;quot;, &amp;quot;-f&amp;quot;, &amp;quot;http://localhost:9000/minio/health/live&amp;quot;]
      interval: 30s
      timeout: 20s
      retries: 3
    volumes:
      - ./minio-data:/minio-data  # Mount the local folder to container
    entrypoint: &amp;gt;
      /bin/sh -c &amp;quot;
      minio server /data --console-address &apos;:9001&apos; &amp;amp;
      sleep 5;
      mc alias set myminio http://localhost:9000 admin password;
      mc mb myminio/datalake;
      mc mb myminio/datalakehouse;
      mc mb myminio/warehouse;
      mc mb myminio/seed;
      mc cp /minio-data/* myminio/seed/;
      tail -f /dev/null&amp;quot;
    networks:
      intro-network:
  
  # Spark
  spark:
    platform: linux/x86_64
    image: alexmerced/spark35nb:latest
    ports: 
      - 8080:8080    # Master Web UI
      - 7077:7077    # Master Port for job submissions
      - 8081:8081    # Worker Web UI
      - 4040-4045:4040-4045  # Additional Spark job UI ports for more jobs
      - 18080:18080  # Spark History Server
      - 8888:8888    # Jupyter Notebook
    environment:
      - AWS_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=admin  # Minio username
      - AWS_SECRET_ACCESS_KEY=password  # Minio password
      - SPARK_MASTER_HOST=spark
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
      - SPARK_WORKER_WEBUI_PORT=8081
      - SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=/tmp/spark-events
      - SPARK_HOME=/opt/spark  # Set SPARK_HOME explicitly
    volumes:
      - ./notebook-seed:/workspace/seed-data  # Volume for seeding data into the container
    container_name: spark
    entrypoint: &amp;gt;
      /bin/bash -c &amp;quot;
      /opt/spark/sbin/start-master.sh &amp;amp;&amp;amp; \
      /opt/spark/sbin/start-worker.sh spark://$(hostname):7077 &amp;amp;&amp;amp; \
      mkdir -p /tmp/spark-events &amp;amp;&amp;amp; \
      start-history-server.sh &amp;amp;&amp;amp; \
      jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token=&apos;&apos; --NotebookApp.password=&apos;&apos; &amp;amp;&amp;amp; \
      tail -f /dev/null
      &amp;quot;
    networks:
      intro-network:

  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010
      - 45678:45678
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      intro-network:

networks:
  intro-network:
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Understanding the Docker Compose File in Depth&lt;/h3&gt;
&lt;p&gt;This Docker Compose file sets up a multi-service environment that integrates Apache Iceberg, Nessie, Minio, Spark, and Dremio, providing a full hands-on data lakehouse experience. Each service is configured with specific parameters to ensure smooth interoperability between components, and it includes functionality to seed data into Minio and the Spark notebooks. Let&apos;s dive deeper into the configuration of each service and how it fits into the broader architecture.&lt;/p&gt;
&lt;h4&gt;1. &lt;strong&gt;Nessie&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Nessie is our catalog service, responsible for tracking the metadata of Apache Iceberg tables. The configuration for Nessie includes several important settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Image&lt;/strong&gt;: &lt;code&gt;projectnessie/nessie:latest&lt;/code&gt; pulls the latest Nessie image from Docker Hub.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment Variables&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;QUARKUS_PROFILE=prod&lt;/code&gt;: Sets the profile to production mode.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;QUARKUS_HTTP_PORT=19120&lt;/code&gt;: Configures Nessie&apos;s HTTP API to be exposed on port 19120.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;QUARKUS_DATASOURCE_DB_KIND=rocksdb&lt;/code&gt;: Nessie uses RocksDB as its internal database to store catalog information.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;QUARKUS_DATASOURCE_JDBC_URL=jdbc:rocksdb:file:///nessie/data&lt;/code&gt;: Defines the storage location for the RocksDB database, which is mounted as a volume to persist data across container restarts.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Volumes&lt;/strong&gt;: The local directory &lt;code&gt;./nessie-data&lt;/code&gt; is mounted to &lt;code&gt;/nessie/data&lt;/code&gt; inside the container, ensuring that catalog information is stored persistently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ports&lt;/strong&gt;: Port &lt;code&gt;19120&lt;/code&gt; is exposed, allowing external tools (like Dremio and Spark) to access Nessie&apos;s API for catalog management.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Networks&lt;/strong&gt;: Nessie is part of the &lt;code&gt;intro-network&lt;/code&gt;, enabling it to communicate with other services in the Compose setup.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This setup ensures that Iceberg table metadata is cataloged and persists across container lifecycles, allowing tools to query and manage the Iceberg tables effectively.&lt;/p&gt;
&lt;h4&gt;2. &lt;strong&gt;Minio&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Minio serves as the object storage system in this setup, simulating an S3-like environment to act as the data lake for our Apache Iceberg tables.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Image&lt;/strong&gt;: &lt;code&gt;minio/minio&lt;/code&gt; pulls the latest Minio server image.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment Variables&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;MINIO_ROOT_USER=admin&lt;/code&gt; and &lt;code&gt;MINIO_ROOT_PASSWORD=password&lt;/code&gt;: These define the credentials for accessing the Minio service.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MINIO_DOMAIN=minio&lt;/code&gt; and &lt;code&gt;MINIO_REGION_NAME=us-east-1&lt;/code&gt;: Sets up the Minio domain and region, simulating a cloud-based object storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ports&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;9000:9000&lt;/code&gt;: Exposes Minio’s S3-compatible API on port 9000.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;9001:9001&lt;/code&gt;: Exposes Minio’s web console on port 9001, allowing you to manage your storage via a web interface.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Healthcheck&lt;/strong&gt;: Ensures that Minio is healthy by testing its liveness endpoint (&lt;code&gt;http://localhost:9000/minio/health/live&lt;/code&gt;), and retries if needed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Volumes&lt;/strong&gt;: The local &lt;code&gt;./minio-data&lt;/code&gt; directory is mounted into the container as &lt;code&gt;/minio-data&lt;/code&gt;. This allows you to seed data into the Minio server by placing files in the &lt;code&gt;./minio-data&lt;/code&gt; folder.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Entrypoint&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Minio&apos;s entrypoint script initializes the object storage and creates several buckets (&lt;code&gt;datalake&lt;/code&gt;, &lt;code&gt;datalakehouse&lt;/code&gt;, &lt;code&gt;warehouse&lt;/code&gt;, &lt;code&gt;seed&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;mc cp /minio-data/* myminio/seed/&lt;/code&gt; command uploads all data from the &lt;code&gt;./minio-data&lt;/code&gt; directory into the &lt;code&gt;seed&lt;/code&gt; bucket in Minio. This provides a straightforward way to seed datasets into your object storage for later use.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Minio’s configuration makes it the core storage layer of the data lakehouse, and by automatically seeding data into it, we streamline the process of making datasets available for analytics.&lt;/p&gt;
&lt;h4&gt;3. &lt;strong&gt;Spark&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Spark is the processing engine for the environment, handling data transformations and moving data between sources. It also includes a Jupyter notebook for interactive data processing.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Image&lt;/strong&gt;: &lt;code&gt;alexmerced/spark35nb:latest&lt;/code&gt; pulls a custom Spark image that includes Jupyter for notebook-based processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ports&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8080:8080&lt;/code&gt; and &lt;code&gt;8081:8081&lt;/code&gt;: These expose the Spark Master and Worker web UIs, allowing you to monitor job submissions and worker performance.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7077&lt;/code&gt;: This is the Spark Master port used for job submissions.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;8888&lt;/code&gt;: This exposes the Jupyter notebook interface, making it easy to run Spark jobs interactively in a notebook environment.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;4040-4045&lt;/code&gt;: These ports are reserved for Spark&apos;s job UIs, which provide detailed information about running jobs.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;18080&lt;/code&gt;: Exposes the Spark History Server, where you can review past jobs and their execution metrics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment Variables&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AWS_ACCESS_KEY_ID=admin&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY=password&lt;/code&gt;: These credentials allow Spark to access Minio&apos;s object storage as if it were S3.&lt;/li&gt;
&lt;li&gt;Other Spark-specific environment variables ensure that Spark runs as a distributed system and can connect to the Minio object store.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Volumes&lt;/strong&gt;: The local directory &lt;code&gt;./notebook-seed&lt;/code&gt; is mounted to &lt;code&gt;/workspace/seed-data&lt;/code&gt; inside the container. This volume contains any data that you want to pre-load into the Spark environment, making it accessible within the Jupyter notebooks for processing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Entrypoint&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;The entrypoint script starts the Spark Master, Worker, and History Server. It also launches Jupyter Lab, providing an interactive environment to run Spark jobs and experiments.&lt;/li&gt;
&lt;li&gt;The script ensures that the Spark processing engine is always running, ready to handle tasks, and that the notebook interface is accessible.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This Spark setup allows you to run interactive notebooks, process large datasets, and leverage Minio for data storage.&lt;/p&gt;
&lt;h4&gt;4. &lt;strong&gt;Dremio&lt;/strong&gt;&lt;/h4&gt;
&lt;p&gt;Dremio is the analytics layer of this environment, allowing you to perform SQL-based queries on your data lakehouse. It connects seamlessly to both Minio and Nessie to provide a smooth experience for querying Iceberg tables stored in the object storage.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Image&lt;/strong&gt;: &lt;code&gt;dremio/dremio-oss:latest&lt;/code&gt; pulls the latest open-source version of Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ports&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;9047&lt;/code&gt;: Exposes the Dremio web interface, where users can query datasets and manage the environment.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;31010&lt;/code&gt;, &lt;code&gt;32010&lt;/code&gt;, &lt;code&gt;45678&lt;/code&gt;: These ports are used for Dremio’s internal services, handling query execution and communication between Dremio components. (31010 for JDBC, 32010 for Arrow Flight)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment Variables&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist&lt;/code&gt;: This sets the internal paths for Dremio to ensure it runs correctly in the Docker environment.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Networks&lt;/strong&gt;: Dremio is connected to the &lt;code&gt;intro-network&lt;/code&gt;, enabling it to interact with Nessie and Minio for querying Iceberg tables and accessing object storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio’s role in this setup is to serve as the query engine for your data lakehouse, allowing you to perform high-performance SQL queries on data stored in Minio and managed by Nessie.&lt;/p&gt;
&lt;h3&gt;Seeding Data into Minio and Spark Notebooks&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Minio&lt;/strong&gt;: The &lt;code&gt;./minio-data&lt;/code&gt; folder on your local machine is mounted to the container and used to seed data into Minio. When the container starts, the &lt;code&gt;mc cp&lt;/code&gt; command uploads any files in this directory to the &lt;code&gt;seed&lt;/code&gt; bucket in Minio. This makes your datasets immediately available for querying or processing without needing to manually upload files after the environment is up.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Spark Notebooks&lt;/strong&gt;: Similarly, the &lt;code&gt;./notebook-seed&lt;/code&gt; directory is mounted into the Spark container at &lt;code&gt;/workspace/seed-data&lt;/code&gt;. This allows any data placed in the &lt;code&gt;./notebook-seed&lt;/code&gt; folder to be available within the Jupyter notebook environment, making it easy to start analyzing or transforming data right away.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Spinning Up and Down the Environment with Docker Compose&lt;/h3&gt;
&lt;p&gt;Once your Docker Compose file is configured, you can easily spin up and down the entire environment using simple Docker Compose commands. This process launches all the necessary services—Nessie, Minio, Spark, and Dremio—allowing them to work together as a cohesive data lakehouse environment.&lt;/p&gt;
&lt;h4&gt;Spinning Up the Environment&lt;/h4&gt;
&lt;p&gt;To start the environment, navigate to the directory containing your &lt;code&gt;docker-compose.yml&lt;/code&gt; file, then run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will pull the necessary images (if they aren’t already on your machine) and start the services defined in the docker-compose.yml file. Each service runs in the foreground with its logs streaming to your terminal, and all the specified ports, networks, and volumes are properly configured.&lt;/p&gt;
&lt;p&gt;For additional control over how the environment is started, you can use the following flags:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-d&lt;/code&gt; (Detached Mode): This runs the environment in the background, allowing you to continue using your terminal.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In detached mode, you won&apos;t see the logs in the terminal, but the services will continue running in the background.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--build&lt;/code&gt;: Use this flag to force a rebuild of the images, which is helpful if you&apos;ve made changes to the Dockerfiles or configurations.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up --build
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--force-recreate&lt;/code&gt;: If you want to ensure that all containers are recreated (even if their configurations haven&apos;t changed), you can use this flag.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up --force-recreate
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Spinning Down the Environment&lt;/h4&gt;
&lt;p&gt;To stop and remove all running services, use the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose down
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This stops the services and removes the associated containers and networks. Volumes are left in place unless you explicitly remove them, so your data will still be preserved in the bind-mounted host directories (such as &lt;code&gt;./nessie-data&lt;/code&gt; and &lt;code&gt;./minio-data&lt;/code&gt;), and any changes made to your data (such as in Minio or Nessie) will remain intact the next time you spin up the environment.&lt;/p&gt;
&lt;p&gt;You can also use the following flags with docker-compose down:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--volumes&lt;/code&gt;: This flag will also remove any named volumes declared in the Compose file. Use this if you want to completely clean up the environment, including any data persisted in named volumes. Note that bind-mounted host directories (like &lt;code&gt;./nessie-data&lt;/code&gt;) are not deleted by this flag; remove those folders manually if you want a fully clean slate.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose down --volumes
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--remove-orphans&lt;/code&gt;: If there are any containers running from previous Compose configurations that aren&apos;t defined in the current file, this flag will remove them.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose down --remove-orphans
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Checking the Status of the Environment&lt;/h4&gt;
&lt;p&gt;You can check the status of the running services by using:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose ps
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command will show the state of each service (whether it’s up or down) and the ports they are mapped to.&lt;/p&gt;
&lt;h4&gt;Viewing Logs&lt;/h4&gt;
&lt;p&gt;If you want to view the logs of your services while they are running, use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose logs
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will output logs for all services. To view logs for a specific service (for example, dremio), use:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose logs dremio
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows you to monitor the activity of your environment and troubleshoot any issues that arise.&lt;/p&gt;
&lt;p&gt;By using these commands and flags, you can easily manage the lifecycle of your environment, spinning it up for testing or development and shutting it down when you&apos;re done, while maintaining control over data persistence and configurations.&lt;/p&gt;
&lt;h2&gt;Verifying That the Services Are Running&lt;/h2&gt;
&lt;p&gt;Once you&apos;ve spun up the environment using Docker Compose, it&apos;s important to check that all the services are running correctly. Below are the steps to ensure that each service is functioning as expected.&lt;/p&gt;
&lt;h3&gt;1. Accessing the Jupyter Notebook Server&lt;/h3&gt;
&lt;p&gt;Spark is configured to run a Jupyter Notebook interface for interactive data processing. To confirm that the notebook server is running, open your browser and navigate to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://localhost:8888
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You should see the JupyterLab interface. Since we configured it without a password, you will have immediate access. Inside the workspace, navigate to the &lt;code&gt;/workspace/seed-data&lt;/code&gt; folder, where the seeded datasets are available. You can now create a new notebook or open an existing one to interact with the data. Keep track of where your notebooks are created, since relative file paths are resolved from the notebook&apos;s own location when you access other files.&lt;/p&gt;
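&lt;p&gt;If relative paths become confusing, anchoring them in code avoids surprises. A small sketch using only the standard library (the &lt;code&gt;/workspace/seed-data&lt;/code&gt; path comes from the volume mount in the Compose file; the CSV filename is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from pathlib import Path

# Absolute anchor for the seeded datasets, independent of where the
# notebook itself was created (the path comes from the Compose volume mount).
SEED_DIR = Path(&apos;/workspace/seed-data&apos;)

# Build paths from the anchor rather than the notebook&apos;s working directory.
sales_file = SEED_DIR / &apos;sales.csv&apos;  # placeholder filename

print(sales_file)  # /workspace/seed-data/sales.csv
&lt;/code&gt;&lt;/pre&gt;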
&lt;h3&gt;2. Accessing Dremio and Setting Up Your User Information&lt;/h3&gt;
&lt;p&gt;Dremio provides the web interface for querying your data lakehouse. Open your browser and navigate to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://localhost:9047
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On your first visit, Dremio will prompt you to create an admin user. Follow the steps to set up your user information, such as username, password, and email. Afterward, you’ll land on the Dremio dashboard. From here, you can start configuring Dremio to connect to Nessie and Minio (covered later), and explore the data in your lakehouse through SQL-based queries.&lt;/p&gt;
&lt;h3&gt;3. Accessing Minio and Verifying Buckets&lt;/h3&gt;
&lt;p&gt;To check that Minio is running and has the correct buckets, visit the Minio console by navigating to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://localhost:9001
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Log in using the credentials defined in the Docker Compose file (&lt;code&gt;admin&lt;/code&gt; for the username and &lt;code&gt;password&lt;/code&gt; for the password). Once logged in, you should see the following buckets already created:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;datalake&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;datalakehouse&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;warehouse&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;seed&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These buckets were set up automatically when the Minio service started. The &lt;code&gt;seed&lt;/code&gt; bucket should contain any data that was placed in the &lt;code&gt;./minio-data&lt;/code&gt; directory. Verify that the buckets exist, and ensure that your data has been successfully uploaded.&lt;/p&gt;
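&lt;p&gt;You can also verify the buckets programmatically. Here is a sketch using &lt;code&gt;boto3&lt;/code&gt; (an assumption: it requires &lt;code&gt;boto3&lt;/code&gt; to be installed and the environment to be running; the endpoint and credentials come from the Compose file):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Expected buckets created by the Minio entrypoint script
EXPECTED_BUCKETS = {&apos;datalake&apos;, &apos;datalakehouse&apos;, &apos;warehouse&apos;, &apos;seed&apos;}

def missing_buckets(names):
    # Return which of the expected buckets are absent from &apos;names&apos;.
    return EXPECTED_BUCKETS - set(names)

try:
    import boto3  # assumes boto3 is installed (pip install boto3)
    # Point the S3 client at Minio instead of AWS (values from docker-compose.yml)
    s3 = boto3.client(
        &apos;s3&apos;,
        endpoint_url=&apos;http://localhost:9000&apos;,
        aws_access_key_id=&apos;admin&apos;,
        aws_secret_access_key=&apos;password&apos;,
        region_name=&apos;us-east-1&apos;,
    )
    names = [b[&apos;Name&apos;] for b in s3.list_buckets()[&apos;Buckets&apos;]]
    print(missing_buckets(names) or &apos;all expected buckets present&apos;)
except Exception as exc:
    print(f&apos;could not verify buckets: {exc}&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A missing bucket usually means the entrypoint script&apos;s &lt;code&gt;mc mb&lt;/code&gt; commands did not finish; check the Minio container logs in that case.&lt;/p&gt;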
&lt;h3&gt;4. Verifying Nessie with a Basic Curl Request&lt;/h3&gt;
&lt;p&gt;To confirm that the Nessie catalog service is running correctly, you can make a simple &lt;code&gt;curl&lt;/code&gt; request to its API. Open a terminal and run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET http://localhost:19120/api/v1/trees
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command queries the list of available &amp;quot;trees&amp;quot; (branches or tags) in the Nessie catalog. If Nessie is running properly, you should receive a JSON response that includes information about the default branch, typically called &lt;code&gt;main&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Example response:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;references&amp;quot;: [
    {
      &amp;quot;type&amp;quot;: &amp;quot;BRANCH&amp;quot;,
      &amp;quot;name&amp;quot;: &amp;quot;main&amp;quot;,
      &amp;quot;hash&amp;quot;: &amp;quot;...&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This confirms that Nessie is responding and ready to track the Iceberg tables you will create.&lt;/p&gt;
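&lt;p&gt;The same check can be scripted. A small standard-library sketch that pulls the default branch name out of a &lt;code&gt;/trees&lt;/code&gt; response (the sample payload and its hash value are illustrative; the parsing tolerates both the wrapped and bare response shapes, since the exact shape varies across Nessie versions):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import json

def default_branch(payload):
    # Return the default branch name from a Nessie /trees response.
    # Handles both a bare reference object and the &apos;references&apos; list
    # wrapper, since the exact shape varies across Nessie versions.
    refs = payload.get(&apos;references&apos;, [payload])
    for ref in refs:
        if ref.get(&apos;type&apos;) == &apos;BRANCH&apos;:
            return ref[&apos;name&apos;]
    return None

# Example payload shaped like the curl response above (hash is a placeholder)
sample = json.loads(&apos;{&amp;quot;references&amp;quot;: [{&amp;quot;type&amp;quot;: &amp;quot;BRANCH&amp;quot;, &amp;quot;name&amp;quot;: &amp;quot;main&amp;quot;, &amp;quot;hash&amp;quot;: &amp;quot;abc123&amp;quot;}]}&apos;)
print(default_branch(sample))  # main
&lt;/code&gt;&lt;/pre&gt;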
&lt;p&gt;By completing these steps, you can ensure that all the services—Jupyter, Dremio, Minio, and Nessie—are running smoothly and are ready for use in your data lakehouse environment.&lt;/p&gt;
&lt;h2&gt;Ingesting Data into Iceberg with Apache Spark&lt;/h2&gt;
&lt;p&gt;In this section, we will create a PySpark script that simulates messy sales data and stores it in an Apache Iceberg table managed by the Nessie catalog. The data will contain duplicates and other issues, which we will address later in Dremio. Additionally, since Minio is used for object storage, we will need to inspect the Minio container to get its IP address to configure our storage correctly.&lt;/p&gt;
&lt;h3&gt;Step 1: Inspect the Minio Container for IP Address&lt;/h3&gt;
&lt;p&gt;Before we start writing our PySpark script, we need the Minio service’s IP address to access object storage properly (Docker&apos;s internal DNS doesn&apos;t always resolve the &lt;code&gt;minio&lt;/code&gt; container&apos;s hostname as expected from Spark). Run the following command in your terminal to inspect the Minio container:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker inspect minio
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Look for the &amp;quot;IPAddress&amp;quot; field in the network settings. Once you find the IP address, note it down as you’ll use it to configure your storage URI.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;&amp;quot;IPAddress&amp;quot;: &amp;quot;172.18.0.2&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We&apos;ll use this IP (&lt;code&gt;172.18.0.2&lt;/code&gt; in this walkthrough; yours will likely differ) for the Minio URI in our PySpark script.&lt;/p&gt;
&lt;h3&gt;Step 2: PySpark Script to Create Iceberg Table with Messy Sales Data&lt;/h3&gt;
&lt;p&gt;Below is the PySpark code to create a DataFrame with some messy sales data and write it to an Apache Iceberg table stored in Minio and managed by the Nessie catalog.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
import os

## DEFINE SENSITIVE VARIABLES
CATALOG_URI = &amp;quot;http://nessie:19120/api/v1&amp;quot;  # Nessie Server URI
WAREHOUSE = &amp;quot;s3://warehouse/&amp;quot;               # Minio Address to Write to
STORAGE_URI = &amp;quot;http://172.18.0.2:9000&amp;quot;      # Minio IP address from docker inspect

# Configure Spark with necessary packages and Iceberg/Nessie settings
conf = (
    pyspark.SparkConf()
        .setAppName(&apos;sales_data_app&apos;)
        # Include necessary packages
        .set(&apos;spark.jars.packages&apos;, &apos;org.postgresql:postgresql:42.7.3,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.77.1,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8&apos;)
        # Enable Iceberg and Nessie extensions
        .set(&apos;spark.sql.extensions&apos;, &apos;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions&apos;)
        # Configure Nessie catalog
        .set(&apos;spark.sql.catalog.nessie&apos;, &apos;org.apache.iceberg.spark.SparkCatalog&apos;)
        .set(&apos;spark.sql.catalog.nessie.uri&apos;, CATALOG_URI)
        .set(&apos;spark.sql.catalog.nessie.ref&apos;, &apos;main&apos;)
        .set(&apos;spark.sql.catalog.nessie.authentication.type&apos;, &apos;NONE&apos;)
        .set(&apos;spark.sql.catalog.nessie.catalog-impl&apos;, &apos;org.apache.iceberg.nessie.NessieCatalog&apos;)
        # Set Minio as the S3 endpoint for Iceberg storage
        .set(&apos;spark.sql.catalog.nessie.s3.endpoint&apos;, STORAGE_URI)
        .set(&apos;spark.sql.catalog.nessie.warehouse&apos;, WAREHOUSE)
        .set(&apos;spark.sql.catalog.nessie.io-impl&apos;, &apos;org.apache.iceberg.aws.s3.S3FileIO&apos;)
)

# Start Spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(&amp;quot;Spark Session Started&amp;quot;)

# Define a schema for the sales data
schema = StructType([
    StructField(&amp;quot;order_id&amp;quot;, IntegerType(), True),
    StructField(&amp;quot;customer_id&amp;quot;, IntegerType(), True),
    StructField(&amp;quot;product&amp;quot;, StringType(), True),
    StructField(&amp;quot;quantity&amp;quot;, IntegerType(), True),
    StructField(&amp;quot;price&amp;quot;, DoubleType(), True),
    StructField(&amp;quot;order_date&amp;quot;, StringType(), True)
])

# Create a DataFrame with messy sales data (including duplicates and errors)
sales_data = [
    (1, 101, &amp;quot;Laptop&amp;quot;, 1, 1000.00, &amp;quot;2023-08-01&amp;quot;),
    (2, 102, &amp;quot;Mouse&amp;quot;, 2, 25.50, &amp;quot;2023-08-01&amp;quot;),
    (3, 103, &amp;quot;Keyboard&amp;quot;, 1, 45.00, &amp;quot;2023-08-01&amp;quot;),
    (1, 101, &amp;quot;Laptop&amp;quot;, 1, 1000.00, &amp;quot;2023-08-01&amp;quot;),  # Duplicate
    (4, 104, &amp;quot;Monitor&amp;quot;, None, 200.00, &amp;quot;2023-08-02&amp;quot;),  # Missing quantity
    (5, None, &amp;quot;Mouse&amp;quot;, 1, 25.50, &amp;quot;2023-08-02&amp;quot;)  # Missing customer_id
]

# Convert the data into a DataFrame
sales_df = spark.createDataFrame(sales_data, schema)

# Create the &amp;quot;sales&amp;quot; namespace
spark.sql(&amp;quot;CREATE NAMESPACE IF NOT EXISTS nessie.sales;&amp;quot;).show()

# Write the DataFrame to an Iceberg table in the Nessie catalog
sales_df.writeTo(&amp;quot;nessie.sales.sales_data_raw&amp;quot;).createOrReplace()

# Verify by reading from the Iceberg table
spark.read.table(&amp;quot;nessie.sales.sales_data_raw&amp;quot;).show()

# Stop the Spark session
spark.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the script, either in a new Jupyter notebook cell or by saving it as a &lt;code&gt;.py&lt;/code&gt; file and executing it with Python. The first run may take a few minutes while Spark downloads the packages listed in &lt;code&gt;spark.jars.packages&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Walkthrough of the PySpark Code for Creating an Iceberg Table with Nessie Catalog&lt;/h3&gt;
&lt;p&gt;This PySpark script creates a DataFrame of messy sales data and writes it to an Apache Iceberg table managed by the Nessie catalog. Let&apos;s break down the syntax and purpose of each part of the code:&lt;/p&gt;
&lt;h4&gt;1. Importing Required Libraries&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
import os
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We start by importing the necessary libraries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pyspark&lt;/code&gt;: The core PySpark library for working with Spark in Python.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SparkSession&lt;/code&gt;: Used to configure and initialize the Spark session.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;StructType&lt;/code&gt;, &lt;code&gt;StructField&lt;/code&gt;, and the data types (&lt;code&gt;IntegerType&lt;/code&gt;, &lt;code&gt;StringType&lt;/code&gt;, etc.): Used to define the schema of our DataFrame.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;os&lt;/code&gt;: Standard Python library for interacting with the operating system, although not used in this script.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;2. Defining Sensitive Variables&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;CATALOG_URI = &amp;quot;http://nessie:19120/api/v1&amp;quot;  # Nessie Server URI
WAREHOUSE = &amp;quot;s3://warehouse/&amp;quot;               # Minio Address to Write to
STORAGE_URI = &amp;quot;http://172.18.0.2:9000&amp;quot;      # Minio IP address from docker inspect (yours may differ)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we define a few key variables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CATALOG_URI&lt;/code&gt;: The URL for the Nessie catalog API, which is hosted on port 19120.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WAREHOUSE&lt;/code&gt;: The S3-like address that points to the Minio storage location, where Iceberg tables will be stored.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;STORAGE_URI&lt;/code&gt;: The Minio service’s IP address (found via docker inspect), used to access the Minio object storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;3. Configuring Spark with Iceberg and Nessie Settings&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;conf = (
    pyspark.SparkConf()
        .setAppName(&apos;sales_data_app&apos;)
        # Include necessary packages
        .set(&apos;spark.jars.packages&apos;, &apos;org.postgresql:postgresql:42.7.3,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.77.1,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8&apos;)
        # Enable Iceberg and Nessie extensions
        .set(&apos;spark.sql.extensions&apos;, &apos;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions&apos;)
        # Configure Nessie catalog
        .set(&apos;spark.sql.catalog.nessie&apos;, &apos;org.apache.iceberg.spark.SparkCatalog&apos;)
        .set(&apos;spark.sql.catalog.nessie.uri&apos;, CATALOG_URI)
        .set(&apos;spark.sql.catalog.nessie.ref&apos;, &apos;main&apos;)
        .set(&apos;spark.sql.catalog.nessie.authentication.type&apos;, &apos;NONE&apos;)
        .set(&apos;spark.sql.catalog.nessie.catalog-impl&apos;, &apos;org.apache.iceberg.nessie.NessieCatalog&apos;)
        # Set Minio as the S3 endpoint for Iceberg storage
        .set(&apos;spark.sql.catalog.nessie.s3.endpoint&apos;, STORAGE_URI)
        .set(&apos;spark.sql.catalog.nessie.warehouse&apos;, WAREHOUSE)
        .set(&apos;spark.sql.catalog.nessie.io-impl&apos;, &apos;org.apache.iceberg.aws.s3.S3FileIO&apos;)
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This block configures the Spark session to work with Apache Iceberg and Nessie:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Packages&lt;/strong&gt;: Specifies the necessary Spark packages, including connectors for PostgreSQL, Iceberg, Nessie, and the AWS SDK used to interact with S3/Minio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extensions&lt;/strong&gt;: The &lt;code&gt;IcebergSparkSessionExtensions&lt;/code&gt; and &lt;code&gt;NessieSparkSessionExtensions&lt;/code&gt; are enabled to work with Iceberg and Nessie catalogs within Spark.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nessie Catalog Configuration&lt;/strong&gt;: The catalog is registered under the name &lt;code&gt;nessie&lt;/code&gt;, pointing at &lt;code&gt;CATALOG_URI&lt;/code&gt; and using the &lt;code&gt;main&lt;/code&gt; reference (branch) in Nessie, with &lt;code&gt;NessieCatalog&lt;/code&gt; as the catalog implementation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Storage Configuration&lt;/strong&gt;: The Minio service is set as the S3 endpoint (&lt;code&gt;STORAGE_URI&lt;/code&gt;) and the warehouse path is configured for storing Iceberg tables.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;4. Starting the Spark Session&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(&amp;quot;Spark Session Started&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This line starts the Spark session using the previously defined configuration (conf). The session is the entry point for working with Spark data and accessing the Nessie catalog.&lt;/p&gt;
&lt;h4&gt;5. Defining the Schema for the Sales Data&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;schema = StructType([
    StructField(&amp;quot;order_id&amp;quot;, IntegerType(), True),
    StructField(&amp;quot;customer_id&amp;quot;, IntegerType(), True),
    StructField(&amp;quot;product&amp;quot;, StringType(), True),
    StructField(&amp;quot;quantity&amp;quot;, IntegerType(), True),
    StructField(&amp;quot;price&amp;quot;, DoubleType(), True),
    StructField(&amp;quot;order_date&amp;quot;, StringType(), True)
])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we define the schema for the sales data. This schema includes fields such as order_id, customer_id, product, quantity, price, and order_date, specifying their data types and whether they can contain null values (the True flag).&lt;/p&gt;
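&lt;p&gt;To make the nullability flags concrete, here is a plain-Python sketch (no Spark required; the helper and field list are illustrative, not part of the script) of the check Spark effectively performs when building the DataFrame: a &lt;code&gt;None&lt;/code&gt; is accepted only where the field is marked nullable:&lt;/p&gt;

```python
# Illustrative only: mirrors the StructType above as (name, python type, nullable).
schema_fields = [
    ("order_id", int, True),
    ("customer_id", int, True),
    ("product", str, True),
    ("quantity", int, True),
    ("price", float, True),
    ("order_date", str, True),
]

def row_ok(row):
    """Return True if each value matches its field type, allowing None
    only where the field is nullable."""
    for value, (_name, typ, nullable) in zip(row, schema_fields):
        if value is None:
            if not nullable:
                return False
        elif not isinstance(value, typ):
            return False
    return True

print(row_ok((4, 104, "Monitor", None, 200.00, "2023-08-02")))  # True
```

&lt;p&gt;Because every field above is nullable, the rows with missing &lt;code&gt;quantity&lt;/code&gt; and &lt;code&gt;customer_id&lt;/code&gt; in the next step load without error.&lt;/p&gt;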
&lt;h4&gt;6. Creating a DataFrame with Messy Sales Data&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;sales_data = [
    (1, 101, &amp;quot;Laptop&amp;quot;, 1, 1000.00, &amp;quot;2023-08-01&amp;quot;),
    (2, 102, &amp;quot;Mouse&amp;quot;, 2, 25.50, &amp;quot;2023-08-01&amp;quot;),
    (3, 103, &amp;quot;Keyboard&amp;quot;, 1, 45.00, &amp;quot;2023-08-01&amp;quot;),
    (1, 101, &amp;quot;Laptop&amp;quot;, 1, 1000.00, &amp;quot;2023-08-01&amp;quot;),  # Duplicate
    (4, 104, &amp;quot;Monitor&amp;quot;, None, 200.00, &amp;quot;2023-08-02&amp;quot;),  # Missing quantity
    (5, None, &amp;quot;Mouse&amp;quot;, 1, 25.50, &amp;quot;2023-08-02&amp;quot;)  # Missing customer_id
]

sales_df = spark.createDataFrame(sales_data, schema)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This section defines a list of tuples representing the sales data. Some rows intentionally contain duplicates and missing values, simulating messy data.&lt;/p&gt;
&lt;p&gt;We convert this list into a Spark DataFrame (sales_df) using the schema defined earlier. This DataFrame will be written to an Iceberg table.&lt;/p&gt;
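&lt;p&gt;As a quick illustration of the issues hiding in this sample (plain Python, no Spark needed), you can count the duplicate rows and the rows with missing values directly:&lt;/p&gt;

```python
# The same sample rows as above; two are intentionally problematic.
sales_data = [
    (1, 101, "Laptop", 1, 1000.00, "2023-08-01"),
    (2, 102, "Mouse", 2, 25.50, "2023-08-01"),
    (3, 103, "Keyboard", 1, 45.00, "2023-08-01"),
    (1, 101, "Laptop", 1, 1000.00, "2023-08-01"),   # Duplicate
    (4, 104, "Monitor", None, 200.00, "2023-08-02"),  # Missing quantity
    (5, None, "Mouse", 1, 25.50, "2023-08-02"),       # Missing customer_id
]

# Tuples are hashable, so a set collapses exact duplicates.
duplicates = len(sales_data) - len(set(sales_data))
rows_with_nulls = [row for row in sales_data if None in row]

print(f"duplicate rows: {duplicates}")                      # 1
print(f"rows with missing values: {len(rows_with_nulls)}")  # 2
```

&lt;p&gt;These are exactly the defects the Silver-layer views in Dremio will clean up later.&lt;/p&gt;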
&lt;h4&gt;7. Creating a Namespace in Nessie&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.sql(&amp;quot;CREATE NAMESPACE IF NOT EXISTS nessie.sales;&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before writing data to the Nessie catalog, we create a namespace called sales in the Nessie catalog using Spark SQL. The &lt;code&gt;CREATE NAMESPACE&lt;/code&gt; command allows us to organize tables under a logical grouping, similar to a database schema.&lt;/p&gt;
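&lt;p&gt;One point worth making concrete: in the three-part identifier used throughout this script, the leading &lt;code&gt;nessie&lt;/code&gt; is the Spark catalog name (the alias configured earlier), not part of the Nessie-side path:&lt;/p&gt;

```python
# Illustrative breakdown of the identifier: catalog alias, namespace, table name.
identifier = "nessie.sales.sales_data_raw"
catalog, namespace, table = identifier.split(".")
print(catalog)    # nessie
print(namespace)  # sales
print(table)      # sales_data_raw
```

&lt;p&gt;Inside Nessie itself, the table is tracked under the key &lt;code&gt;sales.sales_data_raw&lt;/code&gt;.&lt;/p&gt;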
&lt;h4&gt;8. Writing the DataFrame to an Iceberg Table in Nessie&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;sales_df.writeTo(&amp;quot;nessie.sales.sales_data_raw&amp;quot;).createOrReplace()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This line writes the sales_df DataFrame to an Iceberg table called sales_data_raw under the nessie.sales namespace. The createOrReplace() method ensures that if the table already exists, it is replaced with the new data.&lt;/p&gt;
&lt;h4&gt;9. Verifying the Iceberg Table&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.read.table(&amp;quot;nessie.sales.sales_data_raw&amp;quot;).show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We verify that the data has been successfully written to the Iceberg table by reading the table back from the Nessie catalog and displaying the contents.&lt;/p&gt;
&lt;h4&gt;10. Stopping the Spark Session&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;spark.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, we stop the Spark session to free up resources and end the application.&lt;/p&gt;
&lt;h2&gt;Verifying Iceberg Data and Metadata in Minio&lt;/h2&gt;
&lt;p&gt;Once the sales data has been written to the Apache Iceberg table in the Nessie catalog, we can verify that both the data and metadata files have been successfully created and stored in Minio. Follow the steps below to inspect the structure of the Iceberg table in Minio, and then explore the metadata files directly to understand how Iceberg organizes and manages table metadata.&lt;/p&gt;
&lt;h4&gt;Step 1: Access the Minio UI to Verify Data and Metadata Files&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;Open your browser and navigate to the Minio UI at:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;http://localhost:9001
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Log in using the credentials defined in the &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Username&lt;/strong&gt;: &lt;code&gt;admin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Password&lt;/strong&gt;: &lt;code&gt;password&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;
&lt;p&gt;Once logged in, locate the bucket where the Iceberg table is stored (e.g., the &lt;code&gt;warehouse&lt;/code&gt; bucket).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Inside the bucket, you will see a directory structure that represents the Iceberg table. The structure typically includes:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Files&lt;/strong&gt;: These are the physical Parquet files containing the actual data for the table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata Files&lt;/strong&gt;: These are JSON files that track the state and evolution of the table, including schema changes, partitions, snapshots, and more.&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;Verify that both data and metadata files have been created for the &lt;code&gt;nessie.sales.sales_data_raw&lt;/code&gt; table. This confirms that Iceberg has correctly managed both the physical data storage and the metadata needed for table management.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Step 2: Create a New Python Notebook to Examine Iceberg Metadata&lt;/h4&gt;
&lt;p&gt;To better understand how Apache Iceberg structures its metadata, we will now create a new Python notebook in the JupyterLab environment and directly examine the metadata files.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open the JupyterLab interface&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In your browser, navigate to the JupyterLab environment at:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://localhost:8888
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Create a new Python notebook&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;In the JupyterLab interface, create a new Python notebook to run the following code, which will inspect the metadata files stored in Minio.&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Examine the Iceberg Metadata Files&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The metadata files are stored as JSON in the Iceberg directory structure. Below is an example of Python code you can use to read and examine the content of these metadata files:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import boto3
import json

# Define Minio connection parameters
minio_client = boto3.client(
    &apos;s3&apos;,
    endpoint_url=&apos;http://172.18.0.2:9000&apos;,  # Minio IP from docker inspect (yours may differ)
    aws_access_key_id=&apos;admin&apos;,
    aws_secret_access_key=&apos;password&apos;,
    region_name=&apos;us-east-1&apos;
)

# Specify the bucket and metadata file path
bucket_name = &apos;warehouse&apos;
metadata_file_key = &apos;sales/sales_data_raw_a2c0456f-77a6-4121-8d3a-1d8168404edc/metadata/00000-ea121056-0c00-46cb-b9ca-88643d3492cb.metadata.json&apos;  # Example metadata path

# Download the metadata file
metadata_file = minio_client.get_object(Bucket=bucket_name, Key=metadata_file_key)
metadata_content = metadata_file[&apos;Body&apos;].read().decode(&apos;utf-8&apos;)

# Parse and print the metadata content
metadata_json = json.loads(metadata_content)
print(json.dumps(metadata_json, indent=4))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connects to Minio&lt;/strong&gt;: Using the boto3 library, it establishes a connection to the Minio service using the credentials (admin and password).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieves Metadata File&lt;/strong&gt;: It downloads one of the Iceberg metadata files from the warehouse bucket (substitute the actual key from your own bucket for the example path above).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Parses and Prints Metadata&lt;/strong&gt;: The metadata file is parsed as JSON and displayed in a readable format.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Explore the Metadata Structure:&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
    &amp;quot;format-version&amp;quot;: 2,
    &amp;quot;table-uuid&amp;quot;: &amp;quot;2914be24-fd7d-4b54-bcc0-63edc5e03942&amp;quot;,
    &amp;quot;location&amp;quot;: &amp;quot;s3://warehouse/sales/sales_data_raw_a2c0456f-77a6-4121-8d3a-1d8168404edc&amp;quot;,
    &amp;quot;last-sequence-number&amp;quot;: 1,
    &amp;quot;last-updated-ms&amp;quot;: 1726146520362,
    &amp;quot;last-column-id&amp;quot;: 6,
    &amp;quot;current-schema-id&amp;quot;: 0,
    &amp;quot;schemas&amp;quot;: [
        {
            &amp;quot;type&amp;quot;: &amp;quot;struct&amp;quot;,
            &amp;quot;schema-id&amp;quot;: 0,
            &amp;quot;fields&amp;quot;: [
                {
                    &amp;quot;id&amp;quot;: 1,
                    &amp;quot;name&amp;quot;: &amp;quot;order_id&amp;quot;,
                    &amp;quot;required&amp;quot;: false,
                    &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;
                },
                {
                    &amp;quot;id&amp;quot;: 2,
                    &amp;quot;name&amp;quot;: &amp;quot;customer_id&amp;quot;,
                    &amp;quot;required&amp;quot;: false,
                    &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;
                },
                {
                    &amp;quot;id&amp;quot;: 3,
                    &amp;quot;name&amp;quot;: &amp;quot;product&amp;quot;,
                    &amp;quot;required&amp;quot;: false,
                    &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;
                },
                {
                    &amp;quot;id&amp;quot;: 4,
                    &amp;quot;name&amp;quot;: &amp;quot;quantity&amp;quot;,
                    &amp;quot;required&amp;quot;: false,
                    &amp;quot;type&amp;quot;: &amp;quot;int&amp;quot;
                },
                {
                    &amp;quot;id&amp;quot;: 5,
                    &amp;quot;name&amp;quot;: &amp;quot;price&amp;quot;,
                    &amp;quot;required&amp;quot;: false,
                    &amp;quot;type&amp;quot;: &amp;quot;double&amp;quot;
                },
                {
                    &amp;quot;id&amp;quot;: 6,
                    &amp;quot;name&amp;quot;: &amp;quot;order_date&amp;quot;,
                    &amp;quot;required&amp;quot;: false,
                    &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;
                }
            ]
        }
    ],
    &amp;quot;default-spec-id&amp;quot;: 0,
    &amp;quot;partition-specs&amp;quot;: [
        {
            &amp;quot;spec-id&amp;quot;: 0,
            &amp;quot;fields&amp;quot;: []
        }
    ],
    &amp;quot;last-partition-id&amp;quot;: 999,
    &amp;quot;default-sort-order-id&amp;quot;: 0,
    &amp;quot;sort-orders&amp;quot;: [
        {
            &amp;quot;order-id&amp;quot;: 0,
            &amp;quot;fields&amp;quot;: []
        }
    ],
    &amp;quot;properties&amp;quot;: {
        &amp;quot;owner&amp;quot;: &amp;quot;root&amp;quot;,
        &amp;quot;write.metadata.delete-after-commit.enabled&amp;quot;: &amp;quot;false&amp;quot;,
        &amp;quot;gc.enabled&amp;quot;: &amp;quot;false&amp;quot;,
        &amp;quot;write.parquet.compression-codec&amp;quot;: &amp;quot;zstd&amp;quot;
    },
    &amp;quot;current-snapshot-id&amp;quot;: 8859389821243348049,
    &amp;quot;refs&amp;quot;: {
        &amp;quot;main&amp;quot;: {
            &amp;quot;snapshot-id&amp;quot;: 8859389821243348049,
            &amp;quot;type&amp;quot;: &amp;quot;branch&amp;quot;
        }
    },
    &amp;quot;snapshots&amp;quot;: [
        {
            &amp;quot;sequence-number&amp;quot;: 1,
            &amp;quot;snapshot-id&amp;quot;: 8859389821243348049,
            &amp;quot;timestamp-ms&amp;quot;: 1726146520362,
            &amp;quot;summary&amp;quot;: {
                &amp;quot;operation&amp;quot;: &amp;quot;append&amp;quot;,
                &amp;quot;spark.app.id&amp;quot;: &amp;quot;local-1726146494182&amp;quot;,
                &amp;quot;added-data-files&amp;quot;: &amp;quot;6&amp;quot;,
                &amp;quot;added-records&amp;quot;: &amp;quot;6&amp;quot;,
                &amp;quot;added-files-size&amp;quot;: &amp;quot;10183&amp;quot;,
                &amp;quot;changed-partition-count&amp;quot;: &amp;quot;1&amp;quot;,
                &amp;quot;total-records&amp;quot;: &amp;quot;6&amp;quot;,
                &amp;quot;total-files-size&amp;quot;: &amp;quot;10183&amp;quot;,
                &amp;quot;total-data-files&amp;quot;: &amp;quot;6&amp;quot;,
                &amp;quot;total-delete-files&amp;quot;: &amp;quot;0&amp;quot;,
                &amp;quot;total-position-deletes&amp;quot;: &amp;quot;0&amp;quot;,
                &amp;quot;total-equality-deletes&amp;quot;: &amp;quot;0&amp;quot;
            },
            &amp;quot;manifest-list&amp;quot;: &amp;quot;s3://warehouse/sales/sales_data_raw_a2c0456f-77a6-4121-8d3a-1d8168404edc/metadata/snap-8859389821243348049-1-b395c768-1348-4f50-a762-8033fe417915.avro&amp;quot;,
            &amp;quot;schema-id&amp;quot;: 0
        }
    ],
    &amp;quot;statistics&amp;quot;: [],
    &amp;quot;partition-statistics&amp;quot;: [],
    &amp;quot;snapshot-log&amp;quot;: [
        {
            &amp;quot;timestamp-ms&amp;quot;: 1726146520362,
            &amp;quot;snapshot-id&amp;quot;: 8859389821243348049
        }
    ],
    &amp;quot;metadata-log&amp;quot;: []
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The metadata JSON file contains important information about the table, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema&lt;/strong&gt;: Defines the structure of the table (columns, types, etc.).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots&lt;/strong&gt;: Lists all snapshots of the table, which track historical versions of the data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Information&lt;/strong&gt;: Details about how the table is partitioned, if applicable.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By examining this metadata, you can gain insight into how Apache Iceberg tracks the state of the table, manages schema evolution, and supports features like time travel and partitioning.&lt;/p&gt;
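&lt;p&gt;As a small illustration (plain Python, with a trimmed-down dictionary standing in for the full metadata file above), the table&apos;s current snapshot is resolved by matching &lt;code&gt;current-snapshot-id&lt;/code&gt; against the entries in the &lt;code&gt;snapshots&lt;/code&gt; list:&lt;/p&gt;

```python
# A trimmed-down stand-in for the metadata document shown above.
metadata = {
    "current-snapshot-id": 8859389821243348049,
    "snapshots": [
        {
            "snapshot-id": 8859389821243348049,
            "sequence-number": 1,
            "summary": {"operation": "append", "total-records": "6"},
        }
    ],
}

# Resolve the current snapshot the way a reader of the file would.
current = next(
    s for s in metadata["snapshots"]
    if s["snapshot-id"] == metadata["current-snapshot-id"]
)
print(current["summary"]["operation"])      # append
print(current["summary"]["total-records"])  # 6
```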
&lt;h4&gt;Step 3: Analyze and Understand Iceberg Metadata&lt;/h4&gt;
&lt;p&gt;By exploring the Iceberg metadata files directly, you’ll see how Iceberg provides detailed information about your table’s state and changes over time. This metadata-driven architecture allows Iceberg to efficiently manage large datasets in a data lake, enabling advanced features such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Support for adding, dropping, or modifying columns without downtime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partitioning&lt;/strong&gt;: Efficient querying of partitioned data for performance optimization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snapshots and Time Travel&lt;/strong&gt;: Ability to roll back or query previous versions of the table.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This deep integration of data and metadata makes Iceberg a powerful table format for modern data lakehouse architectures.&lt;/p&gt;
&lt;h2&gt;Confirming Nessie is Tracking the Iceberg Table with Curl Commands&lt;/h2&gt;
&lt;p&gt;Now that we have created the &lt;code&gt;sales_data_raw&lt;/code&gt; table in Apache Iceberg, it’s important to confirm that the Nessie catalog is properly tracking this table. We can use a series of &lt;code&gt;curl&lt;/code&gt; commands to interact with the Nessie REST API and verify that the catalog has recorded the new table in the appropriate namespace.&lt;/p&gt;
&lt;h3&gt;Step 1: List All Available Branches&lt;/h3&gt;
&lt;p&gt;First, let&apos;s confirm that we are working with the correct reference (branch). Nessie uses Git-like versioning, so tables are tracked within branches. To list the branches, run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v1/trees&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command retrieves all available branches in the Nessie catalog. You should see a response that includes the main branch, where your sales_data_raw table is stored.&lt;/p&gt;
&lt;p&gt;Example response:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;references&amp;quot;: [
    {
      &amp;quot;type&amp;quot;: &amp;quot;BRANCH&amp;quot;,
      &amp;quot;name&amp;quot;: &amp;quot;main&amp;quot;,
      &amp;quot;hash&amp;quot;: &amp;quot;abcdef1234567890&amp;quot;
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: List All Tables in the sales Namespace&lt;/h3&gt;
&lt;p&gt;Next, let’s verify that the &lt;code&gt;sales_data_raw&lt;/code&gt; table exists in the &lt;code&gt;sales&lt;/code&gt; namespace. Note that the &lt;code&gt;nessie.&lt;/code&gt; prefix used in Spark is just the Spark catalog name; inside Nessie itself the table key is &lt;code&gt;sales.sales_data_raw&lt;/code&gt;. Use the following curl command to list the entries on the main branch:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v1/trees/tree/main/entries&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command lists all tables and namespaces tracked on the main branch. You should see an entry for the sales_data_raw table.&lt;/p&gt;
&lt;p&gt;Example response (abbreviated):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;entries&amp;quot;: [
    {
      &amp;quot;type&amp;quot;: &amp;quot;ICEBERG_TABLE&amp;quot;,
      &amp;quot;name&amp;quot;: {
        &amp;quot;elements&amp;quot;: [&amp;quot;sales&amp;quot;, &amp;quot;sales_data_raw&amp;quot;]
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This confirms that the sales_data_raw table is tracked in the sales namespace on the main branch.&lt;/p&gt;
&lt;h3&gt;Step 3: Retrieve Specific Metadata for the sales_data_raw Table&lt;/h3&gt;
&lt;p&gt;To get more detailed information about the sales_data_raw table, such as the location of its current metadata file, run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v1/contents/sales.sales_data_raw?ref=main&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command queries the specific entry for the sales_data_raw table. The response points at the table’s current Iceberg metadata file; the schema, partitioning, and snapshot details live inside that file, as we saw when examining it directly in Minio.&lt;/p&gt;
&lt;p&gt;Example response (abbreviated):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;type&amp;quot;: &amp;quot;ICEBERG_TABLE&amp;quot;,
  &amp;quot;metadataLocation&amp;quot;: &amp;quot;s3://warehouse/sales/sales_data_raw_a2c0456f-77a6-4121-8d3a-1d8168404edc/metadata/00000-ea121056-0c00-46cb-b9ca-88643d3492cb.metadata.json&amp;quot;,
  &amp;quot;snapshotId&amp;quot;: 8859389821243348049
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This response provides the metadata location for the sales_data_raw table. You can use it to verify that the table has been correctly tracked and is ready for querying in Dremio or Spark.&lt;/p&gt;
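&lt;p&gt;If you want to fetch that metadata file programmatically, the &lt;code&gt;metadataLocation&lt;/code&gt; URI first needs to be split into the bucket and object key that an S3 client such as boto3 expects. A minimal sketch (the helper name is our own):&lt;/p&gt;

```python
# Hypothetical helper: split an s3:// URI, such as a metadataLocation value,
# into the Bucket and Key arguments an S3 client expects.
def split_s3_uri(uri):
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError("expected an s3:// URI")
    bucket, _, key = uri[len(prefix):].partition("/")
    return bucket, key

bucket, key = split_s3_uri(
    "s3://warehouse/sales/sales_data_raw_a2c0456f-77a6-4121-8d3a-1d8168404edc"
    "/metadata/00000-ea121056-0c00-46cb-b9ca-88643d3492cb.metadata.json"
)
print(bucket)  # warehouse
```

&lt;p&gt;The resulting bucket and key are what we passed to &lt;code&gt;minio_client.get_object&lt;/code&gt; in the earlier notebook example.&lt;/p&gt;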
&lt;h3&gt;Step 4: Review the Commit History for the Table&lt;/h3&gt;
&lt;p&gt;Nessie versions the catalog with Git-like commits: every write to the table creates a new commit on the branch, while Iceberg’s own snapshot history lives in the metadata file we examined earlier (the &lt;code&gt;snapshots&lt;/code&gt; and &lt;code&gt;snapshot-log&lt;/code&gt; sections). To see the commit log for the main branch, run:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v1/trees/tree/main/log&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command retrieves the commits on the main branch, letting you see how the catalog has changed over time. You should see a response similar to:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;logEntries&amp;quot;: [
    {
      &amp;quot;commitMeta&amp;quot;: {
        &amp;quot;hash&amp;quot;: &amp;quot;abcdef1234567890&amp;quot;,
        &amp;quot;message&amp;quot;: &amp;quot;iceberg commit&amp;quot;,
        &amp;quot;commitTime&amp;quot;: &amp;quot;2023-09-12T12:28:40.362Z&amp;quot;
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each entry corresponds to a change committed to the catalog, such as creating the sales namespace or writing the sales_data_raw table.&lt;/p&gt;
&lt;h3&gt;Step 5: Verify the Table&apos;s Current Reference (Optional)&lt;/h3&gt;
&lt;p&gt;If you want to verify that the table&apos;s current state matches the latest commit or branch reference, you can use the following command to check the state of the main branch:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;curl -X GET &amp;quot;http://localhost:19120/api/v1/trees/tree/main&amp;quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command retrieves the latest commit on the main branch, including the hash. You can use this to ensure that the table’s current state is up to date with the latest commits.&lt;/p&gt;
&lt;h2&gt;Connecting Nessie and Minio as Sources in Dremio&lt;/h2&gt;
&lt;p&gt;Now that we&apos;ve confirmed that our Apache Iceberg table is being tracked by the Nessie catalog, it&apos;s time to connect Dremio to both Nessie and Minio so we can query the data and clean it up into more usable formats (Silver and Gold views). Dremio will allow us to access the raw data stored in Minio and transform it into higher-quality data, generating useful metrics from this process.&lt;/p&gt;
&lt;h3&gt;Step 1: Adding the Nessie Source in Dremio&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open Dremio&lt;/strong&gt;: Open your browser and navigate to the Dremio UI at:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;http://localhost:9047
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Add a Nessie Source&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Click on the &lt;strong&gt;“Add Source”&lt;/strong&gt; button in the bottom left corner of the Dremio interface.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Nessie&lt;/strong&gt; from the list of available sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Configure the Nessie Source&lt;/strong&gt;:
There are two sections to fill out: &lt;strong&gt;General&lt;/strong&gt; and &lt;strong&gt;Storage Settings&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;General Settings (Connecting to the Nessie Server)&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: Set the source name to &lt;code&gt;nessie&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Endpoint URL&lt;/strong&gt;: Enter the Nessie API endpoint URL as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;http://nessie:19120/api/v2
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt;: Set this to &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Storage Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Set this to &lt;code&gt;admin&lt;/code&gt; (Minio username).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Set this to &lt;code&gt;password&lt;/code&gt; (Minio password).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Set this to &lt;code&gt;warehouse&lt;/code&gt; (this is the bucket where our Iceberg tables are stored).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;fs.s3a.path.style.access&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;fs.s3a.endpoint&lt;/code&gt; to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;dremio.s3.compat&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option (since we are running Nessie locally on HTTP).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;: After filling out all the settings, click &lt;strong&gt;Save&lt;/strong&gt;. The Nessie source will now be connected to Dremio, and you will be able to browse the tables stored in the Nessie catalog.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 2: Adding Minio (Seed Bucket) as an S3 Source in Dremio&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Add an S3 Source&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Click on the &lt;strong&gt;“Add Source”&lt;/strong&gt; button again and select &lt;strong&gt;S3&lt;/strong&gt; from the list of sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Configure the S3 Source for Minio&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;General Settings&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: Set the source name to &lt;code&gt;seed&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Credentials&lt;/strong&gt;: Select &lt;strong&gt;AWS access key&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Key&lt;/strong&gt;: Set to &lt;code&gt;admin&lt;/code&gt; (Minio username).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secret Key&lt;/strong&gt;: Set to &lt;code&gt;password&lt;/code&gt; (Minio password).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt Connection&lt;/strong&gt;: Uncheck this option (since Minio is running locally).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Advanced Options&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Enable Compatibility Mode&lt;/strong&gt;: Set to &lt;code&gt;true&lt;/code&gt; (to ensure compatibility with Minio).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Root Path&lt;/strong&gt;: Set to &lt;code&gt;/seed&lt;/code&gt; (this is where the seed data files are located in Minio).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Connection Properties&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;fs.s3a.path.style.access&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;fs.s3a.endpoint&lt;/code&gt; to &lt;code&gt;minio:9000&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Save the Source&lt;/strong&gt;: After entering the configuration details, click &lt;strong&gt;Save&lt;/strong&gt;. The &lt;code&gt;seed&lt;/code&gt; bucket is now accessible in Dremio, and you can query the raw data stored in this bucket.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Step 3: Cleaning the Raw Data into a Silver View&lt;/h3&gt;
&lt;p&gt;Now that both sources are connected, we can begin cleaning up the raw sales data stored in the Iceberg table. In Dremio, you can create a &lt;strong&gt;Silver view&lt;/strong&gt;, which is a cleaned-up version of the raw data.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Query the Raw Data&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Navigate to the Nessie source in Dremio.&lt;/li&gt;
&lt;li&gt;Locate the &lt;code&gt;sales_data_raw&lt;/code&gt; table in the &lt;code&gt;nessie.sales&lt;/code&gt; namespace.&lt;/li&gt;
&lt;li&gt;Right-click on the table and choose &lt;strong&gt;New Query&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Clean the Data&lt;/strong&gt;:
In the SQL editor, you can clean the raw data by removing duplicates, fixing missing values, and standardizing the data. Here&apos;s an example of a SQL query to clean the sales data:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT DISTINCT
  COALESCE(order_id, 0) AS order_id,
  COALESCE(customer_id, 0) AS customer_id,
  product,
  COALESCE(quantity, 1) AS quantity,
  price,
  order_date
FROM nessie.sales.sales_data_raw
WHERE customer_id IS NOT NULL
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;COALESCE&lt;/code&gt; fills in missing values with defaults (e.g., &lt;code&gt;order_id&lt;/code&gt; and &lt;code&gt;customer_id&lt;/code&gt; are set to 0 if missing).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;DISTINCT&lt;/code&gt; removes duplicate rows from the dataset.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WHERE customer_id IS NOT NULL&lt;/code&gt; filters out rows with a missing &lt;code&gt;customer_id&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
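&lt;p&gt;To make the cleaning rules concrete, here is a small Python sketch that applies the same logic (fill defaults, drop rows missing &lt;code&gt;customer_id&lt;/code&gt;, de-duplicate) to a few sample rows. The sample data is illustrative, not from the actual table:&lt;/p&gt;

```python
# Illustrative sample rows; mirrors the SQL cleaning rules above
raw_rows = [
    {"order_id": 1, "customer_id": 10, "product": "widget", "quantity": 2, "price": 9.99},
    {"order_id": 1, "customer_id": 10, "product": "widget", "quantity": 2, "price": 9.99},   # duplicate
    {"order_id": 2, "customer_id": None, "product": "gadget", "quantity": 1, "price": 19.99},  # dropped
    {"order_id": None, "customer_id": 11, "product": "gizmo", "quantity": None, "price": 4.99},
]

def clean(rows):
    seen, out = set(), []
    for r in rows:
        if r["customer_id"] is None:  # WHERE customer_id IS NOT NULL
            continue
        cleaned = {
            "order_id": r["order_id"] if r["order_id"] is not None else 0,  # COALESCE(order_id, 0)
            "customer_id": r["customer_id"],
            "product": r["product"],
            "quantity": r["quantity"] if r["quantity"] is not None else 1,  # COALESCE(quantity, 1)
            "price": r["price"],
        }
        key = tuple(cleaned.values())  # SELECT DISTINCT
        if key not in seen:
            seen.add(key)
            out.append(cleaned)
    return out

silver = clean(raw_rows)
print(len(silver))  # the duplicate and the NULL-customer row are gone
```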
&lt;h4&gt;Save the Query as a Silver View:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After running the query, click on &lt;strong&gt;Save As&lt;/strong&gt; and save this cleaned-up dataset as a Silver view.&lt;/li&gt;
&lt;li&gt;Name the view &lt;code&gt;sales_data_silver&lt;/code&gt;, and choose a location under the Nessie catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 4: Generating Metrics from the Silver View (Gold Metrics)&lt;/h3&gt;
&lt;p&gt;With the Silver view cleaned up, we can now generate &amp;quot;Gold&amp;quot; metrics—higher-level aggregated data that provides business insights.&lt;/p&gt;
&lt;h4&gt;Create a New Query on the Silver View:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Right-click on the &lt;code&gt;sales_data_silver&lt;/code&gt; view and select &lt;strong&gt;New Query&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generate Gold Metrics&lt;/strong&gt;: In this query, we calculate metrics such as total sales, average order value, and total sales by product:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT
  product,
  COUNT(order_id) AS total_orders,
  SUM(quantity) AS total_quantity_sold,
  SUM(quantity * price) AS total_sales,
  AVG(quantity * price) AS avg_order_value
FROM nessie.sales.sales_data_silver
GROUP BY product
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This query generates the following metrics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Total Orders&lt;/strong&gt;: The number of orders per product.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total Quantity Sold&lt;/strong&gt;: The total quantity of each product sold.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total Sales&lt;/strong&gt;: The total revenue generated by each product.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Average Order Value&lt;/strong&gt;: The average value of an order for each product.&lt;/li&gt;
&lt;/ul&gt;
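&lt;p&gt;The same aggregation can be sketched in plain Python to show exactly what each metric computes. The sample rows below are illustrative, not from the actual Silver view:&lt;/p&gt;

```python
from collections import defaultdict

# Illustrative silver rows (product, order_id, quantity, price)
silver_rows = [
    {"product": "widget", "order_id": 1, "quantity": 2, "price": 10.0},
    {"product": "widget", "order_id": 2, "quantity": 1, "price": 10.0},
    {"product": "gizmo",  "order_id": 3, "quantity": 4, "price": 5.0},
]

# GROUP BY product
groups = defaultdict(list)
for r in silver_rows:
    groups[r["product"]].append(r)

gold = {}
for product, rows in groups.items():
    order_values = [r["quantity"] * r["price"] for r in rows]
    gold[product] = {
        "total_orders": len(rows),                                 # COUNT(order_id)
        "total_quantity_sold": sum(r["quantity"] for r in rows),   # SUM(quantity)
        "total_sales": sum(order_values),                          # SUM(quantity * price)
        "avg_order_value": sum(order_values) / len(rows),          # AVG(quantity * price)
    }

print(gold["widget"])
```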
&lt;h4&gt;Save the Query as a Gold View:&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;After running the query, save the results as a &lt;strong&gt;Gold view&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Name this view &lt;code&gt;sales_data_gold&lt;/code&gt;, and store it in the Nessie catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Step 5: Visualizing Metrics and Insights&lt;/h3&gt;
&lt;p&gt;Once you have the Gold view ready, you can use Dremio&apos;s BI Tool integrations or export the data to BI tools like Apache Superset, Tableau, or Power BI for further analysis. You now have clean, aggregated data (Silver and Gold views) ready for generating valuable insights and reporting.&lt;/p&gt;
&lt;p&gt;This process demonstrates how to transform raw, messy data into clean, structured views and meaningful metrics using Apache Iceberg, Nessie, Minio, and Dremio.&lt;/p&gt;
&lt;h2&gt;Accessing Data in Dremio: BI Tool Integrations, REST API, JDBC/ODBC, and Apache Arrow Flight&lt;/h2&gt;
&lt;p&gt;Dremio provides multiple ways to access your data, ensuring flexibility whether you&apos;re a data analyst using BI tools, a developer working with APIs, or a data scientist using Python notebooks. Here’s an overview of the different access methods available with Dremio.&lt;/p&gt;
&lt;h3&gt;BI Tool Integrations&lt;/h3&gt;
&lt;p&gt;Dremio integrates seamlessly with popular BI tools such as Tableau, Power BI, and Qlik. These tools can connect to Dremio using either JDBC or ODBC drivers, allowing analysts to directly query data in the data lakehouse without needing to move the data into a traditional data warehouse. With these integrations, you can build dashboards, visualizations, and reports on top of Dremio&apos;s unified data access layer.&lt;/p&gt;
&lt;p&gt;To connect your BI tool to Dremio:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tableau&lt;/strong&gt;: Use the Dremio JDBC driver to connect, configure your data source, and start building dashboards.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power BI&lt;/strong&gt;: Connect via the Dremio ODBC driver to query your Dremio datasets for report generation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;REST API&lt;/h3&gt;
&lt;p&gt;Dremio’s REST API allows developers to interact programmatically with the Dremio platform. You can execute queries, manage datasets, and control various aspects of your Dremio instance through HTTP requests. This is especially useful for building custom applications or automation workflows.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To authenticate and retrieve a token, you can use the &lt;code&gt;/login&lt;/code&gt; endpoint with a payload containing your credentials.&lt;/li&gt;
&lt;li&gt;Once authenticated, you can submit queries to Dremio using the &lt;code&gt;/sql&lt;/code&gt; endpoint, or manage sources and reflections through the API.&lt;/li&gt;
&lt;/ul&gt;
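&lt;p&gt;A minimal Python sketch of this flow is shown below, using only the standard library. The endpoint paths follow Dremio&apos;s REST API; the host, credentials, and query are placeholders for illustration:&lt;/p&gt;

```python
import json
from urllib import request

BASE = "http://localhost:9047"  # placeholder host for a local Dremio instance

def auth_header(token):
    # Dremio expects the token prefixed with "_dremio" in the Authorization header
    return {"Authorization": f"_dremio{token}", "Content-Type": "application/json"}

def login(username, password):
    # POST credentials to the /apiv2/login endpoint and return the session token
    body = json.dumps({"userName": username, "password": password}).encode()
    req = request.Request(f"{BASE}/apiv2/login", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["token"]

def submit_sql(token, sql):
    # POST a query to the /api/v3/sql endpoint; returns a job id you can poll for results
    body = json.dumps({"sql": sql}).encode()
    req = request.Request(f"{BASE}/api/v3/sql", data=body, headers=auth_header(token))
    with request.urlopen(req) as resp:
        return json.load(resp)["id"]

# Usage (requires a running Dremio instance):
# token = login("admin", "password")
# job_id = submit_sql(token, "SELECT 1")
```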
&lt;h3&gt;JDBC/ODBC&lt;/h3&gt;
&lt;p&gt;For integration with more traditional analytics workflows, Dremio provides both JDBC and ODBC drivers. These drivers enable you to connect to Dremio from a wide range of applications, such as SQL clients, BI tools, and custom applications, to query data using SQL.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;JDBC&lt;/strong&gt;: A common driver used in Java-based applications.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ODBC&lt;/strong&gt;: Useful for applications like Excel and other non-Java-based systems that support ODBC connections.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Apache Arrow Flight&lt;/h3&gt;
&lt;p&gt;Arrow Flight is an optimized protocol for transferring large datasets across networks efficiently using Apache Arrow. Dremio’s Arrow Flight interface allows high-performance data access directly into memory, enabling tools like Python, R, or any Arrow-enabled environment to query data from Dremio with very low latency.&lt;/p&gt;
&lt;p&gt;Arrow Flight is particularly useful for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fast data retrieval for analytics or machine learning.&lt;/li&gt;
&lt;li&gt;Working with large datasets in memory for interactive notebooks or custom applications.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Accessing Data in Jupyter Notebooks with &lt;code&gt;dremio-simple-query&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;If you&apos;re working in Python notebooks, Dremio can be accessed using the &lt;code&gt;dremio-simple-query&lt;/code&gt; library, which simplifies querying Dremio via Arrow Flight. This allows for high-performance querying and direct data manipulation in popular libraries like &lt;strong&gt;Polars&lt;/strong&gt; and &lt;strong&gt;Pandas&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Let’s walk through a practical example of querying the Gold dataset (&lt;code&gt;sales_data_gold&lt;/code&gt;) in Dremio and visualizing the results using &lt;code&gt;Polars&lt;/code&gt; and &lt;code&gt;Seaborn&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;Step 1: Setup the Dremio Connection in Python&lt;/h3&gt;
&lt;p&gt;Assuming you have &lt;strong&gt;Polars&lt;/strong&gt;, &lt;strong&gt;Seaborn&lt;/strong&gt;, and &lt;strong&gt;dremio-simple-query&lt;/strong&gt; installed in your environment (they are included in the environment we set up for this guide), you can start by setting up the Dremio connection.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dremio_simple_query.connect import get_token, DremioConnection
import polars as pl
import seaborn as sns
import matplotlib.pyplot as plt

# Dremio login details
login_endpoint = &amp;quot;http://dremio:9047/apiv2/login&amp;quot;
payload = {
    &amp;quot;userName&amp;quot;: &amp;quot;admin&amp;quot;,  # Dremio username
    &amp;quot;password&amp;quot;: &amp;quot;password&amp;quot;  # Dremio password
}

# Get the token
token = get_token(uri=login_endpoint, payload=payload)

# Dremio Arrow Flight endpoint (no SSL for local setup)
arrow_endpoint = &amp;quot;grpc://dremio:32010&amp;quot;

# Create the connection
dremio = DremioConnection(token, arrow_endpoint)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 2: Query the Gold Dataset (sales_data_gold)&lt;/h3&gt;
&lt;p&gt;Next, we&apos;ll query the &lt;code&gt;sales_data_gold&lt;/code&gt; dataset from Dremio using the &lt;code&gt;toPolars()&lt;/code&gt; method to return the data in a Polars DataFrame.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Query the Gold dataset
query = &amp;quot;SELECT * FROM nessie.sales.sales_data_gold;&amp;quot;
df = dremio.toPolars(query)

# Display the Polars DataFrame
print(df)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 3: Visualize the Data with Seaborn&lt;/h3&gt;
&lt;p&gt;Using the queried data, we can now visualize key metrics. In this example, we&apos;ll plot total sales by product.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Convert the Polars DataFrame to a Pandas DataFrame for Seaborn visualization
df_pandas = df.to_pandas()

# Create a bar plot of total sales by product
sns.barplot(data=df_pandas, x=&amp;quot;product&amp;quot;, y=&amp;quot;total_sales&amp;quot;, palette=&amp;quot;viridis&amp;quot;)
plt.title(&amp;quot;Total Sales by Product&amp;quot;)
plt.xlabel(&amp;quot;Product&amp;quot;)
plt.ylabel(&amp;quot;Total Sales&amp;quot;)
plt.xticks(rotation=45)
plt.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Step 4: Explanation of Query and Visualization&lt;/h3&gt;
&lt;p&gt;In this example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: We use the &lt;code&gt;dremio-simple-query&lt;/code&gt; library to establish a connection to Dremio over the Arrow Flight protocol, which ensures high-speed data retrieval directly into memory.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: We query the &lt;code&gt;sales_data_gold&lt;/code&gt; table, which contains aggregated sales metrics, and load the data into a Polars DataFrame.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: The data is converted into a Pandas DataFrame (for compatibility with Seaborn), and a simple bar plot is created to visualize total sales per product.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Benefits of Arrow Flight for Data Access&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;High Performance&lt;/strong&gt;: By using Apache Arrow Flight, you can retrieve large datasets from Dremio into your local environment much faster than traditional methods like JDBC/ODBC.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In-Memory Processing&lt;/strong&gt;: The data is transferred as Arrow tables, which can be efficiently processed in memory by tools like Polars, Pandas, and DuckDB.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easy Integration with Python&lt;/strong&gt;: With libraries like &lt;code&gt;dremio-simple-query&lt;/code&gt;, accessing and visualizing Dremio data in Python notebooks becomes straightforward, enabling faster iteration on data analysis and experimentation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By combining Dremio’s Arrow Flight capabilities with powerful Python libraries, you can build high-performance, interactive data analysis workflows directly from your Jupyter notebooks, making it easy to transform and visualize your datasets on the fly.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg and the Data Lakehouse architecture are revolutionizing the way organizations manage and analyze large-scale data. By decoupling storage, table formats, catalogs, and query engines, the lakehouse model combines the flexibility of data lakes with the powerful management features of data warehouses. In this blog, we’ve explored the technologies that enable the lakehouse paradigm, such as &lt;strong&gt;Minio&lt;/strong&gt; for object storage, &lt;strong&gt;Apache Iceberg&lt;/strong&gt; for ACID-compliant table formats, &lt;strong&gt;Nessie&lt;/strong&gt; for catalog versioning, &lt;strong&gt;Apache Spark&lt;/strong&gt; for distributed data processing, and &lt;strong&gt;Dremio&lt;/strong&gt; for fast, SQL-based analytics.&lt;/p&gt;
&lt;p&gt;We’ve walked through the steps of setting up a complete data lakehouse environment on your laptop using &lt;strong&gt;Docker Compose&lt;/strong&gt;, integrating these technologies into a cohesive system that demonstrates how they work together. From ingesting raw data into Apache Iceberg, to querying and cleaning it in Dremio, and finally generating valuable business metrics, you now have the tools and knowledge to build, explore, and scale your own data lakehouse.&lt;/p&gt;
&lt;p&gt;With powerful integrations like &lt;strong&gt;Arrow Flight&lt;/strong&gt; for high-performance data access and flexible options for querying and visualizing your data, the Data Lakehouse model empowers data teams to handle increasingly complex and large-scale datasets, unlocking the full potential of modern analytics.&lt;/p&gt;
&lt;p&gt;This hands-on guide is just the beginning—feel free to experiment with different datasets, configurations, and optimizations to see how this powerful architecture can meet your unique data needs. Whether you&apos;re running analytics or building scalable data pipelines, the lakehouse architecture provides the flexibility and performance required for the data-driven future.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introiceberg&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introiceberg&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introiceberg&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=introiceberg&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>5 Trends in the Data Lakehouse Space</title><link>https://iceberglakehouse.com/posts/2024-9-five-trends-in-data-lakehouse/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-9-five-trends-in-data-lakehouse/</guid><description>
- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external...</description><pubDate>Sun, 01 Sep 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehousetrends&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehousetrends&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The data lakehouse is emerging and evolving as the next iteration of analytical data architecture. It builds on previous approaches by integrating the data lake and data warehouse, which were traditionally separate, and reimagining the tightly coupled components of a data warehouse (storage, table format, catalog, processing) into a modular, deconstructed form. In this new architecture, the data lake is the storage layer, allowing you to stitch together diverse, interoperable components modularly. Let&apos;s explore the current trends in the lakehouse space to see where we stand today.&lt;/p&gt;
&lt;h2&gt;1. Storage Vendors Becoming Data Warehouses&lt;/h2&gt;
&lt;p&gt;Before data lakehouses became the dominant trend, the industry was marked by a rush to the cloud, driven by the elastic scalability offered by vendors like AWS, Azure, and Google Cloud. Each of these vendors had its own data warehouse solution (Redshift, Synapse, and BigQuery, respectively). In the early days of the lakehouse, these vendors enhanced their object storage products—already used in data lakes to store both structured and unstructured data—to better support lakehouse use cases.&lt;/p&gt;
&lt;p&gt;However, as regulations around the globe and the rising costs of elastic cloud infrastructure at scale prompted a reevaluation of on-premises and hybrid scenarios, a new trend emerged. Object storage vendors like Minio, Vast Data, NetApp, and Pure Storage began to pair their existing products with software and hardware innovations to become full-fledged data analytics platforms. The missing piece was a robust processing layer that could operate both in the cloud and on-premises. Dremio, the Lakehouse Platform, filled that gap, enabling these products to offer comprehensive data lakehouse solutions in the cloud, on-premises, and in hybrid environments, and democratizing the market.&lt;/p&gt;
&lt;p&gt;This competition is beneficial for consumers, as it fosters innovation. The data lakehouse shift is driving significant advancements in how data is stored, retrieved, and processed.&lt;/p&gt;
&lt;h2&gt;2. The Evolution of Table Formats&lt;/h2&gt;
&lt;p&gt;Table formats are crucial to ensuring interoperability within the data lakehouse ecosystem. Interoperability hinges on tools supporting various metadata standards and the unique features each format offers. Initially, three primary formats emerged: Apache Iceberg, Delta Lake, and Apache Hudi. Recently, Apache Paimon, born out of the Apache Flink project, has entered the scene.&lt;/p&gt;
&lt;p&gt;Over time, Apache Iceberg has solidified its position as the industry &amp;quot;default&amp;quot; for consuming large datasets at scale with maximum interoperability, thanks to its extensive ecosystem. This status has been reinforced by the support and strategic moves of major players like Snowflake and Databricks. Delta Lake remains a popular option due to its deep integration with Databricks and its comprehensive Python support, an area where PyIceberg is rapidly catching up. Together, Iceberg and Delta Lake have emerged as the primary choices for data consumption: Iceberg stands out for its rich ecosystem, robust SQL support, and unique partitioning features, while Delta Lake is valued for its seamless Databricks integration and Python support.&lt;/p&gt;
&lt;p&gt;Apache Hudi has carved out a niche in the low-latency streaming ingestion space, particularly for scenarios involving frequent updates and deletes. Apache Paimon addresses a similar use case in streaming ingestion, so it will be interesting to see how Hudi and Paimon coexist in this segment of the market.&lt;/p&gt;
&lt;p&gt;The distinctions between these formats are becoming increasingly blurred as tools like Apache XTable (incubating) enable conversion between formats, and Delta Lake&apos;s UniForm feature allows for limited co-location of Iceberg and Hudi metadata with tables that are natively Delta Lake.&lt;/p&gt;
&lt;h2&gt;3. Iceberg Table Management&lt;/h2&gt;
&lt;p&gt;One of the unique features of Apache Iceberg is its independence from any major vendor control, which sets it apart from other table formats. While many tools can read and write to both Apache Iceberg and Delta Lake, the Iceberg ecosystem also extends to table management, offering a wide variety of options for optimizing the performance and storage of Iceberg tables. Previously, Tabular was the main player in this space, but after its acquisition by Databricks, it ceased taking on new customers. This shift has created an opportunity for other solutions from companies like Dremio, Upsolver, AWS, and Snowflake to step in and provide enterprise-grade solutions for automating the management of Apache Iceberg lakehouses.&lt;/p&gt;
&lt;h2&gt;4. The Catalog Wars Have Begun&lt;/h2&gt;
&lt;p&gt;As the focus on table formats has settled, a new battle has emerged: the catalog wars. Table formats are crucial for recognizing a group of Apache Parquet files as a single table, complete with statistics to enable efficient querying, and catalogs are how tools discover those tables. By listing the tables available in your lakehouse, catalogs make it possible for engines like Dremio and Apache Spark to find and query your data. Beyond discovery, catalogs could solve another major issue: the portability of governance. Traditionally, data governance involved securing files at the storage layer and separately governing tables across different tools and engines, since table formats lack inherent security mechanisms. Catalogs can potentially centralize governance rules and enforce them across multiple tools.&lt;/p&gt;
&lt;p&gt;In this space, attention is focused on solutions like Nessie, Apache Polaris (incubating), Gravitino, and Unity Catalog, as they address this governance challenge. Nessie, in particular, offers unique value with its &amp;quot;git-for-data&amp;quot; approach, allowing for git-like versioning semantics across multiple tables. This enables multi-table transactions, rollbacks, and easy isolation of data workloads for experimentation.&lt;/p&gt;
&lt;h2&gt;5. The Data Lakehouse Platform&lt;/h2&gt;
&lt;p&gt;The Lakehouse architecture is composed of various deconstructed components that come together to create data warehouse-like functionality. Any modular system like this creates a demand for a platform that can integrate all these pieces into a unified, user-friendly experience while retaining as much modularity as possible. As data platforms transition to the lakehouse model, they are evolving their feature sets to become comprehensive lakehouse platforms. Currently, Dremio stands out as a leading platform in this space, offering the following features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Compatibility with your choice of storage layer, whether in the cloud or on-premises&lt;/li&gt;
&lt;li&gt;Support for multiple table formats (Read/Write/Manage support for Iceberg, Read support for Delta Lake)&lt;/li&gt;
&lt;li&gt;Integrated Apache Iceberg Catalog with Automated Table Management, Role-Based Access Control (RBAC), and Git-For-Data features&lt;/li&gt;
&lt;li&gt;A catalog that supports connections from any tools compatible with Nessie, allowing you to use different engines alongside Dremio&lt;/li&gt;
&lt;li&gt;Support for working with various other Apache Iceberg catalogs&lt;/li&gt;
&lt;li&gt;The ability to enrich your Lakehouse data with data from databases, data warehouses, and file-based datasets on your data lake.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio provides a fast, open, and easy-to-use lakehouse platform that allows you to leverage your existing databases, data lakes, and data warehouses in the cloud or on-prem from a single interface. It also offers flexible deployment options in the cloud or on-premises. With its flexibility and data virtualization capabilities, Dremio is well-positioned to serve as the gateway to the Lakehouse.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The evolution of the data lakehouse marks a significant shift in how we approach analytical data architecture. By merging the strengths of data lakes and data warehouses, the lakehouse model offers a modular, interoperable framework that can adapt to diverse business needs. As we&apos;ve seen, storage vendors are expanding into full-fledged data lakehouses, table formats like Apache Iceberg and Delta Lake are maturing, and the landscape of table management and catalog solutions is rapidly evolving.&lt;/p&gt;
&lt;p&gt;These trends indicate that the data lakehouse is not just a passing phase but a robust and flexible architecture that will continue to grow and influence the future of data management. With platforms like Dremio leading the way, offering comprehensive solutions that integrate storage, table management, and cataloging, the data lakehouse is well on its way to becoming the standard for modern data architectures. As the ecosystem around the lakehouse continues to innovate, businesses will find new opportunities to optimize their data strategies, ensuring that they can scale, adapt, and thrive in an increasingly data-driven world.&lt;/p&gt;
&lt;h2&gt;Resources to Learn More about Iceberg&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehousetrends&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehousetrends&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehousetrends&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=lakehousetrends&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Using the alexmerced/datanotebook Docker Image</title><link>https://iceberglakehouse.com/posts/2024-8-using-the-alexmerced-datanotebook-image/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-8-using-the-alexmerced-datanotebook-image/</guid><description>
- [Watch My Intro to Data Playlist](https://www.youtube.com/watch?v=nq8ETrTgT7o&amp;list=PLsLAVBjQJO0p_4Nqz99tIjeoDYE97L0xY&amp;pp=iAQB)
- [Download Free Cop...</description><pubDate>Fri, 30 Aug 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=nq8ETrTgT7o&amp;amp;list=PLsLAVBjQJO0p_4Nqz99tIjeoDYE97L0xY&amp;amp;pp=iAQB&quot;&gt;Watch My Intro to Data Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/datanotebook830&quot;&gt;Download Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/datanotecourse830&quot;&gt;Enroll in the Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Sometimes you want to spin up a quick notebook environment for ad-hoc data work or practice. For this purpose, I&apos;ve built the &lt;a href=&quot;https://hub.docker.com/repository/docker/alexmerced/datanotebook/general&quot;&gt;alexmerced/datanotebook&lt;/a&gt; Docker image, and this blog explains how you can use it. The most significant difference between this image and the &lt;a href=&quot;https://hub.docker.com/repository/docker/alexmerced/spark35notebook/general&quot;&gt;alexmerced/spark35notebook&lt;/a&gt; image is that while this image does have pySpark installed, it does not run Spark within the same container. You also don&apos;t have to worry about a token to access the notebook with this image (I&apos;ll probably do the same with my next Spark image whenever I build it).&lt;/p&gt;
&lt;p&gt;To use it, navigate to an empty folder on your computer in your terminal and run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker run -p 8888:8888 -v $(pwd):/home/pydata/work --name my_notebook alexmerced/datanotebook
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This maps your current directory to the container&apos;s work directory so your files persist. Then you&apos;ll be able to go to localhost:8888 and start creating notebooks. Here is an example script you can try out.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-py&quot;&gt;# Import the Polars library
import polars as pl

# Create a sample DataFrame
data = {
    &amp;quot;id&amp;quot;: [1, 2, 3, 4, 5],
    &amp;quot;name&amp;quot;: [&amp;quot;Alice&amp;quot;, &amp;quot;Bob&amp;quot;, &amp;quot;Charlie&amp;quot;, &amp;quot;David&amp;quot;, &amp;quot;Eve&amp;quot;],
    &amp;quot;age&amp;quot;: [25, 30, 35, 40, 45],
    &amp;quot;city&amp;quot;: [&amp;quot;New York&amp;quot;, &amp;quot;Los Angeles&amp;quot;, &amp;quot;Chicago&amp;quot;, &amp;quot;Houston&amp;quot;, &amp;quot;Phoenix&amp;quot;]
}

# Convert the dictionary to a Polars DataFrame
df = pl.DataFrame(data)

# Display the DataFrame
print(&amp;quot;DataFrame:&amp;quot;)
print(df)

# Define the output file path
output_file = &amp;quot;output_data.parquet&amp;quot;

# Write the DataFrame to a Parquet file
df.write_parquet(output_file)

# Confirm the file was written
print(f&amp;quot;\nDataFrame successfully written to {output_file}&amp;quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Libraries Provided&lt;/h2&gt;
&lt;p&gt;Libraries you&apos;ll have available out of the box:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Data manipulation: &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;numpy&lt;/code&gt;, &lt;code&gt;polars&lt;/code&gt;, &lt;code&gt;dask&lt;/code&gt;, &lt;code&gt;ibis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Machine learning: &lt;code&gt;scikit-learn&lt;/code&gt;, &lt;code&gt;tensorflow&lt;/code&gt;, &lt;code&gt;torch&lt;/code&gt;, &lt;code&gt;xgboost&lt;/code&gt;, &lt;code&gt;lightgbm&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Visualization: &lt;code&gt;matplotlib&lt;/code&gt;, &lt;code&gt;seaborn&lt;/code&gt;, &lt;code&gt;plotly&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Database access: &lt;code&gt;psycopg2-binary&lt;/code&gt;, &lt;code&gt;mysqlclient&lt;/code&gt;, &lt;code&gt;sqlalchemy&lt;/code&gt;, &lt;code&gt;duckdb&lt;/code&gt;, &lt;code&gt;pyarrow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Object storage: &lt;code&gt;boto3&lt;/code&gt;, &lt;code&gt;s3fs&lt;/code&gt;, &lt;code&gt;minio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Other utilities: &lt;code&gt;openpyxl&lt;/code&gt;, &lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;beautifulsoup4&lt;/code&gt;, &lt;code&gt;lxml&lt;/code&gt;, &lt;code&gt;pyspark&lt;/code&gt;, &lt;code&gt;dremio-simple-query&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want to install additional libraries, you can use the following syntax.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-py&quot;&gt;# Install any additional libraries from inside a notebook cell (Polars shown as an example)
!pip install polars
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Connecting to Remote Spark Servers&lt;/h2&gt;
&lt;p&gt;pySpark is installed, so you can write and run pySpark code against external Spark servers. Configuring such a Spark session would look something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-py&quot;&gt;from pyspark.sql import SparkSession

# MinIO configurations
minio_endpoint = &amp;quot;http://minio-server:9000&amp;quot;  # Replace with your MinIO server URL
access_key = &amp;quot;your-access-key&amp;quot;
secret_key = &amp;quot;your-secret-key&amp;quot;
bucket_name = &amp;quot;your-bucket&amp;quot;
minio_path = f&amp;quot;s3a://{bucket_name}/output_data.parquet&amp;quot;

# Configure the SparkSession to connect to a remote Spark server and MinIO
# (replace spark://remote-spark-server:7077 with your Spark master URL)
spark = SparkSession.builder \
    .appName(&amp;quot;MinIOConnection&amp;quot;) \
    .master(&amp;quot;spark://remote-spark-server:7077&amp;quot;) \
    .config(&amp;quot;spark.driver.memory&amp;quot;, &amp;quot;2g&amp;quot;) \
    .config(&amp;quot;spark.executor.memory&amp;quot;, &amp;quot;4g&amp;quot;) \
    .config(&amp;quot;spark.hadoop.fs.s3a.endpoint&amp;quot;, minio_endpoint) \
    .config(&amp;quot;spark.hadoop.fs.s3a.access.key&amp;quot;, access_key) \
    .config(&amp;quot;spark.hadoop.fs.s3a.secret.key&amp;quot;, secret_key) \
    .config(&amp;quot;spark.hadoop.fs.s3a.path.style.access&amp;quot;, &amp;quot;true&amp;quot;) \
    .config(&amp;quot;spark.hadoop.fs.s3a.impl&amp;quot;, &amp;quot;org.apache.hadoop.fs.s3a.S3AFileSystem&amp;quot;) \
    .config(&amp;quot;spark.hadoop.fs.s3a.connection.ssl.enabled&amp;quot;, &amp;quot;false&amp;quot;) \
    .getOrCreate()

# Confirm the connection by printing the Spark configuration
print(&amp;quot;Spark session connected to:&amp;quot;, spark.sparkContext.master)
print(&amp;quot;MinIO path:&amp;quot;, minio_path)

# Example DataFrame creation
data = [
    (1, &amp;quot;Alice&amp;quot;, 25),
    (2, &amp;quot;Bob&amp;quot;, 30),
    (3, &amp;quot;Charlie&amp;quot;, 35)
]
columns = [&amp;quot;id&amp;quot;, &amp;quot;name&amp;quot;, &amp;quot;age&amp;quot;]

df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

# Perform some operations (e.g., filtering)
filtered_df = df.filter(df.age &amp;gt; 28)
filtered_df.show()

# Write the DataFrame to MinIO as a Parquet file
filtered_df.write.parquet(minio_path)

# Optionally, read the Parquet file back from MinIO
read_df = spark.read.parquet(minio_path)
read_df.show()

# Stop the Spark session
spark.stop()

&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Hope you find this docker image a functional flywheel for ad-hoc data engineering and data analytics work!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=nq8ETrTgT7o&amp;amp;list=PLsLAVBjQJO0p_4Nqz99tIjeoDYE97L0xY&amp;amp;pp=iAQB&quot;&gt;Watch My Intro to Data Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/datanotebook830&quot;&gt;Download Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://drmevn.fyi/datanotecourse830&quot;&gt;Enroll in the Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Understanding Apache Iceberg Delete Files</title><link>https://iceberglakehouse.com/posts/2024-8-understanding-apache-iceberg-delete-files/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-8-understanding-apache-iceberg-delete-files/</guid><description>
- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external...</description><pubDate>Thu, 29 Aug 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=deletefileblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=deletefileblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg is a powerful open-source table format for large-scale, distributed data storage. It enables complex data management tasks like schema evolution, time travel, and efficient query execution on massive datasets. An important feature of Iceberg is its ability to handle data deletions efficiently, without requiring expensive rewrites of entire datasets, when a table uses the &amp;quot;merge-on-read&amp;quot; strategy. This capability is made possible by &lt;strong&gt;delete files&lt;/strong&gt;, specialized files that track row-level deletions in an Iceberg table.&lt;/p&gt;
&lt;p&gt;We&apos;ll dive deep into the role of delete files in Apache Iceberg. We&apos;ll explore delete files, how they work, and why they are essential for maintaining data consistency and optimizing query performance in a data lakehouse environment. By the end of this post, you&apos;ll have a solid understanding of how delete files function within Iceberg and how they can be leveraged to enhance your data management strategies.&lt;/p&gt;
&lt;h2&gt;What Are Delete Files in Apache Iceberg?&lt;/h2&gt;
&lt;p&gt;Delete files in Apache Iceberg are specialized metadata files that store information about rows deleted from a table. Unlike traditional data deletion methods that require entire files or partitions to be rewritten, delete files allow for granular, row-level deletions without altering the original data files. This makes delete operations in Iceberg both efficient and scalable.&lt;/p&gt;
&lt;h3&gt;Types of Delete Files&lt;/h3&gt;
&lt;p&gt;Iceberg supports two types of delete files:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Position Deletes&lt;/strong&gt;: These delete files specify the exact position of rows within a data file that should be considered deleted. They are used when the physical location of the data (i.e., the row&apos;s position in the file) is known.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;+--------------------------------------------+-----------------+------------------------------------+
| file_path                                  | pos             | row                                |
+--------------------------------------------+-----------------+------------------------------------+
| s3://bucket/path/to/data-file-1.parquet    | 0               | { &amp;quot;id&amp;quot;: 1, &amp;quot;category&amp;quot;: &amp;quot;marsupial&amp;quot;,|
|                                            |                 |   &amp;quot;name&amp;quot;: &amp;quot;Koala&amp;quot; }               |
| s3://bucket/path/to/data-file-1.parquet    | 102             | { &amp;quot;id&amp;quot;: 2, &amp;quot;category&amp;quot;: &amp;quot;toy&amp;quot;,      |
|                                            |                 |   &amp;quot;name&amp;quot;: &amp;quot;Teddy&amp;quot; }               |
+--------------------------------------------+-----------------+------------------------------------+


&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Equality Deletes&lt;/strong&gt;: These delete files mark rows for deletion based on specific column values rather than their position. For example, suppose a record with a particular ID needs to be deleted. In that case, an equality delete file can specify that any row matching this ID should be excluded from query results.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code&gt;+--------------------+------+----------+---------+
| equality_ids       | id   | category | name    |
+--------------------+------+----------+---------+
| equality_ids=[1]   | 3    | NULL     | Grizzly |
| equality_ids=[1,2] | 4    | NULL     | Polar   |
+--------------------+------+----------+---------+

&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Importance of Delete Files&lt;/h3&gt;
&lt;p&gt;By separating deletion metadata from the data files, Iceberg ensures that data deletions are handled efficiently when fast writes with many row-level changes are needed. This separation also allows Iceberg to maintain the ACID properties—Atomicity, Consistency, Isolation, and Durability—essential for reliable data management in distributed systems.&lt;/p&gt;
&lt;h2&gt;The Role of Delete Files&lt;/h2&gt;
&lt;p&gt;Delete files allow Iceberg to handle row-level deletions with precision and efficiency, a feature that is particularly valuable in environments where datasets are large, complex, and continuously evolving. By using delete files, Iceberg can apply deletions without physically altering the data files, which leads to several key advantages.&lt;/p&gt;
&lt;h3&gt;Row-Level Deletions Without Rewrites&lt;/h3&gt;
&lt;p&gt;One of the most significant benefits of delete files is their ability to perform row-level deletions without rewriting the original data files. In traditional data management systems, deleting data often involves costly operations where entire files or partitions must be rewritten to exclude the deleted records. This can be both time-consuming and resource-intensive, especially in large-scale datasets.&lt;/p&gt;
&lt;p&gt;In contrast, Apache Iceberg leverages delete files to mark specific rows as deleted while the original data files remain unchanged. This approach significantly reduces the overhead of deletions and ensures that data modifications can be handled quickly and efficiently.&lt;/p&gt;
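&lt;p&gt;To build intuition, the merge-on-read behavior can be sketched in plain Python. This is a simplified, hypothetical model (the file path, rows, and delete predicates are invented), not Iceberg&apos;s actual reader code:&lt;/p&gt;

```python
# Simplified model of a merge-on-read scan: the data file is left untouched;
# delete files are applied as filters while reading.
data_file = "s3://bucket/path/to/data-file-1.parquet"
rows = [
    {"id": 1, "category": "marsupial", "name": "Koala"},
    {"id": 2, "category": "toy", "name": "Teddy"},
    {"id": 3, "category": "bear", "name": "Grizzly"},
]

# Position deletes: (file_path, row position) pairs marked as deleted.
position_deletes = {(data_file, 0)}

# Equality deletes: any row matching these column values is deleted.
equality_deletes = [{"id": 3}]

def is_live(file_path, pos, row):
    """Return True if the row survives both kinds of delete files."""
    if (file_path, pos) in position_deletes:
        return False
    for predicate in equality_deletes:
        if all(row.get(col) == val for col, val in predicate.items()):
            return False
    return True

live_names = [r["name"] for pos, r in enumerate(rows) if is_live(data_file, pos, r)]
print(live_names)  # ['Teddy']
```

Here the Koala row is removed by a position delete and the Grizzly row by an equality delete, while the original data file is never rewritten.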
&lt;h2&gt;Contents Inside a Delete File&lt;/h2&gt;
&lt;p&gt;To understand how delete files achieve their role in Apache Iceberg, it’s important to look at the specific metadata they contain. Each delete file is carefully structured to provide all the information necessary to identify and apply deletions to the relevant data files.&lt;/p&gt;
&lt;h3&gt;Key Fields in a Delete File&lt;/h3&gt;
&lt;p&gt;Here are some of the critical fields you’ll find inside a delete file:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;file_path&lt;/code&gt;&lt;/strong&gt;: This field indicates the path of the data file to which the delete file applies. It’s essential for mapping the delete operations to the correct data file in the dataset.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;pos&lt;/code&gt;&lt;/strong&gt;: Present in position delete files, this field specifies the exact position of the row within the data file that should be marked as deleted. This allows for precise, row-level deletions based on the physical layout of the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;row&lt;/code&gt;&lt;/strong&gt;: In equality delete files, the &lt;code&gt;row&lt;/code&gt; field contains the values that identify which rows should be deleted. For instance, if a particular ID needs to be deleted across multiple data files, this field will hold that ID value.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;partition&lt;/code&gt;&lt;/strong&gt;: This field contains the partition information of the data that is subject to deletion. It helps ensure that the delete file is applied only to the relevant partitions, further optimizing the deletion process.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;sequence_number&lt;/code&gt;&lt;/strong&gt;: Iceberg uses sequence numbers to track the order of changes made to the data. The &lt;code&gt;sequence_number&lt;/code&gt; in a delete file indicates when the deletion was committed relative to other changes in the dataset.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Delete files in Apache Iceberg are a powerful tool that enables efficient, performant, and precise row-level updates, especially in large-scale, distributed environments. By allowing row-level deletions without the need to rewrite entire data files, delete files optimize the performance of data lakehouse operations while maintaining the integrity and consistency of the dataset.&lt;/p&gt;
&lt;p&gt;Understanding how to leverage both position and equality delete files is crucial for data engineers looking to implement scalable, performant data architectures.&lt;/p&gt;
&lt;h2&gt;Resources to Learn More about Iceberg&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=deletefileblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=deletefileblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=deletefileblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=deletefileblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Understanding the Apache Iceberg Manifest</title><link>https://iceberglakehouse.com/posts/2024-8-understanding-apache-iceberg-manifest/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-8-understanding-apache-iceberg-manifest/</guid><description>
- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external...</description><pubDate>Tue, 27 Aug 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=manifestblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=manifestblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg is an open lakehouse table format that brings SQL-like capabilities and guarantees to data stored in distributed file systems, making it a cornerstone of modern data lakehouse architectures. It has become a popular choice for managing large datasets due to its ability to handle complex data engineering challenges, such as time travel, schema evolution, and efficient query execution. At the heart of Iceberg&apos;s architecture is its metadata, which is crucial for maintaining the integrity and performance of the data.&lt;/p&gt;
&lt;p&gt;We will explore a key component of Iceberg&apos;s metadata management: the &lt;strong&gt;Manifest File&lt;/strong&gt;. Manifest files are vital in how Iceberg tracks and manages individual data files, ensuring that queries are executed efficiently and data consistency is maintained across snapshots. Understanding the role of these files will help data engineers optimize their data platforms and make the most of Iceberg&apos;s capabilities.&lt;/p&gt;
&lt;h2&gt;What is a Manifest File?&lt;/h2&gt;
&lt;p&gt;A &lt;strong&gt;Manifest File&lt;/strong&gt; in Apache Iceberg is a metadata file that tracks individual data files associated with a particular table snapshot.&lt;/p&gt;
&lt;p&gt;A snapshot (tracked by Manifest List files, covered in a previous blog) in Iceberg includes one or more manifest files, each representing a subset of the data in the table. These manifest files ensure that Iceberg can manage large datasets without compromising performance. By breaking down the data into manageable chunks, Iceberg can efficiently track and query the data without scanning the entire dataset.&lt;/p&gt;
&lt;h2&gt;The Role of Manifest Files in Iceberg&lt;/h2&gt;
&lt;p&gt;Manifest files serve several vital functions within the Apache Iceberg architecture:&lt;/p&gt;
&lt;h3&gt;Tracking Data Files&lt;/h3&gt;
&lt;p&gt;Manifest files are responsible for tracking the data files that comprise a snapshot. Each manifest file lists the data files and metadata about those files, such as their locations, partitioning information, and metrics like record count and file size.&lt;/p&gt;
&lt;h3&gt;Facilitating Efficient Scans&lt;/h3&gt;
&lt;p&gt;One of the manifest files&apos; primary roles is enabling Iceberg to perform efficient scans. By summarizing key information about the data files, manifest files allow query engines to determine which files are relevant to a particular query. This means that only the necessary files are scanned, significantly reducing the amount of data read from storage and improving query performance.&lt;/p&gt;
&lt;p&gt;Manifest files are a crucial component of Apache Iceberg&apos;s architecture, providing the foundation for efficient data tracking, query planning, and snapshot management.&lt;/p&gt;
&lt;h2&gt;Contents Inside a Manifest File&lt;/h2&gt;
&lt;p&gt;A &lt;strong&gt;Manifest File&lt;/strong&gt; in Apache Iceberg is more than just a simple list of data files; it is a rich metadata file that contains detailed information essential for efficient data management and query optimization. Each manifest file serves as a catalog that Iceberg uses to track and manage the state of data files in the table.&lt;/p&gt;
&lt;h3&gt;Key Components of a Manifest File&lt;/h3&gt;
&lt;p&gt;Here are some of the critical fields you’ll find inside a manifest file:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;file_path&lt;/code&gt;&lt;/strong&gt;: This field records the location of the data file in the storage system. It is a string that provides the full path to the file, ensuring that Iceberg can quickly locate the data file when needed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;partition_data&lt;/code&gt;&lt;/strong&gt;: This field contains information about the partition values for the data in the file. Partitioning is a critical aspect of Iceberg&apos;s architecture, as it allows the data to be organized to make it easier to filter and query efficiently. The &lt;code&gt;partition_data&lt;/code&gt; field ensures that Iceberg can apply the correct partitioning logic during query execution.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;file_format&lt;/code&gt;&lt;/strong&gt;: This field specifies the format of the data file (e.g., Parquet, Avro, ORC). Knowing the format is crucial because it determines how Iceberg reads and writes the file, as well as how it can optimize queries against the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;record_count&lt;/code&gt;&lt;/strong&gt;: The &lt;code&gt;record_count&lt;/code&gt; field indicates the number of records contained in the data file. This metadata helps query engines estimate the size of the data set and make decisions about how to optimize query execution.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;file_size_in_bytes&lt;/code&gt;&lt;/strong&gt;: This field provides the total size of the data file in bytes. Like the &lt;code&gt;record_count&lt;/code&gt;, the file size is an essential metric for understanding the scale of the data and for planning efficient scans.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;value_counts&lt;/code&gt;, &lt;code&gt;null_value_counts&lt;/code&gt;, &lt;code&gt;nan_value_counts&lt;/code&gt;&lt;/strong&gt;: These fields are metrics that provide detailed statistics about the data in the file. &lt;code&gt;value_counts&lt;/code&gt; gives the total number of values in each column, &lt;code&gt;null_value_counts&lt;/code&gt; tracks the number of null values, and &lt;code&gt;nan_value_counts&lt;/code&gt; records the number of NaN (Not a Number) values. These metrics are invaluable for query optimization, as they help the query engine determine whether a file should be scanned based on the presence or absence of relevant data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;lower_bounds&lt;/code&gt; and &lt;code&gt;upper_bounds&lt;/code&gt;&lt;/strong&gt;: These fields store the minimum and maximum values for each column in the data file. Query engines use these bounds to perform min/max pruning—skipping over data files that do not match the query’s filter criteria. For example, if a query is looking for data within a specific date range, and the &lt;code&gt;lower_bounds&lt;/code&gt; and &lt;code&gt;upper_bounds&lt;/code&gt; of a data file fall outside, the query engine can skip reading that file entirely.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
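&lt;p&gt;The min/max pruning enabled by the bounds fields can be sketched in a few lines of Python. This is an illustrative model only (the manifest entries and the date range are invented), not Iceberg&apos;s real planning code:&lt;/p&gt;

```python
# Simplified min/max pruning: skip any data file whose column value range
# cannot overlap the query's filter range.
manifest_entries = [
    {"file_path": "s3://bucket/data/a.parquet",
     "lower_bounds": {"event_date": "2024-01-01"},
     "upper_bounds": {"event_date": "2024-01-31"}},
    {"file_path": "s3://bucket/data/b.parquet",
     "lower_bounds": {"event_date": "2024-02-01"},
     "upper_bounds": {"event_date": "2024-02-29"}},
]

def may_contain(entry, column, lo, hi):
    # Skip the file only when its value range falls entirely outside [lo, hi].
    out_of_range = lo > entry["upper_bounds"][column] or entry["lower_bounds"][column] > hi
    return not out_of_range

# Filter: event_date BETWEEN '2024-02-10' AND '2024-02-20'
to_scan = [e["file_path"] for e in manifest_entries
           if may_contain(e, "event_date", "2024-02-10", "2024-02-20")]
print(to_scan)  # ['s3://bucket/data/b.parquet']
```

Because a.parquet holds only January dates, its bounds rule it out before any data is read, which is exactly the file-skipping behavior the bounds metadata exists to support.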
&lt;h3&gt;How These Fields Work Together&lt;/h3&gt;
&lt;p&gt;Each of these fields within a manifest file plays a critical role in linking the snapshot to its underlying data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficient Data Location&lt;/strong&gt;: The &lt;code&gt;file_path&lt;/code&gt; and &lt;code&gt;file_format&lt;/code&gt; fields ensure that Iceberg can quickly locate and correctly interpret the data files, regardless of where they are stored or formatted.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enhanced Query Optimization&lt;/strong&gt;: Fields like &lt;code&gt;partition_data&lt;/code&gt;, &lt;code&gt;record_count&lt;/code&gt;, &lt;code&gt;file_size_in_bytes&lt;/code&gt;, and the various counts (e.g., &lt;code&gt;value_counts&lt;/code&gt;) provide the metadata necessary for Iceberg to optimize query execution. By understanding the data&apos;s size, format, and structure, Iceberg can plan more efficient scans, reducing the amount of data read and speeding up queries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Pruning&lt;/strong&gt;: The &lt;code&gt;lower_bounds&lt;/code&gt; and &lt;code&gt;upper_bounds&lt;/code&gt; fields are particularly important for query optimization. They enable Iceberg to prune unnecessary data files before scanning begins, ensuring that only the most relevant data is processed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The contents of a manifest file allow Iceberg to maintain control over large datasets, ensuring that they are managed efficiently and that queries are executed as quickly as possible. By leveraging this metadata, Iceberg can deliver the high performance and scalability that modern data lakehouses require.&lt;/p&gt;
&lt;h2&gt;The Interplay Between Manifest Files and the Manifest List&lt;/h2&gt;
&lt;p&gt;While individual manifest files are crucial for tracking and managing data files, they do not exist in isolation. Instead, they are part of a larger structure that includes the &lt;strong&gt;Manifest List&lt;/strong&gt;. The Manifest List acts as an index that tracks all the manifest files associated with a particular snapshot, summarizing their contents and providing a high-level view of the dataset.&lt;/p&gt;
&lt;h3&gt;Hierarchical Metadata Management&lt;/h3&gt;
&lt;p&gt;The relationship between manifest files and the Manifest List allows Iceberg to manage metadata hierarchically. The Manifest List summarizes the data tracked by each manifest file, while the manifest files provide detailed metadata about the individual data files. This hierarchy ensures that Iceberg can efficiently manage large datasets by organizing metadata into layers, each serving a specific purpose.&lt;/p&gt;
&lt;h3&gt;Snapshot Management&lt;/h3&gt;
&lt;p&gt;When a new snapshot is created in Iceberg, it includes a new Manifest List, which references the relevant manifest files. This structure allows Iceberg to manage snapshots in an atomic and consistent manner. By updating the Manifest List, Iceberg can track changes to the dataset, such as adding or deleting data files, without disrupting ongoing queries.&lt;/p&gt;
&lt;h3&gt;Query Optimization&lt;/h3&gt;
&lt;p&gt;The Manifest List plays a crucial role in query optimization by providing the query engine with a summary of the data in each manifest file. This summary includes the number of files, their sizes, and the partition ranges they cover. By consulting the Manifest List, the query engine can quickly determine which manifest files are relevant to the query and then dive deeper into those files to analyze their detailed metadata.&lt;/p&gt;
&lt;h2&gt;Benefits of the Manifest File Structure&lt;/h2&gt;
&lt;p&gt;The use of manifest files in Apache Iceberg offers several significant benefits, particularly when it comes to scalability, flexibility, and efficiency.&lt;/p&gt;
&lt;h3&gt;Scalability&lt;/h3&gt;
&lt;p&gt;The manifest file structure allows Iceberg to scale to manage vast datasets efficiently. By breaking down metadata into manageable chunks, Iceberg ensures that even as the dataset grows, the system can maintain high performance without overwhelming the query engine or the underlying storage.&lt;/p&gt;
&lt;h3&gt;Flexibility&lt;/h3&gt;
&lt;p&gt;Manifest files provide the flexibility to manage different types of data files and partitioning schemes within the same snapshot. This flexibility is critical for data engineers who need to adapt to changing data requirements without disrupting the overall system.&lt;/p&gt;
&lt;h3&gt;Efficiency&lt;/h3&gt;
&lt;p&gt;By leveraging the detailed metadata in manifest files, Iceberg can optimize query execution and reduce the number of I/O operations. This efficiency translates into faster query times and lower costs, making Iceberg an ideal choice for managing large-scale data in modern lakehouses.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Manifest files are a foundational component of Apache Iceberg&apos;s architecture, critical in tracking, managing, and optimizing the use of data files within a table snapshot. By understanding how these files work, data engineers can harness the full power of Iceberg to create efficient, scalable, and flexible data lakes. The interplay between manifest files and the Manifest List ensures that Iceberg can easily handle large datasets, delivering high performance and reliability.&lt;/p&gt;
&lt;p&gt;Exploring manifest files is a great place to start for those looking to dive deeper into Apache Iceberg and its metadata management capabilities. By mastering the concepts discussed in this article, you can optimize your data platform to meet the demands of modern data workloads.&lt;/p&gt;
&lt;h2&gt;Resources to Learn More about Iceberg&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=manifestblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=manifestblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=manifestblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=influencer&amp;amp;utm_campaign=manifestblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Understanding the Apache Iceberg Manifest List (Snapshot)</title><link>https://iceberglakehouse.com/posts/2024-8-understanding-apache-iceberg-manifest-list/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-8-understanding-apache-iceberg-manifest-list/</guid><description>- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_...</description><pubDate>Sun, 25 Aug 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=social_free&amp;amp;utm_campaign=manifestlistblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=social_free&amp;amp;utm_campaign=manifestlistblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is an open lakehouse table format designed to take datasets in distributed file systems and turn them into database-like tables. It has gained popularity for its ability to handle complex data engineering challenges, such as ensuring data consistency, enabling schema evolution, and supporting efficient query execution. One of the critical components that make this possible is its robust metadata management.&lt;/p&gt;
&lt;p&gt;We will focus on a crucial aspect of Iceberg&apos;s metadata architecture—the &lt;strong&gt;Manifest List&lt;/strong&gt; file. The Manifest List plays a pivotal role in Iceberg&apos;s snapshot mechanism, helping to track changes across the dataset and optimize query performance. Understanding the purpose of the Manifest List, the details it contains, and how query engines utilize it to plan which data files to scan is essential for data engineers looking to maximize the efficiency of their data lakehouses.&lt;/p&gt;
&lt;h2&gt;What is a Manifest List?&lt;/h2&gt;
&lt;p&gt;The &lt;strong&gt;Manifest List&lt;/strong&gt; is a fundamental component within Apache Iceberg’s architecture. It serves as a metadata file that tracks all the manifest files associated with a specific snapshot of a table. In simpler terms, when a snapshot is created, the Manifest List records which groups of data files (manifests) are included in that snapshot.&lt;/p&gt;
&lt;h3&gt;The Role of the Manifest List in Iceberg&lt;/h3&gt;
&lt;p&gt;The primary role of the Manifest List is to efficiently manage and track the state of data within a snapshot. Unlike traditional systems where entire directories or large sets of files are scanned to identify relevant data, Iceberg uses the Manifest List to keep this process highly efficient.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficient Data Tracking&lt;/strong&gt;: The Manifest List keeps a concise record of all manifest files, which in turn track the actual data files. This layered approach ensures that only the necessary metadata is accessed during query planning, significantly reducing the overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Atomic Snapshot Management&lt;/strong&gt;: Every time a new snapshot is created, a new Manifest List is written. This allows for atomic updates, meaning that the changes to the dataset (like adding or removing data files) are committed in one go, ensuring consistency and isolation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimization of Query Execution&lt;/strong&gt;: By summarizing information about the data in the manifests, the Manifest List allows query engines to quickly determine which parts of the data are relevant to a query, thus skipping over unnecessary files.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In essence, the Manifest List acts as a crucial index that ensures Iceberg can scale to manage massive datasets without compromising on query performance or data integrity.&lt;/p&gt;
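&lt;p&gt;The snapshot-per-commit idea can be illustrated with a small sketch. The &lt;code&gt;Table&lt;/code&gt; class below is hypothetical and stands in for the catalog and file system; the point is that each commit publishes a new immutable manifest list, so concurrent readers keep a consistent view:&lt;/p&gt;

```python
# Illustrative sketch: every commit writes a brand-new manifest list instead
# of mutating the old one, so a reader that already resolved a snapshot keeps
# a consistent view while writers commit new ones.

class Table:
    def __init__(self):
        self.snapshots = []          # each snapshot owns its own manifest list

    def commit(self, manifest_paths):
        # Build the new manifest list from the previous one plus the new
        # manifests, then publish it as a single atomic step.
        previous = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(previous + list(manifest_paths))

    def current_snapshot(self):
        return self.snapshots[-1]

t = Table()
t.commit(["m1.avro"])
reader_view = t.current_snapshot()     # a reader pins snapshot 1
t.commit(["m2.avro"])                  # a writer commits snapshot 2
print(reader_view)                     # ['m1.avro']
print(t.current_snapshot())            # ['m1.avro', 'm2.avro']
```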
&lt;h2&gt;Contents Inside the Manifest List File&lt;/h2&gt;
&lt;p&gt;The Manifest List file is not just a simple pointer to other files; it is a rich metadata file that contains detailed information crucial for the efficient management and querying of data in Apache Iceberg. Each entry in a Manifest List corresponds to a manifest file and includes various fields that describe the state and characteristics of that manifest.&lt;/p&gt;
&lt;h3&gt;Key Components of the Manifest List&lt;/h3&gt;
&lt;p&gt;Here are the essential fields you’ll find inside a Manifest List file:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;manifest_path&lt;/code&gt;&lt;/strong&gt;: This field specifies the location of the manifest file. It is a string that points to the physical file in the storage system where the manifest is stored.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;manifest_length&lt;/code&gt;&lt;/strong&gt;: This field indicates the size of the manifest file in bytes. Knowing the size helps in estimating the cost of reading the manifest, which can be important for optimizing query execution.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;partition_spec_id&lt;/code&gt;&lt;/strong&gt;: Each table in Iceberg can have multiple partition specifications over time as the schema evolves. This field tracks the ID of the partition specification used to write the manifest, allowing Iceberg to apply the correct partitioning logic when reading the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;content&lt;/code&gt;&lt;/strong&gt;: This field specifies the type of content tracked by the manifest, which could be either data files (&lt;code&gt;0&lt;/code&gt;) or delete files (&lt;code&gt;1&lt;/code&gt;). This distinction is critical for operations like merges and query planning, where data and deletes are handled differently.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;sequence_number&lt;/code&gt; and &lt;code&gt;min_sequence_number&lt;/code&gt;&lt;/strong&gt;: These fields are part of Iceberg&apos;s versioning system. The &lt;code&gt;sequence_number&lt;/code&gt; represents when the manifest was added to the table, while &lt;code&gt;min_sequence_number&lt;/code&gt; provides the earliest sequence number of all files tracked by this manifest. These fields are crucial for understanding the evolution of data and for implementing time-travel queries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;added_files_count&lt;/code&gt;, &lt;code&gt;existing_files_count&lt;/code&gt;, &lt;code&gt;deleted_files_count&lt;/code&gt;&lt;/strong&gt;: These fields provide a count of the files in different states within the manifest—added, existing, or deleted. This metadata helps query engines decide if a manifest is relevant for a particular operation, potentially skipping manifests that contain only deleted files or files outside the scope of the query.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition Summaries&lt;/strong&gt;: The Manifest List can also include summaries of partition fields, such as &lt;code&gt;lower_bound&lt;/code&gt; and &lt;code&gt;upper_bound&lt;/code&gt; for partition values, &lt;code&gt;contains_null&lt;/code&gt;, and &lt;code&gt;contains_nan&lt;/code&gt;. These summaries are incredibly useful for partition pruning during query planning, as they allow the query engine to skip entire manifests that do not contain relevant data based on the query’s filter conditions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;How These Fields Relate to Data Files&lt;/h3&gt;
&lt;p&gt;Each of these fields in the Manifest List provides critical metadata that links the snapshot to its underlying data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tracking Data Evolution&lt;/strong&gt;: The &lt;code&gt;sequence_number&lt;/code&gt; fields ensure that Iceberg can accurately track the evolution of data over time, allowing for advanced features like time travel and consistent reads across multiple queries.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimizing Queries&lt;/strong&gt;: The combination of &lt;code&gt;content&lt;/code&gt;, &lt;code&gt;partition_spec_id&lt;/code&gt;, and partition summaries allows query engines to prune unnecessary data early in the query planning phase. For instance, if a query’s filter condition does not match the &lt;code&gt;lower_bound&lt;/code&gt; or &lt;code&gt;upper_bound&lt;/code&gt; of a partition, the query engine can skip reading the associated manifest and, consequently, the data files it tracks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficient Data Management&lt;/strong&gt;: By summarizing the number and type of files (&lt;code&gt;added_files_count&lt;/code&gt;, &lt;code&gt;existing_files_count&lt;/code&gt;, &lt;code&gt;deleted_files_count&lt;/code&gt;), Iceberg ensures that only the necessary manifests are read, further optimizing performance and reducing I/O operations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;manifest-list&amp;quot;: [
    {
      &amp;quot;manifest_path&amp;quot;: &amp;quot;s3://bucket/path/to/manifest1.avro&amp;quot;,
      &amp;quot;manifest_length&amp;quot;: 1048576,
      &amp;quot;partition_spec_id&amp;quot;: 1,
      &amp;quot;content&amp;quot;: 0,
      &amp;quot;sequence_number&amp;quot;: 1001,
      &amp;quot;min_sequence_number&amp;quot;: 1000,
      &amp;quot;added_files_count&amp;quot;: 5,
      &amp;quot;existing_files_count&amp;quot;: 10,
      &amp;quot;deleted_files_count&amp;quot;: 2,
      &amp;quot;added_rows_count&amp;quot;: 500000,
      &amp;quot;existing_rows_count&amp;quot;: 1000000,
      &amp;quot;deleted_rows_count&amp;quot;: 200000,
      &amp;quot;partitions&amp;quot;: [
        {
          &amp;quot;contains_null&amp;quot;: false,
          &amp;quot;contains_nan&amp;quot;: false,
          &amp;quot;lower_bound&amp;quot;: &amp;quot;2023-01-01&amp;quot;,
          &amp;quot;upper_bound&amp;quot;: &amp;quot;2023-01-31&amp;quot;
        }
      ]
    },
    {
      &amp;quot;manifest_path&amp;quot;: &amp;quot;s3://bucket/path/to/manifest2.avro&amp;quot;,
      &amp;quot;manifest_length&amp;quot;: 2097152,
      &amp;quot;partition_spec_id&amp;quot;: 2,
      &amp;quot;content&amp;quot;: 0,
      &amp;quot;sequence_number&amp;quot;: 1002,
      &amp;quot;min_sequence_number&amp;quot;: 1001,
      &amp;quot;added_files_count&amp;quot;: 8,
      &amp;quot;existing_files_count&amp;quot;: 7,
      &amp;quot;deleted_files_count&amp;quot;: 3,
      &amp;quot;added_rows_count&amp;quot;: 750000,
      &amp;quot;existing_rows_count&amp;quot;: 700000,
      &amp;quot;deleted_rows_count&amp;quot;: 150000,
      &amp;quot;partitions&amp;quot;: [
        {
          &amp;quot;contains_null&amp;quot;: true,
          &amp;quot;contains_nan&amp;quot;: false,
          &amp;quot;lower_bound&amp;quot;: &amp;quot;2023-02-01&amp;quot;,
          &amp;quot;upper_bound&amp;quot;: &amp;quot;2023-02-28&amp;quot;
        }
      ]
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
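&lt;p&gt;Using entries shaped like the example above, a planner can combine the &lt;code&gt;content&lt;/code&gt; flag and the file counts to decide which manifests to read at all. The following is an illustrative sketch, not PyIceberg or engine code:&lt;/p&gt;

```python
# Illustrative sketch: skip manifests that track only deletes or that
# contain no live files, using the counts from the manifest list alone.

manifest_list = [
    {"manifest_path": "s3://bucket/path/to/manifest1.avro", "content": 0,
     "added_files_count": 5, "existing_files_count": 10, "deleted_files_count": 2},
    {"manifest_path": "s3://bucket/path/to/manifest2.avro", "content": 0,
     "added_files_count": 0, "existing_files_count": 0, "deleted_files_count": 3},
]

def manifests_to_scan(entries):
    scan = []
    for e in entries:
        if e["content"] != 0:
            continue                     # delete manifests are handled separately
        if e["added_files_count"] + e["existing_files_count"] == 0:
            continue                     # nothing live left to read
        scan.append(e["manifest_path"])
    return scan

print(manifests_to_scan(manifest_list))  # ['s3://bucket/path/to/manifest1.avro']
```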
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;The Manifest List is an essential component of Apache Iceberg&apos;s architecture, playing a critical role in managing large datasets with efficiency and precision. By tracking the manifest files associated with each snapshot, the Manifest List enables Iceberg to provide powerful features like atomic snapshots, time travel, and optimized query execution.&lt;/p&gt;
&lt;p&gt;Through its detailed metadata, the Manifest List allows query engines to intelligently decide which data files to scan, significantly reducing unnecessary I/O and speeding up query performance. Whether you&apos;re dealing with a data lakehouse or a complex analytics platform, understanding how the Manifest List operates can help you harness the full potential of Apache Iceberg.&lt;/p&gt;
&lt;p&gt;As the landscape of data engineering continues to evolve, tools like Iceberg, with its robust metadata management, will be increasingly vital in ensuring that data platforms remain scalable, efficient, and capable of handling the demands of modern data workloads.&lt;/p&gt;
&lt;p&gt;For those looking to dive deeper into Apache Iceberg, consider exploring the following resources:&lt;/p&gt;
&lt;h2&gt;Resources to Learn More about Iceberg&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=ev_external_blog&amp;amp;utm_medium=social_free&amp;amp;utm_campaign=manifestlistblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=ev_external_blog&amp;amp;utm_medium=social_free&amp;amp;utm_campaign=manifestlistblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=social_free&amp;amp;utm_campaign=manifestlistblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=ev_external_blog&amp;amp;utm_medium=social_free&amp;amp;utm_campaign=manifestlistblog&amp;amp;utm_content=alexmerced&amp;amp;utm_term=external_blog&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Understanding Apache Iceberg&apos;s Metadata.json</title><link>https://iceberglakehouse.com/posts/2024-8-apache-iceberg-metadata-json/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-8-apache-iceberg-metadata-json/</guid><description>
- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;...</description><pubDate>Wed, 21 Aug 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=metadatajson&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=metadatajson&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is a data lakehouse table format designed to solve many of the problems associated with large-scale data lakes, turning them into data warehouse-like platforms known as data lakehouses. It allows for schema evolution, time travel queries, and efficient data partitioning, all while maintaining compatibility with existing data processing engines. Central to Iceberg&apos;s functionality is the &lt;code&gt;metadata.json&lt;/code&gt; file, which serves as the heart of table metadata management.&lt;/p&gt;
&lt;h3&gt;Purpose of metadata.json&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;metadata.json&lt;/code&gt; file in Apache Iceberg serves several critical purposes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Centralized Table Information&lt;/strong&gt;: It acts as a single source of truth for all metadata related to an Iceberg table. This includes schema definitions, partitioning strategies, and snapshot history, making it easier for data engines to understand and interact with the table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Iceberg tables can evolve over time, with new columns added or existing ones modified. The &lt;code&gt;metadata.json&lt;/code&gt; tracks these changes, ensuring that historical data remains accessible and that queries can be executed against any point in the table&apos;s history.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Partitioning and Organization&lt;/strong&gt;: By defining how data is partitioned, &lt;code&gt;metadata.json&lt;/code&gt; helps in optimizing data storage and query performance. Partitioning strategies can be updated (partition evolution), and this file keeps track of all such changes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Snapshot Management&lt;/strong&gt;: Iceberg allows for snapshots, which are essentially versions of the table at different points in time. The metadata file records these snapshots, enabling features like time travel queries where users can query the table as it existed in the past.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consistency and Integrity&lt;/strong&gt;: It ensures that all operations on the table maintain data integrity by providing a clear reference for what data should exist where and in what state.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This file is not just a static record but a dynamic document that evolves with the table, making it an indispensable component of Apache Iceberg&apos;s architecture.&lt;/p&gt;
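&lt;p&gt;The single-source-of-truth idea can be made concrete with a short sketch: given a &lt;code&gt;metadata.json&lt;/code&gt;-style dictionary, resolve the schema currently in effect. This is illustrative only; real engines rely on the Iceberg libraries for this:&lt;/p&gt;

```python
# Illustrative sketch: resolve the active schema from metadata.json-style
# fields. Every historical schema stays in the list, which is what makes
# schema evolution and time travel possible.

metadata = {
    "current-schema-id": 2,
    "schemas": [
        {"schema-id": 1, "columns": [{"name": "id", "type": "integer"}]},
        {"schema-id": 2, "columns": [{"name": "id", "type": "integer"},
                                     {"name": "name", "type": "string"}]},
    ],
}

def current_schema(md):
    for schema in md["schemas"]:
        if schema["schema-id"] == md["current-schema-id"]:
            return schema
    raise ValueError("current-schema-id not found in schemas list")

print([c["name"] for c in current_schema(metadata)["columns"]])  # ['id', 'name']
```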
&lt;h2&gt;Detailed Breakdown of Fields&lt;/h2&gt;
&lt;h3&gt;Identification and Versioning&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;format-version&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: Integer (1 or 2).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: This field indicates the version of the Iceberg format used by the table. It&apos;s crucial for compatibility reasons; if an implementation encounters a version higher than what it supports, it must throw an exception to prevent potential data corruption or misinterpretation.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;table-uuid&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: UUID.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Each table in Iceberg has a unique identifier, the &lt;code&gt;table-uuid&lt;/code&gt;. This UUID is generated upon table creation and is used to ensure that the table&apos;s metadata matches across different operations, especially after metadata refreshes. If there&apos;s a mismatch, it indicates a potential conflict or corruption, prompting an exception.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Table Structure and Location&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;location&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: URI.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: This field specifies the base location where the table&apos;s data files, manifest files, and metadata files are stored. Writers use this to determine where to place new data, ensuring all parts of the table are correctly located.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;&lt;strong&gt;last-updated-ms&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: Timestamp in milliseconds since the Unix epoch.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: This field records the last time the metadata was updated. It&apos;s updated just before writing the metadata file, providing a timestamp for when the latest changes were committed.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Schema Management&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;schemas&lt;/strong&gt; and &lt;strong&gt;current-schema-id&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: &lt;code&gt;schemas&lt;/code&gt; is a list of schema objects, each with a unique &lt;code&gt;schema-id&lt;/code&gt;. &lt;code&gt;current-schema-id&lt;/code&gt; is the ID of the schema currently in use.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Iceberg supports schema evolution, allowing tables to change over time. The &lt;code&gt;schemas&lt;/code&gt; list keeps track of all schemas that have been used for the table, while &lt;code&gt;current-schema-id&lt;/code&gt; points to the latest schema. This setup allows for historical data to be queried with the schema it was originally written with, ensuring data consistency and flexibility in schema changes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Data Partitioning&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;partition-specs&lt;/strong&gt; and &lt;strong&gt;default-spec-id&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: &lt;code&gt;partition-specs&lt;/code&gt; is a list of full partition spec objects, each detailing how data should be partitioned. &lt;code&gt;default-spec-id&lt;/code&gt; points to the ID of the partition spec that writers should use by default.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Partitioning in Iceberg is crucial for data organization and query optimization. This field defines how data is divided into partitions, which can be based on various criteria like date, category, etc. The &lt;code&gt;default-spec-id&lt;/code&gt; ensures that new data is partitioned according to the latest strategy unless specified otherwise.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Snapshots and History&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;last-sequence-number&lt;/strong&gt;, &lt;strong&gt;current-snapshot-id&lt;/strong&gt;, &lt;strong&gt;snapshots&lt;/strong&gt;, &lt;strong&gt;snapshot-log&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: &lt;code&gt;last-sequence-number&lt;/code&gt; is a monotonically increasing long, &lt;code&gt;current-snapshot-id&lt;/code&gt; is the ID of the latest snapshot, &lt;code&gt;snapshots&lt;/code&gt; is a list of valid snapshots, and &lt;code&gt;snapshot-log&lt;/code&gt; records changes in the current snapshot.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: These fields manage the history of the table&apos;s states. Snapshots allow for time travel queries, where data can be queried as it existed at any point in time. The &lt;code&gt;snapshot-log&lt;/code&gt; helps in tracking changes to the current snapshot, aiding in understanding the evolution of the table over time.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Metadata Logging&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;metadata-log&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: A list of timestamp and metadata file location pairs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: This field logs previous metadata files, providing a history of metadata changes. It&apos;s useful for auditing, rollback operations, or understanding the evolution of the table&apos;s metadata over time.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Sorting and Ordering&lt;/h3&gt;
&lt;h4&gt;&lt;strong&gt;sort-orders&lt;/strong&gt; and &lt;strong&gt;default-sort-order-id&lt;/strong&gt;&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data&lt;/strong&gt;: &lt;code&gt;sort-orders&lt;/code&gt; is a list of sort order objects, and &lt;code&gt;default-sort-order-id&lt;/code&gt; specifies the default sort order.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: These fields define how data is sorted within partitions or tables. While sorting is more relevant for writers, it can also affect how data is read, especially for certain types of queries where data order matters.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Example Metadata.json&lt;/h2&gt;
&lt;pre&gt;&lt;code class=&quot;language-json&quot;&gt;{
  &amp;quot;format-version&amp;quot;: 2,
  &amp;quot;table-uuid&amp;quot;: &amp;quot;5f8b14d8-0a14-4e6a-8b04-7b1b9341c939&amp;quot;,
  &amp;quot;location&amp;quot;: &amp;quot;s3://my-bucket/tables/my_table&amp;quot;,
  &amp;quot;last-updated-ms&amp;quot;: 1692643200000,
  &amp;quot;last-sequence-number&amp;quot;: 100,
  &amp;quot;last-column-id&amp;quot;: 10,
  &amp;quot;schemas&amp;quot;: [
    {
      &amp;quot;schema-id&amp;quot;: 1,
      &amp;quot;columns&amp;quot;: [
        {&amp;quot;name&amp;quot;: &amp;quot;id&amp;quot;, &amp;quot;type&amp;quot;: &amp;quot;integer&amp;quot;, &amp;quot;id&amp;quot;: 1},
        {&amp;quot;name&amp;quot;: &amp;quot;name&amp;quot;, &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;id&amp;quot;: 2}
      ]
    },
    {
      &amp;quot;schema-id&amp;quot;: 2,
      &amp;quot;columns&amp;quot;: [
        {&amp;quot;name&amp;quot;: &amp;quot;id&amp;quot;, &amp;quot;type&amp;quot;: &amp;quot;integer&amp;quot;, &amp;quot;id&amp;quot;: 1},
        {&amp;quot;name&amp;quot;: &amp;quot;name&amp;quot;, &amp;quot;type&amp;quot;: &amp;quot;string&amp;quot;, &amp;quot;id&amp;quot;: 2},
        {&amp;quot;name&amp;quot;: &amp;quot;age&amp;quot;, &amp;quot;type&amp;quot;: &amp;quot;integer&amp;quot;, &amp;quot;id&amp;quot;: 3}
      ]
    }
  ],
  &amp;quot;current-schema-id&amp;quot;: 2,
  &amp;quot;partition-specs&amp;quot;: [
    {
      &amp;quot;spec-id&amp;quot;: 1,
      &amp;quot;fields&amp;quot;: [
        {&amp;quot;name&amp;quot;: &amp;quot;name&amp;quot;, &amp;quot;transform&amp;quot;: &amp;quot;identity&amp;quot;, &amp;quot;source-id&amp;quot;: 2}
      ]
    },
    {
      &amp;quot;spec-id&amp;quot;: 2,
      &amp;quot;fields&amp;quot;: [
        {&amp;quot;name&amp;quot;: &amp;quot;age&amp;quot;, &amp;quot;transform&amp;quot;: &amp;quot;bucket[4]&amp;quot;, &amp;quot;source-id&amp;quot;: 3}
      ]
    }
  ],
  &amp;quot;default-spec-id&amp;quot;: 2,
  &amp;quot;last-partition-id&amp;quot;: 4,
  &amp;quot;properties&amp;quot;: {
    &amp;quot;commit.retry.num-retries&amp;quot;: &amp;quot;5&amp;quot;
  },
  &amp;quot;current-snapshot-id&amp;quot;: 3,
  &amp;quot;snapshots&amp;quot;: [
    {&amp;quot;snapshot-id&amp;quot;: 1, &amp;quot;timestamp-ms&amp;quot;: 1692643200000},
    {&amp;quot;snapshot-id&amp;quot;: 2, &amp;quot;timestamp-ms&amp;quot;: 1692643500000},
    {&amp;quot;snapshot-id&amp;quot;: 3, &amp;quot;timestamp-ms&amp;quot;: 1692643800000}
  ],
  &amp;quot;snapshot-log&amp;quot;: [
    {&amp;quot;timestamp-ms&amp;quot;: 1692643200000, &amp;quot;snapshot-id&amp;quot;: 1},
    {&amp;quot;timestamp-ms&amp;quot;: 1692643500000, &amp;quot;snapshot-id&amp;quot;: 2},
    {&amp;quot;timestamp-ms&amp;quot;: 1692643800000, &amp;quot;snapshot-id&amp;quot;: 3}
  ],
  &amp;quot;metadata-log&amp;quot;: [
    {&amp;quot;timestamp-ms&amp;quot;: 1692643200000, &amp;quot;metadata-file&amp;quot;: &amp;quot;s3://my-bucket/tables/my_table/metadata/00001.json&amp;quot;},
    {&amp;quot;timestamp-ms&amp;quot;: 1692643500000, &amp;quot;metadata-file&amp;quot;: &amp;quot;s3://my-bucket/tables/my_table/metadata/00002.json&amp;quot;}
  ],
  &amp;quot;sort-orders&amp;quot;: [
    {
      &amp;quot;order-id&amp;quot;: 1,
      &amp;quot;fields&amp;quot;: [
        {&amp;quot;name&amp;quot;: &amp;quot;id&amp;quot;, &amp;quot;direction&amp;quot;: &amp;quot;ASC&amp;quot;, &amp;quot;null-order&amp;quot;: &amp;quot;NULLS_FIRST&amp;quot;}
      ]
    }
  ],
  &amp;quot;default-sort-order-id&amp;quot;: 1,
  &amp;quot;refs&amp;quot;: {
    &amp;quot;main&amp;quot;: {&amp;quot;snapshot-id&amp;quot;: 3}
  },
  &amp;quot;statistics&amp;quot;: [
    {
  &amp;quot;snapshot-id&amp;quot;: 3,
      &amp;quot;statistics-path&amp;quot;: &amp;quot;s3://my-bucket/tables/my_table/stats/00003.puffin&amp;quot;,
      &amp;quot;file-size-in-bytes&amp;quot;: 1024,
      &amp;quot;file-footer-size-in-bytes&amp;quot;: 64,
      &amp;quot;blob-metadata&amp;quot;: [
        {
          &amp;quot;type&amp;quot;: &amp;quot;table-stats&amp;quot;,
          &amp;quot;snapshot-id&amp;quot;: 3,
          &amp;quot;sequence-number&amp;quot;: 100,
          &amp;quot;fields&amp;quot;: [1, 2, 3],
          &amp;quot;properties&amp;quot;: {
            &amp;quot;statistic-type&amp;quot;: &amp;quot;summary&amp;quot;
          }
        }
      ]
    }
  ],
  &amp;quot;partition-statistics&amp;quot;: [
    {
      &amp;quot;snapshot-id&amp;quot;: 3,
      &amp;quot;statistics-path&amp;quot;: &amp;quot;s3://my-bucket/tables/my_table/partition_stats/00003.parquet&amp;quot;,
      &amp;quot;file-size-in-bytes&amp;quot;: 512
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
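&lt;p&gt;Time travel can be illustrated with the &lt;code&gt;snapshot-log&lt;/code&gt; from the example above: to query the table as of a timestamp, an engine picks the last snapshot committed at or before that time. The sketch below is illustrative, not an engine&apos;s actual implementation:&lt;/p&gt;

```python
# Illustrative sketch of time travel: find the snapshot that was current
# as of a given timestamp by scanning snapshot-log (ordered oldest to newest).

snapshot_log = [
    {"timestamp-ms": 1692643200000, "snapshot-id": 1},
    {"timestamp-ms": 1692643500000, "snapshot-id": 2},
    {"timestamp-ms": 1692643800000, "snapshot-id": 3},
]

def snapshot_as_of(log, ts_ms):
    current = None
    for entry in log:
        if not entry["timestamp-ms"] > ts_ms:   # committed at or before ts_ms
            current = entry["snapshot-id"]
    return current

# A timestamp between the second and third commits resolves to snapshot 2.
print(snapshot_as_of(snapshot_log, 1692643600000))  # 2
```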
&lt;h2&gt;How Engines Use metadata.json&lt;/h2&gt;
&lt;h3&gt;Query Planning&lt;/h3&gt;
&lt;p&gt;One of the primary uses of the &lt;code&gt;metadata.json&lt;/code&gt; by data processing engines is in query planning. Here&apos;s how:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition Pruning&lt;/strong&gt;: With the &lt;code&gt;partition-specs&lt;/code&gt; information, engines have the details of every historical partitioning scheme, which they match against the partition spec IDs referenced in manifest lists and manifest entries, allowing them to prune unnecessary partitions from the scan plan.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Schema Validation&lt;/strong&gt;: Before executing an operation, engines check the &lt;code&gt;current-schema-id&lt;/code&gt; and the corresponding schema definition to ensure they use the correct schema when writing new data files.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Schema Evolution&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Schema Tracking&lt;/strong&gt;: The &lt;code&gt;schemas&lt;/code&gt; list and &lt;code&gt;current-schema-id&lt;/code&gt; allow engines to understand the evolution of the table&apos;s schema. When a query involves historical data, the engine can use the schema that was active at the time the data was written, ensuring accurate results.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Data Consistency&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Snapshot Management&lt;/strong&gt;: Engines use the &lt;code&gt;current-snapshot-id&lt;/code&gt; and &lt;code&gt;snapshots&lt;/code&gt; list to ensure they are working with the latest state of the table or a specific historical snapshot. This feature is particularly useful for time travel queries or ensuring data consistency in distributed environments.&lt;/li&gt;
&lt;/ul&gt;
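&lt;p&gt;For example, a time travel query resolves a timestamp to a snapshot roughly like the sketch below; the snapshot list is illustrative, and real engines read it from &lt;code&gt;metadata.json&lt;/code&gt;.&lt;/p&gt;

```python
# Sketch of time travel: pick the latest snapshot committed at or before a
# requested timestamp, as an engine would for an "AS OF" query.
snapshots = [
    {"snapshot-id": 1, "timestamp-ms": 1000},
    {"snapshot-id": 2, "timestamp-ms": 2000},
    {"snapshot-id": 3, "timestamp-ms": 3000},
]

def snapshot_as_of(snaps, ts_ms):
    """Return the most recent snapshot not newer than ts_ms, or None."""
    eligible = [s for s in snaps if ts_ms >= s["timestamp-ms"]]
    return max(eligible, key=lambda s: s["timestamp-ms"]) if eligible else None
```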
&lt;h3&gt;Data Layout and Sorting&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Layout&lt;/strong&gt;: By consulting the partitioning and sort-order fields, engines can write new data organized according to the table&apos;s current partitioning and sorting logic.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Metadata Updates&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Metadata Logging&lt;/strong&gt;: The &lt;code&gt;metadata-log&lt;/code&gt; provides engines with a history of metadata changes. This can be used for rolling back the table to a previous state by identifying which metadata.json the catalog should reference.&lt;/li&gt;
&lt;/ul&gt;
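&lt;p&gt;Conceptually, a rollback is just re-pointing the catalog at an earlier entry from that history. The sketch below is a toy illustration with made-up file names, not a catalog API.&lt;/p&gt;

```python
# Sketch: rolling back a table by choosing an earlier metadata.json from the
# metadata-log for the catalog to reference as current.
metadata_log = [
    "s3://bucket/metadata/00001.metadata.json",
    "s3://bucket/metadata/00002.metadata.json",
    "s3://bucket/metadata/00003.metadata.json",
]

def rollback(log, steps_back):
    """Return the metadata file the catalog should reference after rollback."""
    return log[-(steps_back + 1)]

restored = rollback(metadata_log, 1)  # one version before current
```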
&lt;h3&gt;Optimistic Concurrency Controls&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sequence Number&lt;/strong&gt;: The sequence number property helps maintain consistency under concurrent transactions. A write projects its new sequence number before it begins and confirms that number is still the next one in sequence before committing. If another transaction has claimed that sequence number before the original write commits, the write can be retried.&lt;/li&gt;
&lt;/ul&gt;
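&lt;p&gt;The check described above can be modeled in a few lines of Python. This toy in-memory &lt;code&gt;Table&lt;/code&gt; class is purely illustrative and is not the real Iceberg commit API.&lt;/p&gt;

```python
# Sketch of the optimistic-concurrency check: a writer projects the next
# sequence number up front, and only commits if no other writer has claimed
# it in the meantime; otherwise it retries.
class Table:
    def __init__(self):
        self.last_sequence_number = 0

    def try_commit(self, expected_next):
        """Commit succeeds only if expected_next is still the next number."""
        if expected_next == self.last_sequence_number + 1:
            self.last_sequence_number = expected_next
            return True
        return False  # another transaction won; caller re-projects and retries

table = Table()
projected = table.last_sequence_number + 1  # writer projects sequence number 1
committed = table.try_commit(projected)     # succeeds: no concurrent claim
```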
&lt;h3&gt;Conclusion on Engine Usage&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;metadata.json&lt;/code&gt; in Apache Iceberg acts as a comprehensive guide for data engines, enabling them to efficiently manage, query, and evolve large-scale data tables. By providing detailed metadata, it allows for optimizations at various levels, from query planning to data consistency, making Iceberg tables highly performant and flexible.&lt;/p&gt;
&lt;h2&gt;Resources to Learn More about Iceberg&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=metadatajson&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=metadatajson&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=metadatajson&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=metadatajson&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What Apache Iceberg REST Catalog is and isn&apos;t</title><link>https://iceberglakehouse.com/posts/2024-8-what-apache-iceberg-rest-catalog-is-and-isnt/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-8-what-apache-iceberg-rest-catalog-is-and-isnt/</guid><description>
- [Free Copy of Apache Iceberg: The Definitive Guide](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;...</description><pubDate>Sun, 18 Aug 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=rest_catalog_is_isnt&quot;&gt;Free Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=rest_catalog_is_isnt&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&apos;ve recently written a few blogs on the evolution of Apache Iceberg catalogs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-evolution-of-apache-iceberg-catalogs/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=rest_catalog_is_isnt&quot;&gt;The Evolution of Apache Iceberg Catalogs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://medium.com/data-engineering-with-dremio/understanding-the-future-of-apache-iceberg-catalogs-ff2a2878fbc0&quot;&gt;The Future of Apache Iceberg Catalogs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this article, I aim to clarify the scope of the REST catalog specification to provide a clearer understanding of the role it plays within the broader Apache Iceberg catalog ecosystem.&lt;/p&gt;
&lt;h2&gt;What the REST Catalog Does&lt;/h2&gt;
&lt;h3&gt;Creates a Uniform Interface for Table Operations&lt;/h3&gt;
&lt;p&gt;The REST catalog provides an interface that allows any catalog to immediately support various table-level operations across multiple tools, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reading a table&lt;/li&gt;
&lt;li&gt;Creating a table&lt;/li&gt;
&lt;li&gt;Inserting data into a table&lt;/li&gt;
&lt;li&gt;Updating a table&lt;/li&gt;
&lt;li&gt;Branching at the table level&lt;/li&gt;
&lt;li&gt;Altering a table&lt;/li&gt;
&lt;/ul&gt;
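&lt;p&gt;To give a sense of what this uniformity means in practice, the sketch below builds the table endpoint paths the REST catalog specification standardizes, so any conforming client can address any conforming catalog the same way. The host and helper function are placeholders, not part of any client library.&lt;/p&gt;

```python
# Sketch of the endpoint shapes from the Iceberg REST catalog OpenAPI spec.
# BASE is a placeholder host; real deployments may also include a prefix
# segment returned by the catalog's config endpoint.
BASE = "https://catalog.example.com/v1"

def table_url(namespace, table):
    """Build the URL for table operations on a given namespace and table."""
    # Multi-level namespaces are joined with the unit separator (%1F) in the path.
    ns = "%1F".join(namespace)
    return f"{BASE}/namespaces/{ns}/tables/{table}"

url = table_url(["db"], "orders")
```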
&lt;h2&gt;What the REST Catalog Does Not Do&lt;/h2&gt;
&lt;h3&gt;Does Not Create a Uniform Interface for Non-Table Operations&lt;/h3&gt;
&lt;p&gt;The REST catalog is focused solely on table operations and does not address:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Management beyond the table level, whether at the catalog level (e.g., Nessie) or the file level (e.g., LakeFS)&lt;/li&gt;
&lt;li&gt;Security at the table or catalog level&lt;/li&gt;
&lt;li&gt;Handling non-table objects like machine learning features and other related data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While catalog services can offer a wide range of functionalities beyond managing Iceberg tables, the REST catalog interface is specifically designed for table-level operations. This doesn’t preclude the possibility of future standard interfaces for broader catalog management APIs, which may emerge from open-source catalog projects like Nessie or Apache Polaris (Incubating).&lt;/p&gt;
&lt;h3&gt;Is Not a Catalog Implementation&lt;/h3&gt;
&lt;p&gt;The REST catalog is not a deployable catalog; rather, it is a REST API specification. This specification enables multiple catalog implementations, such as Polaris and Nessie, to leverage existing REST catalog clients. By doing so, these catalogs avoid the need to create their own clients in various languages, and they can offload more logic to the server side, as opposed to the client, unlike previous catalog paradigms.&lt;/p&gt;
&lt;h3&gt;REST Catalog Support Does Not Guarantee Full Functionality&lt;/h3&gt;
&lt;p&gt;Catalogs that claim to support the REST catalog specification may implement only a subset of the available endpoints. For example, Unity OSS might utilize endpoints that allow reading an Iceberg table as part of its Delta Lake support but may not support the write endpoints necessary for writing to an Iceberg table. Therefore, when evaluating a catalog&apos;s REST catalog support, it&apos;s essential to ensure it meets the specific needs of your workloads.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The REST catalog specification is a powerful tool for standardizing table operations across various catalogs, but it’s important to understand its limitations and the scope of its functionality. As the Apache Iceberg ecosystem continues to evolve, the REST catalog will likely play a critical role in enabling interoperability between different catalogs, but users should remain aware of the specific capabilities and limitations of their chosen catalog implementations.&lt;/p&gt;
&lt;h2&gt;Resources to Learn More about Iceberg&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberg-acid&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberg-acid&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberg-acid&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberg-acid&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>ACID Guarantees and Apache Iceberg - Turning Any Storage into a Data Warehouse</title><link>https://iceberglakehouse.com/posts/2024-8-acid-guarantees-and-apache-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-8-acid-guarantees-and-apache-iceberg/</guid><description>
Apache Iceberg has become a prominent name in the data world, with numerous platforms integrating support for Iceberg tables as part of the growing o...</description><pubDate>Thu, 15 Aug 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Apache Iceberg has become a prominent name in the data world, with numerous platforms integrating support for Iceberg tables as part of the growing open data lakehouse ecosystem. A key feature often highlighted is Iceberg&apos;s ability to enable ACID transactions. In this blog, I will explore what ACID guarantees mean and how Iceberg delivers them, to help you better understand the value Apache Iceberg brings to the table.&lt;/p&gt;
&lt;h2&gt;What are ACID Guarantees?&lt;/h2&gt;
&lt;p&gt;ACID is an acronym that outlines the key guarantees a data system should provide—guarantees that are typically offered by most SQL-based databases and data warehouses. These guarantees include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Atomicity&lt;/strong&gt;: This ensures that when a change is made, it either completes successfully or doesn&apos;t occur at all. This prevents partial changes, which can be difficult and time-consuming to resolve. If a change doesn&apos;t succeed, you can simply retry it without worry.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: This ensures that everyone accessing the data sees the same version of it, maintaining uniformity across the system.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: This allows multiple users to make updates or query data simultaneously without interfering with one another.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Durability&lt;/strong&gt;: This guarantees that once data is stored, it remains available for future access.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;How Databases and Data Warehouses Do ACID&lt;/h2&gt;
&lt;p&gt;Database and Data Warehouse systems manage these guarantees by tightly coupling all the functions of a data system within their software. Their software writes data to storage in a format they control, employs its own method to catalog the written data into different tables to consistently return the correct data, and has built-in mechanisms to prevent concurrent transactions from affecting each other or allowing partial completion. These guarantees are possible because every aspect of the system is designed to work seamlessly together, effectively trapping the data within the system.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg Unleashes ACID on Data Lakes&lt;/h2&gt;
&lt;p&gt;A Lakehouse Table Format like Apache Iceberg takes what previously required tightly coupled systems and achieves it by creating a specification for a series of metadata files that define a table and the individual files from storage that belong to that table. This metadata inherently ensures consistency, as instead of manually listing which files constitute a dataset, users can simply point their tools to the metadata to get a consistent definition.&lt;/p&gt;
&lt;p&gt;To incorporate atomicity and isolation, Iceberg introduces the concept of a catalog, which acts as both an arbiter of truth and a traffic controller for those requesting to update or read particular tables. An update isn&apos;t visible to readers until the catalog is updated with the address of the newest metadata from the successful transaction. If a transaction partially completes and fails, the data is never exposed since the catalog never references it. Each update to the table is assigned a sequence number, allowing subsequent updates to predict what number they should receive and double-check whether other transactions have completed before committing their own. This approach effectively turns many of the traditional guarantees into file-based operations rather than software-based, with the software that fills in the gaps being decoupled and modular. This creates a plug-and-play data system that doesn&apos;t lock the data within any particular layer.&lt;/p&gt;
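&lt;p&gt;The commit flow described above can be modeled as a single compare-and-swap on the catalog&apos;s metadata pointer. The class below is a toy in-memory illustration, not a real catalog client: a failed or half-finished write never updates the pointer, so readers never see it.&lt;/p&gt;

```python
# Toy model of the catalog's role in atomicity and isolation: writers stage
# new metadata files off to the side, and only the atomic pointer swap makes
# a change visible to readers.
class Catalog:
    def __init__(self, initial):
        self.pointer = initial  # current metadata.json address

    def compare_and_swap(self, expected, new):
        """Commit only if no other writer committed first."""
        if self.pointer == expected:
            self.pointer = new
            return True
        return False  # stale view: the writer must refresh and retry

cat = Catalog("v1.metadata.json")
seen = cat.pointer                              # writer reads current state
ok = cat.compare_and_swap(seen, "v2.metadata.json")  # atomic commit attempt
```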
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg represents a significant evolution in how ACID guarantees can be applied, turning storage-based data lakes into warehouse-like data lakehouses. By decoupling the traditional functions of databases and data warehouses, Iceberg empowers data lakes with the ability to maintain consistency, atomicity, isolation, and durability without the need for tightly coupled systems. This flexibility allows for a modular, scalable, and open architecture that can adapt to various use cases and integrate with a wide range of tools.&lt;/p&gt;
&lt;h2&gt;Resources to Learn More about Iceberg&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberg-acid&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberg-acid&quot;&gt;Hands-on Intro with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberg-acid&quot;&gt;Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html?utm_source=alexmerced&amp;amp;utm_medium=external_blog&amp;amp;utm_campaign=iceberg-acid&quot;&gt;Free Copy Of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Data Lakehouse 101 - The Who, What and Why of Data Lakehouses</title><link>https://iceberglakehouse.com/posts/2024-8-data-lakehouses-101/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-8-data-lakehouses-101/</guid><description>
- [Sign-up for this free Apache Iceberg Crash Course](https://bit.ly/am-2024-iceberg-live-crash-course-1)
- [Get a free copy of Apache Iceberg the De...</description><pubDate>Mon, 05 Aug 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-2024-iceberg-live-crash-course-1&quot;&gt;Sign-up for this free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-book&quot;&gt;Get a free copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The scale of data is growing every day, with storage now reaching petabyte and exabyte levels that need to be utilized in increasingly diverse ways. At this scale, the cost and impracticality of the old paradigm, in which data lakes store both structured and unstructured data and portions of the structured data are then moved into data warehouses for reporting, analytics, dashboards, and more, create constant friction. The friction arises from storing multiple copies of data for each system used, keeping this data in sync and consistent, and delivering it at the speeds modern workloads demand. Addressing these challenges is where a new architecture called data lakehouses comes into play.&lt;/p&gt;
&lt;h2&gt;WHAT is a Data Lakehouse?&lt;/h2&gt;
&lt;p&gt;A data lakehouse aims to bring the performance and ease of use of a data warehouse to the data already stored in your data lake. It establishes your data lake (a storage layer for storing data as files) as the source of truth, with the goal of keeping the bulk of your data on the lake. This is made possible through several key technologies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Parquet&lt;/strong&gt;: A binary columnar file format for fast analytics on datasets, used for storing structured data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;: An open lakehouse table format, a standard for reading and writing metadata that allows for consistent recognition of a group of Parquet files as a table. This enables tools to treat datasets across multiple Parquet files as singular database-like tables with features such as time travel, schema evolution, partition evolution, and ACID guarantees.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open Source Catalogs (e.g., Nessie and Polaris)&lt;/strong&gt;: These technologies allow you to track the tables that exist in your data lakehouse, ensuring any tool can have immediate awareness of your entire library of datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By using a storage layer like Hadoop or object storage and leveraging the technologies above, you can construct a data lakehouse, which can also be thought of as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A modern data lake&lt;/li&gt;
&lt;li&gt;A headless data warehouse&lt;/li&gt;
&lt;li&gt;A deconstructed data warehouse&lt;/li&gt;
&lt;li&gt;A modular data platform&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At this point, you can use various tools like Dremio, Snowflake, Apache Spark, Apache Flink, and more to run workloads on your data lakehouse without duplicating your data across each platform you use.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;            +----------------------------+
            |      Data Lakehouse        |
            +----------------------------+
                           |
           +------------------------------------+
           |           Storage Layer            |
           |  (Hadoop, Object Storage, etc.)    |
           +------------------------------------+
                           |
           +------------------------------------+
           |         File Formats               |
           |      (Apache Parquet, etc.)        |
           +------------------------------------+
                           |
           +------------------------------------+
           |           Table Formats            |
           |      (Apache Iceberg, etc.)        |
           +------------------------------------+
                           |
           +------------------------------------+
           |         Metadata Catalogs          |
           |      (Nessie, Polaris, etc.)       |
           +------------------------------------+
                           |
           +------------------------------------+
           |         Data Processing            |
           | (Dremio, Snowflake, Apache Spark,  |
           |        Apache Flink, etc.)         |
           +------------------------------------+

&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;WHY a data lakehouse?&lt;/h2&gt;
&lt;p&gt;Transitioning to a data lakehouse architecture offers significant advantages, particularly by reducing the need for multiple copies of your data. Traditional data architectures often require creating separate copies of data for different systems, leading to increased storage costs and complexities in data management. A data lakehouse centralizes your data storage, allowing various tools and applications to access the same data without duplication. This not only simplifies data governance and consistency but also reduces the storage overhead, resulting in substantial cost savings.&lt;/p&gt;
&lt;p&gt;Moreover, a data lakehouse facilitates seamless toolset migration and concurrent use of multiple tools without incurring high migration costs. By leveraging open standards and interoperable technologies like Apache Iceberg and open-source catalogs, you can easily switch between different data processing and analytics tools such as Dremio, Snowflake, Apache Spark, and Apache Flink without needing to move or transform your data. This flexibility reduces the total cost of ownership, as you avoid the egress fees and compute expenses associated with transferring data between platforms. Consequently, you can achieve lower overall compute, storage, and egress costs, while simultaneously benefiting from the strengths of various tools operating on the same unified dataset.&lt;/p&gt;
&lt;h2&gt;HOW to Migrate to a Data Lakehouse&lt;/h2&gt;
&lt;p&gt;Migrating from existing systems to a data lakehouse is a process where &lt;a href=&quot;https://www.dremio.com/solutions/data-lakehouse/&quot;&gt;Dremio, the data lakehouse platform&lt;/a&gt;, truly excels, thanks to its data virtualization features. Dremio provides an easy-to-use interface across all your data, wherever it resides. If Dremio supports your legacy and new data systems, you can follow this migration pattern to ensure a smooth transition.&lt;/p&gt;
&lt;h3&gt;Step 1: Apply Dremio Over Your Legacy System&lt;/h3&gt;
&lt;p&gt;Begin by implementing Dremio on top of your existing legacy system. This initial step offers immediate ease-of-use and performance improvements, allowing your teams to familiarize themselves with workflows that will persist throughout the migration process. This approach ensures minimal disruption, as teams can continue their operations seamlessly.&lt;/p&gt;
&lt;h3&gt;Step 2: Connect Both Old and New Data Systems to Dremio&lt;/h3&gt;
&lt;p&gt;Next, connect both your legacy and new data systems to Dremio. Start migrating data to your data lakehouse while maintaining minimal disruption to your end users, who will continue using Dremio as their unified interface. This dual connection phase enables a smooth transition by ensuring that data is accessible and manageable from a single point, regardless of its location.&lt;/p&gt;
&lt;h3&gt;Step 3: Retire Old Systems After Data Migration&lt;/h3&gt;
&lt;p&gt;Once the data migration is complete, you can retire your old systems. Thanks to Dremio&apos;s unified interface, this step involves no major disruptions. Users will continue to access data seamlessly through Dremio, without adapting to new systems or interfaces. This continuity ensures that your operations remain efficient and uninterrupted.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s power lies in providing a central, unified interface across all your data lakes, data warehouses, lakehouse catalogs, and databases. This means your end users don&apos;t have to worry about where the data lives, allowing them to focus on deriving insights and driving value from the data.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+-------------------------------------------------------+
|           Migration to a Data Lakehouse               |
+-------------------------------------------------------+
|                                                       |
| Step 1: Apply Dremio Over Legacy System               |
|   +---------------------------------------------+     |
|   |          Legacy System                      |     |
|   |---------------------------------------------|     |
|   |                                             |     |
|   | +-------------+    +-------------+          |     |
|   | |  Data Store |    |  Data Store |          |     |
|   | +-------------+    +-------------+          |     |
|   +---------------------------------------------+     |
|                         |                             |
|                         |                             |
|                         v                             |
|                  +-------------+                      |
|                  |    Dremio   |                      |
|                  +-------------+                      |
|                                                       |
+-------------------------------------------------------+
|                                                       |
| Step 2: Connect Both Old and New Systems to Dremio    |
|   +---------------------------------------------+     |
|   |          Legacy System                      |     |
|   |---------------------------------------------|     |
|   |                                             |     |
|   | +-------------+    +-------------+          |     |
|   | |  Data Store |    |  Data Store |          |     |
|   | +-------------+    +-------------+          |     |
|   +---------------------------------------------+     |
|                         |                             |
|                         v                             |
|                  +-------------+                      |
|                  |    Dremio   |                      |
|                  +-------------+                      |
|                         |                             |
|                         v                             |
|   +---------------------------------------------+     |
|   |           New Data Lakehouse                |     |
|   |---------------------------------------------|     |
|   |                                             |     |
|   | +-------------+    +-------------+          |     |
|   | |  Data Store |    |  Data Store |          |     |
|   | +-------------+    +-------------+          |     |
|   +---------------------------------------------+     |
|                                                       |
+-------------------------------------------------------+
|                                                       |
| Step 3: Retire Old Systems After Data Migration       |
|   +---------------------------------------------+     |
|   |           New Data Lakehouse                |     |
|   |---------------------------------------------|     |
|   |                                             |     |
|   | +-------------+    +-------------+          |     |
|   | |  Data Store |    |  Data Store |          |     |
|   | +-------------+    +-------------+          |     |
|   +---------------------------------------------+     |
|                         |                             |
|                         v                             |
|                  +-------------+                      |
|                  |    Dremio   |                      |
|                  +-------------+                      |
|                                                       |
+-------------------------------------------------------+

&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;When Should you go for a data lakehouse?&lt;/h2&gt;
&lt;p&gt;Moving to a data lakehouse or staying with your existing systems depends on several critical factors. Here are key parameters to consider:&lt;/p&gt;
&lt;h3&gt;1. Data Volume and Growth&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Current Data Volume&lt;/strong&gt;: Assess the data you currently manage. A data lakehouse might offer better scalability if you are already dealing with petabytes of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Projected Data Growth&lt;/strong&gt;: Consider future data growth. If you expect a significant increase in data volume, a data lakehouse can provide the necessary infrastructure to handle it efficiently.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;2. Data Complexity&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Structured vs. Unstructured Data&lt;/strong&gt;: Evaluate the types of data you store. A data lakehouse is ideal for environments that handle a mix of structured and unstructured data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Sources&lt;/strong&gt;: Determine the number of data sources you integrate. A data lakehouse can simplify management and access to diverse data sources.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;3. Performance Needs&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Query Performance&lt;/strong&gt;: If your current system struggles with slow query performance, a data lakehouse can offer improved speed and efficiency through optimized storage formats and indexing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Real-Time Analytics&lt;/strong&gt;: Consider if real-time analytics are crucial for your business. Data lakehouses support real-time data processing and analytics, making them suitable for dynamic and fast-paced environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;4. Cost Considerations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Storage Costs&lt;/strong&gt;: Compare the storage costs of maintaining multiple copies of data in your current system versus a centralized data lakehouse.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute Costs&lt;/strong&gt;: Analyze compute costs associated with your existing architecture. Data lakehouses often provide more cost-effective compute options.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Migration Costs&lt;/strong&gt;: Factor in the costs of migrating data and operations. Dremio&apos;s data virtualization can minimize migration expenses by allowing you to use existing workflows during the transition.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;5. Tool Compatibility&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Existing Tools&lt;/strong&gt;: Review the tools you currently use. Ensure that they are compatible with a data lakehouse architecture.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Future Tools&lt;/strong&gt;: Consider future tool requirements. A data lakehouse offers flexibility and interoperability with a wide range of data processing and analytics tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6. Data Governance and Compliance&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Data Governance&lt;/strong&gt;: Evaluate your data governance needs. Data lakehouses provide robust governance features like metadata management and data lineage tracking.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compliance Requirements&lt;/strong&gt;: Ensure that a data lakehouse can meet regulatory compliance standards relevant to your industry.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;7. Business Objectives&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Strategic Goals&lt;/strong&gt;: Align the decision with your strategic business goals. If agility, scalability, and innovation are priorities, a data lakehouse might be the right choice.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Experience&lt;/strong&gt;: Consider the impact on end users. A unified interface through a data lakehouse can simplify data access and enhance user productivity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By carefully evaluating these parameters, you can decide whether to transition to a data lakehouse or continue with your existing systems. Each organization’s needs are unique, so it’s essential to weigh these factors in the context of your specific requirements and objectives.&lt;/p&gt;
&lt;h2&gt;Where can I learn more about Data Lakehouse?&lt;/h2&gt;
&lt;p&gt;Below are several tutorials you can use to get hands-on with the data lakehouse on your laptop, see this architecture in action, and then apply it to your own use case.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://main.datalakehousehub.com/blog/2024-04-end-to-end-data-engineering-tutorial-ingest-dashboards/&quot;&gt;End-to-End Basic Data Engineering Tutorial (Spark, Dremio, Superset)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Intro to Apache Iceberg, Nessie and Dremio on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-json-csv-parquet-dremio&quot;&gt;JSON/CSV/Parquet to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-mongodb-dashboard&quot;&gt;From MongoDB to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-sqlserver-dashboard&quot;&gt;From SQLServer to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-postgres-to-dashboard&quot;&gt;From Postgres to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/dremio-experience&quot;&gt;Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-elastic&quot;&gt;Elasticsearch to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-mysql-dashboard&quot;&gt;MySQL to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-kafka-connect-dremio&quot;&gt;Apache Kafka to Apache Iceberg to Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;Apache Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-druid-dremio&quot;&gt;Apache Druid to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/end-to-end-de-tutorial&quot;&gt;Postgres to Apache Iceberg to Dashboard with Spark &amp;amp; Dremio&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Apache Iceberg Reliability</title><link>https://iceberglakehouse.com/posts/2024-7-Apache-Iceberg-Reliability/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-7-Apache-Iceberg-Reliability/</guid><description>
- [Get a Free Copy of &quot;Apache Iceberg: The Definitive Guide&quot;](https://bit.ly/am-iceberg-book)
- [Sign Up for the Free Apache Iceberg Crash Course](ht...</description><pubDate>Fri, 26 Jul 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-book&quot;&gt;Get a Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-2024-iceberg-live-crash-course-1&quot;&gt;Sign Up for the Free Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/Lakehouselinkups&quot;&gt;Calendar of Data Lakehouse Events&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Apache Iceberg&lt;/a&gt; is a powerful table format designed to handle large analytic datasets reliably and efficiently. Reliability in data management is crucial for ensuring data integrity, consistency, and availability. This blog explores how Apache Iceberg addresses reliability concerns and provides robust solutions for data lakehouse architectures.&lt;/p&gt;
&lt;h2&gt;Background&lt;/h2&gt;
&lt;h3&gt;Problems with Hive Tables in S3&lt;/h3&gt;
&lt;p&gt;Hive tables have long been used for managing data in distributed systems like S3. However, they come with several inherent problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Central Metastore and File System Tracking&lt;/strong&gt;: Hive tables use a central metastore to track partitions and the file system to track individual files. With this setup, the only atomic change possible is rewriting and then swapping a single partition; changes spanning a table&apos;s full contents cannot be made atomically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eventual Consistency&lt;/strong&gt;: In eventually consistent stores like S3, listing files to reconstruct the state of a table can lead to incorrect results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Slow Listing Calls&lt;/strong&gt;: Job planning requires many slow listing calls (O(n) with the number of partitions), which can significantly impact performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Apache Iceberg&apos;s Approach to Reliability&lt;/h2&gt;
&lt;h3&gt;Persistent Tree Structure&lt;/h3&gt;
&lt;p&gt;Apache Iceberg was designed to overcome these issues by &lt;a href=&quot;https://www.dremio.com/blog/how-apache-iceberg-is-built-for-open-optimized-performance/&quot;&gt;implementing a persistent tree structure&lt;/a&gt; to track data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Snapshots&lt;/strong&gt;: Each write or delete operation produces a new snapshot that includes the complete list of data files.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Metadata Reuse&lt;/strong&gt;: Iceberg reuses as much of the previous snapshot&apos;s metadata tree as possible to minimize write volumes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
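&lt;p&gt;To make this concrete, here is a minimal Python sketch of the snapshot model (illustrative only, not Iceberg&apos;s actual implementation): each append produces a new snapshot that reuses every prior manifest and adds exactly one new manifest for the appended files.&lt;/p&gt;

```python
# Minimal sketch of snapshot-based file tracking (illustrative, not Iceberg code).
# Each snapshot records the full list of manifests; an append reuses every
# existing manifest object and adds exactly one new manifest for the new files.

class Table:
    def __init__(self):
        self.snapshots = []          # linear history of snapshots

    def current(self):
        return self.snapshots[-1] if self.snapshots else []

    def append(self, data_files):
        new_manifest = tuple(data_files)                  # one new manifest per append
        snapshot = list(self.current()) + [new_manifest]  # reuse prior manifests as-is
        self.snapshots.append(snapshot)
        return snapshot

table = Table()
table.append(["a.parquet", "b.parquet"])
table.append(["c.parquet"])

# The second snapshot reuses the first manifest unchanged and adds one more:
assert table.snapshots[1][0] is table.snapshots[0][0]
assert len(table.snapshots[1]) == 2
```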
&lt;h3&gt;Atomic Operations&lt;/h3&gt;
&lt;p&gt;Iceberg ensures atomicity in its operations by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Table Metadata File&lt;/strong&gt;: Valid snapshots are stored in the table metadata file (metadata.json), with a reference to the current snapshot.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Atomic Commits&lt;/strong&gt;: Commits replace the path of the current table metadata file using an atomic operation, ensuring that all updates to table data and metadata are atomic. This is the basis for serializable isolation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
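&lt;p&gt;Conceptually, the atomic swap behaves like a compare-and-swap on the pointer to the current metadata file. The sketch below is a hypothetical model of that behavior, not actual catalog code:&lt;/p&gt;

```python
# Sketch of an atomic metadata-pointer swap (illustrative compare-and-swap).
# The catalog holds one pointer: the path of the current metadata.json.
# A commit succeeds only if the pointer still equals the value the writer
# read when it started; otherwise another commit won, and nothing changes.

class Catalog:
    def __init__(self, initial_pointer):
        self.pointer = initial_pointer

    def commit(self, expected, new_pointer):
        # In a real catalog this check-and-set is a single atomic operation.
        if self.pointer != expected:
            return False            # someone else committed first
        self.pointer = new_pointer
        return True

catalog = Catalog("metadata/v1.json")
base = catalog.pointer

assert catalog.commit(base, "metadata/v2.json") is True
# A second writer still holding the old base fails cleanly:
assert catalog.commit(base, "metadata/v2-other.json") is False
assert catalog.pointer == "metadata/v2.json"
```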
&lt;h3&gt;Serializable Isolation&lt;/h3&gt;
&lt;p&gt;Serializable isolation is a key feature that enhances reliability:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Linear History&lt;/strong&gt;: All table changes occur in a linear history of atomic updates.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Consistent Snapshot Reads&lt;/strong&gt;: Readers always use a consistent snapshot of the table without holding a lock, ensuring reliable reads.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Version History and Rollback&lt;/strong&gt;: Table snapshots are kept as history, allowing tables to roll back to previous states if a job produces bad data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Safe File-Level Operations&lt;/strong&gt;: By supporting atomic changes, Iceberg enables safe operations like compacting small files and appending late data to tables.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
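&lt;p&gt;Because older snapshots are retained in the table&apos;s history, rolling back is a metadata-only change: the current pointer simply moves to an earlier snapshot. A toy illustration:&lt;/p&gt;

```python
# Toy illustration of version history and rollback: the table keeps every
# snapshot id in order, plus an index saying which one is "current".

class History:
    def __init__(self):
        self.snapshot_ids = []
        self.current_index = None

    def add(self, snapshot_id):
        self.snapshot_ids.append(snapshot_id)
        self.current_index = len(self.snapshot_ids) - 1

    def rollback_to(self, snapshot_id):
        # Older snapshots are retained, so rolling back is a metadata-only
        # change: no data files are rewritten.
        self.current_index = self.snapshot_ids.index(snapshot_id)

    def current(self):
        return self.snapshot_ids[self.current_index]

h = History()
for sid in (101, 102, 103):   # three commits; suppose 103 produced bad data
    h.add(sid)
h.rollback_to(102)
assert h.current() == 102
assert h.snapshot_ids == [101, 102, 103]   # the history itself is preserved
```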
&lt;h2&gt;Benefits of Iceberg&apos;s Design&lt;/h2&gt;
&lt;h3&gt;Improved Reliability Guarantees&lt;/h3&gt;
&lt;p&gt;Iceberg&apos;s design provides several reliability guarantees:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Serializable Isolation&lt;/strong&gt;: Ensures all changes are atomic and occur in a linear sequence.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reliable Reads&lt;/strong&gt;: Readers always access a consistent state of the table.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Version History and Rollback&lt;/strong&gt;: Facilitates easy rollback to previous table states.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Safe File-Level Operations&lt;/strong&gt;: Supports operations that require atomic changes, enhancing data integrity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Performance Benefits&lt;/h3&gt;
&lt;p&gt;In addition to reliability, Iceberg&apos;s design also offers performance advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;O(1) RPCs for Job Planning&lt;/strong&gt;: Instead of listing O(n) directories, planning a job requires O(1) RPC calls.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Distributed Planning&lt;/strong&gt;: File pruning and predicate push-down are distributed to jobs, eliminating the metastore as a bottleneck.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Finer Granularity Partitioning&lt;/strong&gt;: Removes barriers to finer-grained partitioning, improving query performance.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Concurrent Write Operations&lt;/h2&gt;
&lt;h3&gt;Optimistic Concurrency&lt;/h3&gt;
&lt;p&gt;Apache Iceberg supports multiple concurrent writes using optimistic concurrency:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Assumption of No Concurrent Writers&lt;/strong&gt;: Each writer operates under the assumption that no other writers are working simultaneously.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Atomic Swaps&lt;/strong&gt;: Writers attempt to commit by atomically swapping the new table metadata file for the existing one.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retry Mechanism&lt;/strong&gt;: If the atomic swap fails due to another writer&apos;s commit, the failed writer retries by writing a new metadata tree based on the latest table state.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
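&lt;p&gt;A writer&apos;s commit loop can be sketched as follows, assuming a catalog that exposes a check-and-set on the current metadata pointer (a simplified model, not Iceberg&apos;s actual code):&lt;/p&gt;

```python
# Simplified sketch of an optimistic-concurrency commit with retries.
# The "catalog" is one pointer with a check-and-set; a writer re-reads the
# latest state and retries until its swap succeeds.

class PointerCatalog:
    def __init__(self, pointer):
        self.pointer = pointer

    def check_and_set(self, expected, new):
        if self.pointer != expected:
            return False
        self.pointer = new
        return True

def commit_with_retries(catalog, build_metadata, max_attempts=5):
    for attempt in range(max_attempts):
        base = catalog.pointer                   # read the latest table state
        new = build_metadata(base)               # write a new metadata tree
        if catalog.check_and_set(base, new):     # atomic swap
            return new
        # Swap failed: another writer committed; loop and rebase on the
        # new current state instead of failing the whole job.
    raise RuntimeError("too many concurrent commits")

catalog = PointerCatalog("v1")
first_try = [True]

def build(base):
    # Simulate a concurrent writer sneaking in before our first swap:
    if first_try[0]:
        first_try[0] = False
        catalog.pointer = base + ".other"        # the concurrent commit lands
    return base + ".mine"

result = commit_with_retries(catalog, build)
assert result == "v1.other.mine"                 # succeeded on the retry
assert catalog.pointer == "v1.other.mine"
```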
&lt;h3&gt;Cost of Retries&lt;/h3&gt;
&lt;p&gt;Iceberg minimizes the cost of retries by structuring changes to be reusable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reusable Work&lt;/strong&gt;: For instance, appends usually create a new manifest file for the appended data files, which can be added without rewriting the manifest on every attempt.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficient Retry&lt;/strong&gt;: This approach avoids expensive operations during retries, making the process efficient and reliable.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Retry Validation&lt;/h3&gt;
&lt;p&gt;Commit operations in Iceberg are based on assumptions and actions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Assumption Checking&lt;/strong&gt;: A writer checks if the assumptions are still valid based on the current table state after a conflict.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Safe Re-application&lt;/strong&gt;: If the assumptions hold, the writer re-applies the actions and commits the changes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: A compaction operation might rewrite &lt;code&gt;file_a.avro&lt;/code&gt; and &lt;code&gt;file_b.avro&lt;/code&gt; into &lt;code&gt;merged.parquet&lt;/code&gt;. The commit is safe if both source files remain in the table. If not, the operation fails.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
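&lt;p&gt;The compaction example above can be sketched as a validation step that re-checks the writer&apos;s assumptions against the current table state before re-applying the action (illustrative Python, not Iceberg&apos;s implementation):&lt;/p&gt;

```python
# Sketch of commit validation for a compaction: the rewrite of file_a.avro
# and file_b.avro into merged.parquet is only safe to (re)apply if both
# source files are still part of the current table state.

def apply_compaction(current_files, sources, merged):
    # Assumption check: every source file must still be in the table.
    if not set(sources).issubset(current_files):
        raise ValueError("a conflicting commit removed a source file; abort")
    # Action: drop the sources, add the merged file.
    return (current_files - set(sources)) | {merged}

state = {"file_a.avro", "file_b.avro", "file_c.avro"}
state = apply_compaction(state, ["file_a.avro", "file_b.avro"], "merged.parquet")
assert state == {"file_c.avro", "merged.parquet"}

# If a concurrent delete already removed file_b.avro, the retry must fail:
try:
    apply_compaction({"file_a.avro"}, ["file_a.avro", "file_b.avro"], "m.parquet")
    assert False, "expected a conflict"
except ValueError:
    pass
```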
&lt;h2&gt;Compatibility and Format Versioning&lt;/h2&gt;
&lt;h3&gt;Compatibility with Object Stores&lt;/h3&gt;
&lt;p&gt;Iceberg tables are designed to be compatible with any object store:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Avoiding File Listing and Rename Operations&lt;/strong&gt;: Iceberg tables do not rely on these operations, which makes them compatible with eventually consistent stores like S3.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Metadata-Driven Operations&lt;/strong&gt;: All operations are driven by metadata, ensuring consistency and reliability.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg&apos;s design principles and features make it a highly reliable solution for managing large analytic datasets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Serializable Isolation and Atomic Operations&lt;/strong&gt;: Ensure data consistency and reliability.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimistic Concurrency&lt;/strong&gt;: Supports efficient and reliable concurrent write operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compatibility and Performance&lt;/strong&gt;: Offers compatibility with various object stores and enhances performance through efficient metadata management.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By adopting Apache Iceberg, organizations can achieve a reliable, scalable, and performant data lakehouse architecture, ensuring data integrity and consistency across their data management workflows.&lt;/p&gt;
&lt;h5&gt;GET HANDS-ON&lt;/h5&gt;
&lt;p&gt;Below is a list of exercises to help you get hands-on with Apache Iceberg and see all of this in action yourself!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Intro to Apache Iceberg, Nessie and Dremio on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-json-csv-parquet-dremio&quot;&gt;JSON/CSV/Parquet to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-mongodb-dashboard&quot;&gt;From MongoDB to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-sqlserver-dashboard&quot;&gt;From SQLServer to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-postgres-to-dashboard&quot;&gt;From Postgres to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/dremio-experience&quot;&gt;Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-elastic&quot;&gt;Elasticsearch to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-mysql-dashboard&quot;&gt;MySQL to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-kafka-connect-dremio&quot;&gt;Apache Kafka to Apache Iceberg to Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;Apache Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-druid-dremio&quot;&gt;Apache Druid to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/end-to-end-de-tutorial&quot;&gt;Postgres to Apache Iceberg to Dashboard with Spark &amp;amp; Dremio&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Understanding the Polaris Iceberg Catalog and Its Architecture</title><link>https://iceberglakehouse.com/posts/2024-7-Understanding-Polaris-Apache-Iceberg-Catalog/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-7-Understanding-Polaris-Apache-Iceberg-Catalog/</guid><description>
NOTE: I am working on a hands-on tutorial for Polaris, so please watch for the [Dremio Blog](https://www.dremio.com/blog) in the coming days. Also, c...</description><pubDate>Sat, 20 Jul 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;NOTE: I am working on a hands-on tutorial for Polaris, so please watch for the &lt;a href=&quot;https://www.dremio.com/blog&quot;&gt;Dremio Blog&lt;/a&gt; in the coming days. Also, check out many other great articles on the Dremio blog about Apache Iceberg, Data Lakehouses, and more.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://hello.dremio.com/webcast-an-apache-iceberg-lakehouse-crash-course-reg.html&quot;&gt;Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-evolution-of-apache-iceberg-catalogs/&quot;&gt;Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/apache-iceberg-101/&quot;&gt;Apache Iceberg&lt;/a&gt; has gained popularity for transforming data lakes into &lt;a href=&quot;https://www.dremio.com/lakehouse-deep-dives/dremio-101/&quot;&gt;data lakehouses&lt;/a&gt;, enabling them to serve as the central hub of modern data architecture. A key component of this transformation is using catalogs, which organize and manage data efficiently. &lt;a href=&quot;https://bit.ly/am-polaris-repo&quot;&gt;Polaris, a new open-source solution&lt;/a&gt;, offers a robust and flexible way to work with Apache Iceberg catalogs. In this blog, we&apos;ll explore Polaris, the concepts of internal and external catalogs, the various entities it manages, and its principals and security model. By the end, you&apos;ll understand how Polaris enhances your data lakehouse experience and streamlines data management.&lt;/p&gt;
&lt;p&gt;Resources Around the Open-Sourcing of Polaris:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-polaris-catalog-announce&quot;&gt;Announcement Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-datanami-polaris-nessie&quot;&gt;Datanami Article about Merging of Polaris/Nessie&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;What is Polaris?&lt;/h3&gt;
&lt;p&gt;Polaris is an open-source catalog service designed to manage Apache Iceberg catalogs efficiently. It provides a robust framework for organizing, accessing, and securing data within a data lakehouse architecture. Polaris is a critical layer bridging the gap between raw data storage and advanced data processing and analytics tools.&lt;/p&gt;
&lt;p&gt;Polaris offers several features that make it a valuable tool for data management:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Polaris can handle large-scale data operations, making it suitable for enterprises with vast amounts of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: It supports various storage types, including S3, Azure, and GCS, allowing you to choose the best storage solution for your needs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interoperability&lt;/strong&gt;: Polaris integrates seamlessly with popular data processing engines like Apache Spark, Apache Flink, Snowflake, and Dremio, enabling you to leverage existing tools and workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security&lt;/strong&gt;: It implements a role-based access control (RBAC) model, ensuring that data access and management are both secure and compliant with organizational policies.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By providing a unified catalog service, Polaris simplifies the management of Iceberg tables and views, helping organizations maintain a coherent and efficient data architecture. In the following sections, we will delve deeper into the internal and external catalog concepts, entities managed by Polaris, and its principles and security model.&lt;/p&gt;
&lt;h2&gt;Entities in Polaris&lt;/h2&gt;
&lt;p&gt;Polaris manages &lt;a href=&quot;https://github.com/polaris-catalog/polaris/blob/main/docs/entities.md&quot;&gt;several key entities&lt;/a&gt; that form the backbone of its cataloging system. These entities include catalogs, namespaces, tables, and views. Understanding these entities and their relationships is crucial for effectively utilizing Polaris in a data lakehouse architecture.&lt;/p&gt;
&lt;h3&gt;Catalogs&lt;/h3&gt;
&lt;p&gt;A catalog is Polaris&apos;s top-level entity. It serves as a container for other entities, organizing data into a structured hierarchy. Polaris catalogs map directly to Apache Iceberg catalogs and are associated with a specific storage type, such as S3, Azure, or GCS. This association defines where the data within the catalog resides and the credentials required to access it.&lt;/p&gt;
&lt;h3&gt;Namespaces&lt;/h3&gt;
&lt;p&gt;Namespaces are logical entities within a catalog that can contain tables and views. They act as organizational units, similar to schemas or databases in traditional relational database systems. Namespaces in Polaris can be nested up to 16 levels, allowing for flexible and granular data organization. For example, a namespace structure like &lt;code&gt;a.b.c.d&lt;/code&gt; represents a nested hierarchy where &lt;code&gt;d&lt;/code&gt; resides within &lt;code&gt;c&lt;/code&gt;, which is within &lt;code&gt;b&lt;/code&gt;, and so on.&lt;/p&gt;
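&lt;p&gt;The nesting rule can be illustrated with a small hypothetical helper (this is not a Polaris API; the names and validation logic are purely illustrative):&lt;/p&gt;

```python
# Hypothetical helper illustrating Polaris-style nested namespaces: a dotted
# name like "a.b.c.d" is a hierarchy, and nesting is limited to 16 levels.
# (Illustrative only; not a Polaris API.)

MAX_LEVELS = 16

def parse_namespace(name):
    levels = name.split(".")
    if any(part == "" for part in levels):
        raise ValueError(f"empty level in namespace: {name!r}")
    if len(levels) not in range(1, MAX_LEVELS + 1):
        raise ValueError(f"namespaces may nest at most {MAX_LEVELS} levels")
    return levels

def parent(levels):
    # The parent of a.b.c.d is a.b.c; a top-level namespace has no parent.
    return levels[:-1] if len(levels) != 1 else None

ns = parse_namespace("a.b.c.d")
assert ns == ["a", "b", "c", "d"]
assert parent(ns) == ["a", "b", "c"]

too_deep = ".".join(f"n{i}" for i in range(17))   # 17 levels is one too many
try:
    parse_namespace(too_deep)
    assert False, "expected a depth error"
except ValueError:
    pass
```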
&lt;h3&gt;Tables&lt;/h3&gt;
&lt;p&gt;Polaris tables are entities that map to Apache Iceberg tables. They store actual data and metadata necessary for managing and querying the data efficiently. Tables in Polaris benefit from Iceberg&apos;s capabilities, such as ACID transactions, schema evolution, and partitioning, making them highly reliable and performant for large-scale data operations.&lt;/p&gt;
&lt;h3&gt;Views&lt;/h3&gt;
&lt;p&gt;Views in Polaris are entities that map to Apache Iceberg views. They provide a way to define virtual tables based on the results of a query. Views are useful for creating reusable, queryable abstractions on top of existing tables without duplicating data. They can simplify complex queries and enhance data accessibility for different user groups.&lt;/p&gt;
&lt;p&gt;By organizing data into these entities, Polaris ensures that your data lakehouse is well-structured, easily navigable, and ready for advanced data processing and analytics tasks. In the next section, we will explore the concepts of internal and external catalogs within Polaris.&lt;/p&gt;
&lt;h2&gt;Catalogs in Polaris&lt;/h2&gt;
&lt;p&gt;Polaris introduces the concepts of internal and external catalogs to provide flexibility and control over how data is organized and accessed within a data lakehouse architecture.&lt;/p&gt;
&lt;h3&gt;Internal Catalogs&lt;/h3&gt;
&lt;p&gt;Internal catalogs are managed entirely by Polaris. They are self-contained within the Polaris system, meaning that all metadata and data management operations are handled by Polaris itself. This approach simplifies the setup and management process, as users can rely on Polaris to maintain consistency, security, and performance.&lt;/p&gt;
&lt;p&gt;Internal catalogs are ideal for scenarios where centralized control over data management is desired. They ensure that all data governance policies, access controls, and data lineage tracking are enforced consistently. By using internal catalogs, organizations can benefit from a streamlined data management experience without needing to configure and manage external systems.&lt;/p&gt;
&lt;h3&gt;External Catalogs&lt;/h3&gt;
&lt;p&gt;External catalogs, on the other hand, integrate with existing catalog services outside of Polaris. They allow Polaris to interact with and manage data that resides in external systems. This integration enables organizations to leverage Polaris&apos;s features while still utilizing their current data infrastructure.&lt;/p&gt;
&lt;p&gt;External catalogs are useful for organizations that have existing investments in other catalog systems and want to incorporate Polaris&apos;s capabilities without disrupting their current workflows. By configuring Polaris to work with external catalogs, users can extend Polaris&apos;s benefits, such as enhanced security and interoperability, to their existing data assets.&lt;/p&gt;
&lt;h3&gt;Comparison and Use Cases&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Internal Catalogs&lt;/strong&gt;: Best suited for organizations looking for a unified and centralized data management solution. They offer simplicity and comprehensive control over data governance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;External Catalogs&lt;/strong&gt;: Ideal for organizations with established catalog systems that want to enhance their data management capabilities without a complete overhaul. They provide flexibility and integration with existing infrastructures.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The distinction between internal and external catalogs in Polaris offers organizations the flexibility to choose the best approach.&lt;/p&gt;
&lt;h2&gt;Principals in Polaris&lt;/h2&gt;
&lt;p&gt;Polaris employs a robust security and governance framework centered around &lt;a href=&quot;https://github.com/polaris-catalog/polaris/blob/main/docs/access-control.md&quot;&gt;the concept of principals&lt;/a&gt;. This framework ensures that data access and management are secure and compliant with organizational policies. The key components of this framework are principal roles, catalog roles, and privileges, all managed through a role-based access control (RBAC) model.&lt;/p&gt;
&lt;h3&gt;Principal Roles&lt;/h3&gt;
&lt;p&gt;Principal roles logically group service principals (users or services) together, making it easier to manage permissions and access controls. Each principal role can be assigned to multiple service principals, allowing consistent access policy application across different users and services. For example, a &lt;code&gt;DataEngineer&lt;/code&gt; principal role might be assigned to all users performing data engineering tasks, granting them the necessary privileges to manage tables and execute queries.&lt;/p&gt;
&lt;h3&gt;Catalog Roles&lt;/h3&gt;
&lt;p&gt;Catalog roles define a set of permissions for actions that can be performed on a catalog and its entities, such as namespaces and tables. These roles are specific to a particular catalog and can be assigned to one or more principal roles. For instance, a &lt;code&gt;CatalogAdministrator&lt;/code&gt; role might be granted full access to create, modify, and delete tables within a specific catalog. In contrast, a &lt;code&gt;CatalogReader&lt;/code&gt; role might be limited to read-only access.&lt;/p&gt;
&lt;h3&gt;Privileges&lt;/h3&gt;
&lt;p&gt;Privileges are the specific actions that can be performed on securable objects within Polaris. These actions include creating, reading, updating, and deleting catalogs, namespaces, tables, and views. Privileges are granted to catalog roles, which are then assigned to principal roles, ensuring a hierarchical and controlled distribution of access rights. Some common privileges include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CATALOG_MANAGE_ACCESS&lt;/strong&gt;: Grants the ability to manage access permissions for a catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CATALOG_MANAGE_CONTENT&lt;/strong&gt;: Allows full management of a catalog&apos;s content, including metadata and data operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TABLE_READ_DATA&lt;/strong&gt;: Permits reading data from a table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TABLE_WRITE_DATA&lt;/strong&gt;: Allows writing data to a table.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Role-Based Access Control (RBAC)&lt;/h4&gt;
&lt;p&gt;The RBAC model in Polaris ensures access is granted based on roles rather than directly to individual users or services. This model simplifies permissions management and ensures access policies are consistently enforced. By defining roles and assigning them to principals, Polaris allows for scalable and maintainable access control.&lt;/p&gt;
&lt;p&gt;For example, in a typical scenario, an organization might have the following setup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Principal Role: DataEngineer&lt;/strong&gt;: Assigned to users who manage data processing tasks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog Role: TableManager&lt;/strong&gt;: Grants privileges to create, read, update, and delete tables.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privileges&lt;/strong&gt;: Specific actions like &lt;code&gt;TABLE_CREATE&lt;/code&gt;, &lt;code&gt;TABLE_READ_DATA&lt;/code&gt;, and &lt;code&gt;TABLE_WRITE_DATA&lt;/code&gt; are granted to the &lt;code&gt;TableManager&lt;/code&gt; role.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By assigning the &lt;code&gt;TableManager&lt;/code&gt; catalog role to the &lt;code&gt;DataEngineer&lt;/code&gt; principal role, all data engineers gain the necessary permissions to perform their tasks on the specified catalogs.&lt;/p&gt;
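&lt;p&gt;The chain above can be modeled with plain dictionaries (an illustrative data model, not the Polaris API): privileges attach to catalog roles, catalog roles attach to principal roles, and principal roles attach to principals:&lt;/p&gt;

```python
# Sketch of the RBAC chain (illustrative data model, not the Polaris API):
# privileges are granted to catalog roles, catalog roles are assigned to
# principal roles, and principal roles are assigned to principals.

catalog_roles = {
    "TableManager": {"TABLE_CREATE", "TABLE_READ_DATA", "TABLE_WRITE_DATA"},
    "CatalogReader": {"TABLE_READ_DATA"},
}
principal_roles = {
    "DataEngineer": ["TableManager"],
    "Analyst": ["CatalogReader"],
}
principals = {
    "etl-service": ["DataEngineer"],
    "dashboard": ["Analyst"],
}

def privileges_for(principal):
    # Walk principal, then principal roles, then catalog roles, to privileges.
    granted = set()
    for prole in principals.get(principal, []):
        for crole in principal_roles.get(prole, []):
            granted |= catalog_roles.get(crole, set())
    return granted

def can(principal, privilege):
    return privilege in privileges_for(principal)

assert can("etl-service", "TABLE_WRITE_DATA")
assert can("dashboard", "TABLE_READ_DATA")
assert not can("dashboard", "TABLE_WRITE_DATA")
```

&lt;p&gt;With this shape, granting a team new permissions means editing one role rather than touching every individual principal.&lt;/p&gt;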
&lt;p&gt;Polaris&apos;s principals and RBAC model provide a secure and efficient way to manage access and permissions within your data lakehouse architecture. By leveraging these concepts, organizations can ensure that their data is both accessible to those who need it and protected from unauthorized access.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Polaris offers a comprehensive and flexible solution for managing Apache Iceberg catalogs within a data lakehouse architecture. By robustly supporting internal and external catalogs, Polaris ensures that organizations can choose the best approach for their existing infrastructure and data management needs.&lt;/p&gt;
&lt;p&gt;The various entities managed by Polaris, including catalogs, namespaces, tables, and views, are designed to provide a structured and efficient way to organize and access data. Polaris&apos;s role-based access control (RBAC) model further enhances security and governance, ensuring that data is accessed and managed in compliance with organizational policies.&lt;/p&gt;
&lt;p&gt;Polaris&apos;s Iceberg REST Catalog integration enables working with popular data processing engines like Apache Spark, Snowflake, and Dremio, and its support for various storage types makes it a versatile tool for modern data architectures.&lt;/p&gt;
&lt;p&gt;Polaris simplifies the management of Apache Iceberg catalogs and enhances the overall data lakehouse experience by providing a unified, secure, and efficient framework for data management. Whether you want to streamline your data governance or integrate with existing systems, Polaris offers the tools and capabilities to meet your needs.&lt;/p&gt;
&lt;h5&gt;GET HANDS-ON WITH APACHE ICEBERG&lt;/h5&gt;
&lt;p&gt;Below is a list of exercises to help you get hands-on with Apache Iceberg and see all of this in action yourself!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Intro to Apache Iceberg, Nessie and Dremio on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-json-csv-parquet-dremio&quot;&gt;JSON/CSV/Parquet to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-mongodb-dashboard&quot;&gt;From MongoDB to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-sqlserver-dashboard&quot;&gt;From SQLServer to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-postgres-to-dashboard&quot;&gt;From Postgres to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/dremio-experience&quot;&gt;Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-elastic&quot;&gt;Elasticsearch to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-mysql-dashboard&quot;&gt;MySQL to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-kafka-connect-dremio&quot;&gt;Apache Kafka to Apache Iceberg to Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;Apache Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-druid-dremio&quot;&gt;Apache Druid to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/end-to-end-de-tutorial&quot;&gt;Postgres to Apache Iceberg to Dashboard with Spark &amp;amp; Dremio&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Upcoming Data Talks from Alex Merced (And how to follow)</title><link>https://iceberglakehouse.com/posts/2024-7-upcoming-data-talks-from-alex-merced/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-7-upcoming-data-talks-from-alex-merced/</guid><description>
In this article, I will provide you with a list of events I&apos;m currently scheduled to speak at. New events are regularly being added, so here are a co...</description><pubDate>Sat, 20 Jul 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In this article, I will provide you with a list of events I&apos;m currently scheduled to speak at. New events are regularly being added, so here are a couple of good spots to always be in the know.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://lu.ma/Techeventswithalex&quot;&gt;Subscribe to my Luma Calendar to get regular new event updates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/events&quot;&gt;Dremio&apos;s Events Calendar&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Events&lt;/h2&gt;
&lt;h4&gt;Apache Iceberg Crash Course (July 11 - October 29th)&lt;/h4&gt;
&lt;p&gt;There is a 10-session Crash Course on Apache Iceberg, which you can &lt;a href=&quot;https://bit.ly/am-2024-iceberg-live-crash-course-1&quot;&gt;REGISTER FOR HERE&lt;/a&gt;. The curriculum and dates are below.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;July 11: What is a Data Lakehouse and What is a Table Format? (On Demand)&lt;/li&gt;
&lt;li&gt;July 16: The Architecture of Apache Iceberg, Apache Hudi and Delta Lake (On Demand)&lt;/li&gt;
&lt;li&gt;July 23: The Read and Write Process for Apache Iceberg Tables&lt;/li&gt;
&lt;li&gt;Aug 13: Understanding Apache Iceberg’s Partitioning Features&lt;/li&gt;
&lt;li&gt;Aug 27: Optimizing Apache Iceberg Tables&lt;/li&gt;
&lt;li&gt;Sep 3: Streaming with Apache Iceberg&lt;/li&gt;
&lt;li&gt;Sep 17: The Role of Apache Iceberg Catalogs&lt;/li&gt;
&lt;li&gt;Oct 1: Versioning with Apache Iceberg&lt;/li&gt;
&lt;li&gt;Oct 15: Ingesting Data into Apache Iceberg with Apache Spark&lt;/li&gt;
&lt;li&gt;Oct 29: Ingesting Data into Apache Iceberg with Dremio&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Registrants can access the on-demand recordings if they&apos;ve missed the first few sessions.&lt;/p&gt;
&lt;h4&gt;Tampa Bay Data Engineering Meetup&lt;/h4&gt;
&lt;p&gt;I speak most months at this meetup based in my home state of Florida. I&apos;ll be speaking at these two upcoming meetups, which are held online and can be attended from anywhere.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;JULY 25: &lt;a href=&quot;https://www.meetup.com/tampa-bay-data-engineering-group/events/301599901/&quot;&gt;Dremio&apos;s Apache Iceberg Powered Reflections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;AUG 29th: &lt;a href=&quot;https://www.meetup.com/tampa-bay-data-engineering-group/events/301600144/&quot;&gt;Apache Iceberg REST Catalog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Anant&apos;s Data Engineers Lunch&lt;/h4&gt;
&lt;p&gt;Anant runs several weekly meetups for the data community, including the weekly Data Engineer Lunch, where I speak about once a month. Here are my next visits to this virtual event:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;JULY 29th: &lt;a href=&quot;https://www.meetup.com/data-wranglers-dc/events/301715974/&quot;&gt;Dremio&apos;s Apache Iceberg Powered Reflections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;AUG 26th: &lt;a href=&quot;https://www.meetup.com/data-wranglers-dc/events/301715917/&quot;&gt;How Dremio&apos;s Semantic Layer Empowers Lakehouses&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;SQLSaturday Denver&lt;/h4&gt;
&lt;p&gt;SQLSaturday is an annual event featuring talks about data, databases, analytics, and AI and is held in many cities nationwide. Recently, I spoke at the 2024 SQLSaturday in Ft. Lauderdale and will now &lt;a href=&quot;https://sqlsaturday.com/2024-08-17-sqlsaturday1090/&quot;&gt;speak at the SQLSaturday in Denver&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Apache Community Over Code&lt;/h4&gt;
&lt;p&gt;The Community Over Code conference is one of the world&apos;s biggest gatherings of open-source developers. I am honored to be part of this year&apos;s line-up discussing Data Lakehouse Data Versioning and the different open-source approaches to it. &lt;a href=&quot;https://communityovercode.org/&quot;&gt;Find more information about the conference here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Remember you can stay up to date on events I&apos;ll be at by...&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://lu.ma/Techeventswithalex&quot;&gt;Subscribing to my Calendar on Lu.Ma&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://bio.alexmerced.com/data&quot;&gt;Also find my social links, slack communities and more here&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Databases Deconstructed - The Value of Data Lakehouses and Table Formats</title><link>https://iceberglakehouse.com/posts/2024-7-databases-decontstructed-value-of-data-lakehouses-and-table-formats/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-7-databases-decontstructed-value-of-data-lakehouses-and-table-formats/</guid><description>
- [Check out my Apache Iceberg Crash Course](https://bit.ly/am-2024-iceberg-live-crash-course-1)
- [Get a free copy of Apache Iceberg the Definiti...</description><pubDate>Fri, 12 Jul 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-2024-iceberg-live-crash-course-1&quot;&gt;Check out my Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-book&quot;&gt;Get a free copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Databases and data warehouses are powerful systems that simplify working with data by abstracting many of the inherent challenges, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Storage:&lt;/strong&gt; How data is stored and persisted, what file formats are used, and how those files are managed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tables:&lt;/strong&gt; How we determine which data belongs to which table and what table statistics are tracked internally.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Catalog:&lt;/strong&gt; How the system keeps track of all the tables so that users can easily access them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Processing:&lt;/strong&gt; When a user writes a query, how that query is parsed into relational algebra expressions, transformed into an execution plan, optimized, and executed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The drawback of having such tightly coupled systems is that the data within them is only understood by that specific system. Therefore, if another system is needed for a particular use case, the data must be migrated and duplicated into that other system. While this is often manageable for transactions (e.g., adding a user, updating a user, recording a sale) as a single database system can handle all CRUD operations (Create, Read, Update, Delete), it becomes more problematic in analytics. Analytical use cases are far more diverse, as are the tools required to support them, making data migration and duplication cumbersome and inefficient.&lt;/p&gt;
&lt;h2&gt;Enter the Lakehouse&lt;/h2&gt;
&lt;p&gt;In analytics, the status quo has been to duplicate your data across multiple systems for different use cases:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Lakes:&lt;/strong&gt; A universal storage layer for structured data in the form of CSV, JSON, ORC, AVRO, and Parquet files, as well as all other unstructured data like images, videos, and audio. Data lakes are often used as a repository for archiving all data and as a place to store the diverse data needed for training AI/ML models that require both structured and unstructured data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Warehouses:&lt;/strong&gt; Essentially databases designed for analytics. They store and manage structured data with analytics in mind and are usually used as the data source for reports and Business Intelligence dashboards (visual panes built directly on the data).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In an ideal world, you wouldn&apos;t need the costs and complexity of duplicating your data across systems and figuring out how to keep it all consistent. This is where the data lakehouse pattern comes in. The data lakehouse is an architectural pattern that essentially builds a deconstructed database using your data lake as the storage layer. The benefit is that structured data can now exist once in your data lake, and both data lakehouse tools and data warehouse tools can access it.&lt;/p&gt;
&lt;p&gt;Let&apos;s examine the construction of a data lakehouse layer by layer.&lt;/p&gt;
&lt;h2&gt;The Storage Layer&lt;/h2&gt;
&lt;p&gt;The basic foundation of a data lakehouse is the storage layer, where we need to determine where and how to store the data. For the &amp;quot;where,&amp;quot; the obvious choice is object storage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What is Object Storage?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Object storage is a data storage architecture that manages data as objects, as opposed to file systems that manage data as a file hierarchy, or block storage which manages data as blocks within sectors and tracks. Each object includes the data itself, a variable amount of metadata, and a unique identifier. This approach is highly scalable, cost-effective, and suitable for handling large amounts of unstructured data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Benefits of Object Storage:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Object storage can easily scale out by adding more nodes, making it suitable for growing data needs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Durability and Reliability:&lt;/strong&gt; Data is stored redundantly across multiple locations, ensuring high durability and reliability.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost-Effectiveness:&lt;/strong&gt; Often cheaper than traditional storage solutions, especially when dealing with large volumes of data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata Management:&lt;/strong&gt; Object storage allows for rich metadata, enabling more efficient data management and retrieval.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can opt for major cloud vendors like AWS, Azure, and Google Cloud. However, many other storage vendors, such as &lt;a href=&quot;https://www.dremio.com/blog/3-reasons-to-create-hybrid-apache-iceberg-data-lakehouses/&quot;&gt;NetApp, Vast Data, MinIO, and Pure Storage, provide additional value in object storage solutions&lt;/a&gt; both in the cloud and on-premises.&lt;/p&gt;
&lt;p&gt;Next, we need to determine how we will store the data on the storage layer. The industry standard for this is Apache Parquet files.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;What are Apache Parquet Files?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apache Parquet is a columnar storage file format optimized for use with big data processing frameworks. It is designed for efficient data storage and retrieval, making it ideal for analytic workloads.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Why Parquet Files are Good for Analytics:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Efficiency:&lt;/strong&gt; Parquet&apos;s columnar storage layout allows for efficient data compression and encoding schemes, reducing the amount of data scanned and improving query performance.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compatibility:&lt;/strong&gt; Parquet is widely supported by many data processing tools and frameworks, making it a versatile choice for data storage.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;How Row Groups Work in Parquet:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Parquet files are divided into row groups, subsets of the data that can be processed independently. Each row group contains column chunks, each of which consists of pages. This structure enables efficient reads by allowing queries to skip irrelevant data and read only the necessary columns and rows.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;+-----------------------------------------------------+
|                     Parquet File                    |
+-----------------------------------------------------+
|                    File Metadata                    |
|                                                     |
| - Schema                                            |
| - Key-Value Metadata                                |
| - Version                                           |
+-----------------------------------------------------+
|                    Row Group 1                      |
|  +-----------------------------------------------+  |
|  |                 Column Chunk 1                |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 1                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 2                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  ...                    |  |  |
|  |  +-----------------------------------------+  |  |
|  +-----------------------------------------------+  |
|  +-----------------------------------------------+  |
|  |                 Column Chunk 2                |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 1                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 2                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  ...                    |  |  |
|  |  +-----------------------------------------+  |  |
|  +-----------------------------------------------+  |
|  |                     ...                       |  |
|  +-----------------------------------------------+  |
+-----------------------------------------------------+
|                    Row Group 2                      |
|  +-----------------------------------------------+  |
|  |                 Column Chunk 1                |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 1                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 2                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  ...                    |  |  |
|  |  +-----------------------------------------+  |  |
|  +-----------------------------------------------+  |
|  +-----------------------------------------------+  |
|  |                 Column Chunk 2                |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 1                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  Page 2                 |  |  |
|  |  +-----------------------------------------+  |  |
|  |  |                  ...                    |  |  |
|  |  +-----------------------------------------+  |  |
|  +-----------------------------------------------+  |
|  |                     ...                       |  |
|  +-----------------------------------------------+  |
+-----------------------------------------------------+
|                     ...                             |
+-----------------------------------------------------+

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By establishing a robust storage layer with object storage and using Apache Parquet files for data storage, we create a strong foundation for our data lakehouse. This setup ensures scalability, efficiency, and compatibility, essential for handling diverse and extensive data analytics workloads.&lt;/p&gt;
&lt;h2&gt;The Table Format&lt;/h2&gt;
&lt;p&gt;While Parquet files are excellent for storing data for quick access, datasets can eventually grow large enough to span multiple files. A Parquet file is only aware of itself, not of the other files in the same dataset. This leaves the analyst responsible for defining the dataset, which can lead to mistakes where a file isn&apos;t included or extra files are included, resulting in inconsistent data definitions across use cases. Additionally, engines still need to open every file to execute a query, which can be time-consuming, especially if many files aren&apos;t needed for the specific query.&lt;/p&gt;
&lt;p&gt;In this case, we need an abstraction that helps do a few things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define which files comprise the current version of the table.&lt;/li&gt;
&lt;li&gt;Maintain a history of the file listings for previous table versions.&lt;/li&gt;
&lt;li&gt;Track file statistics that can be used to determine which files are relevant to a particular query.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This abstraction allows for faster scanning of large datasets and consistent results. It is known as a &amp;quot;table format,&amp;quot; a standard for how metadata is written to document the files in the table along with their statistics.&lt;/p&gt;
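&lt;p&gt;In engines that support Iceberg (Spark SQL, for example), this metadata is itself queryable through built-in metadata tables. A quick sketch, assuming a table named &lt;code&gt;sales&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Each snapshot records a table version and the operation that produced it
SELECT snapshot_id, committed_at, operation FROM sales.snapshots;

-- Each data file in the current version is listed with statistics used for pruning
SELECT file_path, record_count, file_size_in_bytes FROM sales.files;
&lt;/code&gt;&lt;/pre&gt;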
&lt;p&gt;Currently, there are three main table formats: &lt;a href=&quot;https://bit.ly/am-format-arch&quot;&gt;Apache Iceberg, Apache Hudi, and Delta Lake&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://blog.iceberglakehouse.com/summarizing-recent-wins-for-apache-iceberg-table-format-56bd60837181?source=collection_home---4------3-----------------------&quot;&gt;Apache Iceberg is thought to have recently established itself as the industry standard.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;The Catalog&lt;/h2&gt;
&lt;p&gt;Now that we have folders with metadata and data that comprise a table, processing tools need a way to know these tables exist and where the metadata for each table can be found. This is where the lakehouse catalog comes into play. A lakehouse catalog can perform several functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Track Tables:&lt;/strong&gt; Keep track of which tables exist in a data lakehouse.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Metadata References:&lt;/strong&gt; Provide references to the current metadata of each table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Access Governance:&lt;/strong&gt; Govern access to the assets it tracks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The catalog becomes the mechanism for bundling your tables and making them accessible to your preferred data processing tools. Currently, there are four main lakehouse catalogs offering solutions for this layer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Nessie:&lt;/strong&gt; An open-source catalog that offers unique catalog-level versioning features for git-like functionality, initially created by Dremio.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Polaris:&lt;/strong&gt; An open-source catalog created by Snowflake.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unity OSS:&lt;/strong&gt; An open-source catalog from Databricks, which is a complete rewrite of their proprietary Unity catalog.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gravitino:&lt;/strong&gt; An open-source catalog from Datastrato.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can efficiently manage and access your data lakehouse tables by leveraging these catalogs, ensuring seamless integration with various data processing tools.&lt;/p&gt;
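&lt;p&gt;As one example of what this wiring looks like, a query engine like Apache Spark can be pointed at a Nessie catalog with a handful of Iceberg catalog properties. This is a sketch; the catalog name, URI, and warehouse path are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.lakehouse.uri=http://localhost:19120/api/v2
spark.sql.catalog.lakehouse.ref=main
spark.sql.catalog.lakehouse.warehouse=s3://my-bucket/warehouse
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once configured, tables tracked by the catalog are addressable as &lt;code&gt;lakehouse.namespace.table&lt;/code&gt; from SQL, with no per-table setup.&lt;/p&gt;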
&lt;h2&gt;Data Processing&lt;/h2&gt;
&lt;p&gt;Now that we have everything needed to store and track our data, we just need a tool to access that data and run queries and transformations for us. One such tool is &lt;a href=&quot;https://www.dremio.com/solutions/data-lakehouse/&quot;&gt;Dremio, a data lakehouse platform&lt;/a&gt; created to make working with data lakehouses easier, faster, and more open.&lt;/p&gt;
&lt;h3&gt;Dremio provides:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unified Analytics:&lt;/strong&gt; The ability to connect to data lakes, lakehouse catalogs, databases, and data warehouses, allowing you to work with all your data in one place.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SQL Query Engine:&lt;/strong&gt; An SQL query engine with industry-leading price/performance that can federate queries across all these data sources, featuring a semantic layer to track and define data models and metrics for fast reporting and business intelligence (BI).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lakehouse Management:&lt;/strong&gt; Dremio has deep integrations with Nessie catalogs and an integrated lakehouse catalog that allows you to automate the maintenance of Iceberg tables, ensuring your data lakehouse runs smoothly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Dremio becomes a powerful tool for unifying and organizing your data into data products for your users. Since your data is in a lakehouse, several tools can be part of the picture. For example, you can use Upsolver, Fivetran, or Airbyte to ingest data into your lakehouse, run graph queries on your lakehouse with Puppygraph, and explore many other possibilities—all without needing multiple copies of your data.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We&apos;ve deconstructed the traditional database and data warehouse systems to highlight the value of data lakehouses and table formats. Databases and data warehouses simplify working with data by abstracting many inherent challenges, such as storage, table management, cataloging, and query processing. However, the tightly coupled nature of these systems often necessitates data duplication across different systems, leading to increased complexity and inefficiency.&lt;/p&gt;
&lt;p&gt;The data lakehouse architecture emerges as a solution, offering the best of both data lakes and data warehouses. By leveraging object storage and Apache Parquet files, we establish a robust storage layer that ensures scalability, efficiency, and compatibility. The introduction of table formats like Apache Iceberg, Apache Hudi, and Delta Lake further enhances our ability to manage large datasets effectively.&lt;/p&gt;
&lt;p&gt;To manage and track our data, lakehouse catalogs like Nessie, Polaris, Unity OSS, and Gravitino provide essential functionalities such as tracking tables, providing metadata references, and governing access. Finally, tools like Dremio offer potent data processing capabilities, enabling unified analytics, efficient query execution, and seamless lakehouse management.&lt;/p&gt;
&lt;p&gt;By adopting a data lakehouse architecture, we can streamline data management, reduce costs, and accelerate time to insight, all while maintaining the flexibility to integrate various tools and technologies.&lt;/p&gt;
&lt;p&gt;To learn more about the practical implementation of these concepts, be sure to check out my Apache Iceberg Crash Course and get a free copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-2024-iceberg-live-crash-course-1&quot;&gt;Check out my Apache Iceberg Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-book&quot;&gt;Get a free copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;GET HANDS-ON&lt;/h2&gt;
&lt;p&gt;Below is a list of exercises to help you get hands-on with Apache Iceberg and see all of this in action yourself!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Intro to Apache Iceberg, Nessie and Dremio on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-json-csv-parquet-dremio&quot;&gt;JSON/CSV/Parquet to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-mongodb-dashboard&quot;&gt;From MongoDB to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-sqlserver-dashboard&quot;&gt;From SQLServer to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-postgres-to-dashboard&quot;&gt;From Postgres to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/dremio-experience&quot;&gt;Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-elastic&quot;&gt;Elasticsearch to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-mysql-dashboard&quot;&gt;MySQL to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-kafka-connect-dremio&quot;&gt;Apache Kafka to Apache Iceberg to Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;Apache Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-druid-dremio&quot;&gt;Apache Druid to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/end-to-end-de-tutorial&quot;&gt;Postgres to Apache Iceberg to Dashboard with Spark &amp;amp; Dremio&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Partitioning with Apache Iceberg - A Deep Dive</title><link>https://iceberglakehouse.com/posts/2024-5-partitioning-with-apache-iceberg-deep-dive/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-5-partitioning-with-apache-iceberg-deep-dive/</guid><description>
- [Apache Iceberg 101](https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/)
- [Get Hands-on ...</description><pubDate>Wed, 29 May 2024 09:00:00 GMT</pubDate><content:encoded>&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Apache Iceberg 101&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Get Hands-on With Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-book&quot;&gt;Free PDF Copy of Apache Iceberg: The Definitive Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Partitioning is a fundamental concept in data management that significantly enhances query performance by organizing data into distinct segments. This technique groups similar rows together based on specific criteria, making it easier and faster to retrieve relevant data.&lt;/p&gt;
&lt;p&gt;Apache Iceberg is an open table format designed for large analytic datasets. It brings high performance and reliability to data lake architectures, offering advanced capabilities such as hidden partitioning, which simplifies data management and improves query efficiency. In this blog, we will explore the partitioning capabilities of Apache Iceberg, highlighting how it stands out from traditional partitioning methods and demonstrating its practical applications using Dremio.&lt;/p&gt;
&lt;h2&gt;What is Partitioning?&lt;/h2&gt;
&lt;p&gt;Partitioning is a technique used to enhance the performance of queries by grouping similar rows together when data is written to storage. By organizing data in this manner, it becomes much faster to locate and retrieve specific subsets of data during query execution.&lt;/p&gt;
&lt;p&gt;For example, consider a logs table where queries typically include a time range filter, such as retrieving logs between 10 and 12 AM:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT level, message FROM logs
WHERE event_time BETWEEN &apos;2018-12-01 10:00:00&apos; AND &apos;2018-12-01 12:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Configuring the logs table to partition by the date of event_time groups log events into files based on the event date. Apache Iceberg keeps track of these dates, enabling the query engine to skip over files that do not contain relevant data, thereby speeding up query execution.&lt;/p&gt;
&lt;p&gt;Iceberg supports partitioning by various granularities such as year, month, day, and hour. It can also partition data based on categorical columns, such as the level column in the logs example, to further optimize query performance.&lt;/p&gt;
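&lt;p&gt;For example, the logs table above could be declared with hidden partitioning on both the event date and the level column. This is a Spark SQL sketch; the column types are assumptions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE logs (
  event_time TIMESTAMP,
  level STRING,
  message STRING
) USING iceberg
PARTITIONED BY (day(event_time), level);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that &lt;code&gt;day(event_time)&lt;/code&gt; is a transform of an existing column, not a new column the writer must populate.&lt;/p&gt;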
&lt;h2&gt;Traditional Partitioning Approaches&lt;/h2&gt;
&lt;p&gt;Traditional table formats like Hive also support partitioning, but they require explicit partitioning columns.&lt;/p&gt;
&lt;p&gt;To illustrate the difference between traditional partitioning and Iceberg&apos;s approach, let&apos;s consider how Hive handles partitioning with a sales table.&lt;/p&gt;
&lt;p&gt;In Hive, partitions are explicit and must be defined as separate columns. For a sales table, this means creating a &lt;code&gt;sale_date&lt;/code&gt; column and manually inserting data into partitions:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO sales PARTITION (sale_date)
  SELECT product_id, amount, sale_time, date_format(sale_time, &apos;yyyy-MM-dd&apos;)
  FROM unstructured_sales_source;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Querying the sales table in Hive also requires an additional filter on the partition column:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT product_id, count(1) as count FROM sales
WHERE sale_time BETWEEN &apos;2022-01-01 10:00:00&apos; AND &apos;2022-01-01 12:00:00&apos;
  AND sale_date = &apos;2022-01-01&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Problems with Hive Partitioning:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Manual Partition Management:&lt;/strong&gt; Hive requires explicit partition columns and manual insertion of partition values, increasing the likelihood of errors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Lack of Validation:&lt;/strong&gt; Hive cannot validate partition values, leading to potential inaccuracies if the wrong format or source column is used.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query Complexity:&lt;/strong&gt; Queries must include filters on partition columns to benefit from partitioning, making them more complex and error-prone.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Static Partition Layouts:&lt;/strong&gt; Changing the partitioning scheme in Hive can break existing queries, limiting flexibility.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These issues highlight the challenges of traditional partitioning approaches, which Iceberg overcomes with its automated and hidden partitioning capabilities.&lt;/p&gt;
&lt;h2&gt;What Does Iceberg Do Differently?&lt;/h2&gt;
&lt;p&gt;Apache Iceberg addresses the limitations of traditional partitioning by introducing hidden partitioning, which automates and simplifies the partitioning process.&lt;/p&gt;
&lt;h3&gt;Key Features of Iceberg&apos;s Partitioning:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hidden Partitioning:&lt;/strong&gt; Iceberg automatically handles the creation of partition values, removing the need for explicit partition columns. This reduces errors and simplifies data management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automatic Partition Pruning:&lt;/strong&gt; Iceberg can skip unnecessary partitions during query execution without requiring additional filters. This optimization ensures faster query performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Evolving Partition Layouts:&lt;/strong&gt; Iceberg allows partition layouts to evolve over time as data volumes change, without breaking existing queries. This flexibility makes it easier to adapt to changing data requirements.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, in an Iceberg table, sales can be partitioned by date and product category without explicitly maintaining these columns:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE sales (
  product_id STRING,
  amount DECIMAL,
  sale_time TIMESTAMP,
  category STRING
) PARTITIONED BY (date(sale_time), category);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With Iceberg&apos;s hidden partitioning, producers and consumers do not need to be aware of the partitioning scheme, leading to more straightforward and error-free data operations. This approach ensures that partition values are always produced correctly and used to optimize queries.&lt;/p&gt;
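&lt;p&gt;To make this concrete, here is a sketch of what writes and reads against the sales table above look like. Neither statement mentions a partition column, and the values shown are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Writers insert rows as-is; Iceberg derives date(sale_time) and category itself
INSERT INTO sales
VALUES (&apos;SKU-123&apos;, 19.99, TIMESTAMP &apos;2022-01-01 10:30:00&apos;, &apos;electronics&apos;);

-- Readers filter on the source column; files outside the range are pruned automatically
SELECT product_id, count(1) AS count FROM sales
WHERE sale_time BETWEEN &apos;2022-01-01 10:00:00&apos; AND &apos;2022-01-01 12:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;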
&lt;h2&gt;Iceberg Partition Transformations&lt;/h2&gt;
&lt;p&gt;Apache Iceberg supports a variety of partition transformations that allow for flexible and efficient data organization. These transformations help optimize query performance by logically grouping data based on specified criteria.&lt;/p&gt;
&lt;h3&gt;Overview of Supported Partition Transformations&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Year, Month, Day, Hour Transformations:&lt;/strong&gt; These transformations are used for timestamp columns to partition data by specific time intervals.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Categorical Column Transformations:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Bucket:&lt;/strong&gt; Partitions data by hashing values into a specified number of buckets.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Truncate:&lt;/strong&gt; Partitions data by truncating values to a specified length, suitable for strings or numeric ranges.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Example Scenarios for Each Transformation&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Year, Month, Day, Hour Transformations:&lt;/strong&gt; Beneficial for time-series data where queries often filter by date ranges. For example, partitioning sales data by month can significantly speed up monthly sales reports.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bucket Transformation:&lt;/strong&gt; Useful for columns with high cardinality, such as user IDs, to evenly distribute data across partitions and avoid skew.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Truncate Transformation:&lt;/strong&gt; Effective for partitioning data with predictable ranges or fixed-length values, such as product codes or zip codes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
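&lt;p&gt;To build intuition for how these transforms map raw column values to partition values, here is a small Python sketch. It is illustrative only: Iceberg&apos;s bucket transform is defined over a 32-bit Murmur3 hash, so the simplified stand-in hash below will not match the bucket numbers a real Iceberg table produces.&lt;/p&gt;

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)

def day_transform(ts: datetime) -> int:
    """Iceberg's day transform: days since the Unix epoch."""
    return (ts.date() - EPOCH).days

def month_transform(ts: datetime) -> int:
    """Iceberg's month transform: months since the Unix epoch."""
    return (ts.year - 1970) * 12 + (ts.month - 1)

def truncate_transform(value, width: int):
    """Truncate: a prefix for strings, floor-to-a-multiple for integers."""
    if isinstance(value, str):
        return value[:width]
    return value - (value % width)  # Python's % floors, matching Iceberg

def bucket_transform(value, num_buckets: int) -> int:
    """Bucket: hash into N buckets. NOTE: real Iceberg hashes with 32-bit
    Murmur3; Python's built-in hash() here is only a stand-in."""
    return (hash(value) & 0x7FFFFFFF) % num_buckets

sale_time = datetime(2022, 1, 15, 10, 30)
months_since_epoch = month_transform(sale_time)   # 624 for 2022-01
zip_prefix = truncate_transform("94107-1234", 5)  # '94107'
price_band = truncate_transform(1234, 100)        # 1200
```

&lt;p&gt;The important property is that every transform is a pure function of the source column, which is what lets Iceberg derive partition values automatically instead of asking writers to maintain extra columns.&lt;/p&gt;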
&lt;h3&gt;Configuring Partitioning in Iceberg&lt;/h3&gt;
&lt;p&gt;Iceberg makes it straightforward to configure partitions when creating or modifying tables.&lt;/p&gt;
&lt;h4&gt;Syntax and Examples for Creating Iceberg Tables with Partitions&lt;/h4&gt;
&lt;p&gt;To create an Iceberg table partitioned by month:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE sales (
  product_id STRING,
  amount DECIMAL,
  sale_time TIMESTAMP,
  category STRING
) PARTITIONED BY (month(sale_time));
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This configuration will group sales records by the month of the sale_time, optimizing queries that filter by month.&lt;/p&gt;
&lt;h4&gt;Using the ALTER TABLE Command to Modify Partition Schemes&lt;/h4&gt;
&lt;p&gt;Iceberg allows you to modify the partitioning scheme of existing tables using the ALTER TABLE command. For instance, you can add a new partition field:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;ALTER TABLE sales ADD PARTITION FIELD year(sale_time);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command updates the partitioning scheme to include both month(sale_time) and year(sale_time), enhancing query performance for both monthly and yearly aggregations.&lt;/p&gt;
&lt;p&gt;Iceberg&apos;s flexible partitioning capabilities, combined with its hidden partitioning feature, ensure that data is always optimally organized for efficient querying and analysis.&lt;/p&gt;
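&lt;p&gt;What makes this evolution safe is that each data file is tagged with the partition spec it was written under, so existing files never need rewriting. The toy model below is my own illustration of that bookkeeping, not Iceberg&apos;s actual metadata classes:&lt;/p&gt;

```python
# Toy model of Iceberg-style partition-spec evolution (illustrative only;
# real Iceberg stores specs and per-file spec ids in table metadata/manifests).
table = {
    "specs": {0: ["month(sale_time)"]},
    "default_spec_id": 0,
    "files": [],
}

def write_file(table, path):
    """New data files are tagged with whichever spec is current at write time."""
    table["files"].append({"path": path, "spec_id": table["default_spec_id"]})

def evolve_spec(table, fields):
    """Like ALTER TABLE ... ADD PARTITION FIELD: register a new spec.
    Existing files keep their old spec id and are never rewritten."""
    new_id = max(table["specs"]) + 1
    table["specs"][new_id] = fields
    table["default_spec_id"] = new_id

write_file(table, "warehouse/sales/data/file-1.parquet")
evolve_spec(table, ["month(sale_time)", "year(sale_time)"])
write_file(table, "warehouse/sales/data/file-2.parquet")
# file-1 still carries spec 0; only file-2 was written under the evolved spec.
```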
&lt;h2&gt;Query Optimization with Partitioning&lt;/h2&gt;
&lt;p&gt;Apache Iceberg leverages its advanced partitioning capabilities to optimize query performance by minimizing the amount of data scanned during query execution. By organizing data into partitions based on specified transformations, Iceberg ensures that only relevant partitions are read, significantly speeding up query response times.&lt;/p&gt;
&lt;h3&gt;How Iceberg Uses Partitioning to Optimize Queries&lt;/h3&gt;
&lt;p&gt;Iceberg&apos;s hidden partitioning and automatic partition pruning capabilities enable it to skip over irrelevant data, reducing I/O and improving query performance. When a query is executed, Iceberg uses the partition metadata to determine which partitions contain the data required by the query, thereby avoiding unnecessary scans.&lt;/p&gt;
&lt;h3&gt;Example Query Demonstrating Optimized Performance with Partitioning&lt;/h3&gt;
&lt;p&gt;Consider a sales table partitioned by month. A query to retrieve sales data for a specific month can be executed efficiently:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT product_id, amount, sale_time
FROM sales
WHERE sale_time BETWEEN &apos;2022-01-01&apos; AND &apos;2022-01-31&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the table is partitioned by month, Iceberg will only scan the partition corresponding to January 2022, drastically reducing the amount of data read and speeding up the query. (Note that BETWEEN compares against the timestamp &apos;2022-01-31 00:00:00&apos;, so rows from later on January 31 are excluded; an explicit range predicate such as sale_time &amp;gt;= &apos;2022-01-01&apos; AND sale_time &amp;lt; &apos;2022-02-01&apos; avoids this pitfall.)&lt;/p&gt;
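&lt;p&gt;Conceptually, the planner turns the predicate into a set of partition values and keeps only the files whose recorded partition value falls in that set. The following sketch is a deliberate simplification of that metadata-only filtering, not Iceberg&apos;s actual scan planner:&lt;/p&gt;

```python
from datetime import date

# Each data file's metadata records its partition value (here a month ordinal,
# i.e. months since the 1970-01 epoch, matching Iceberg's month transform).
files = [
    {"path": "sales/month=623/f1.parquet", "month": 623},  # 2021-12
    {"path": "sales/month=624/f2.parquet", "month": 624},  # 2022-01
    {"path": "sales/month=625/f3.parquet", "month": 625},  # 2022-02
]

def months_in_range(start: date, end: date) -> set:
    """Convert a date-range predicate into the month ordinals it can touch."""
    lo = (start.year - 1970) * 12 + (start.month - 1)
    hi = (end.year - 1970) * 12 + (end.month - 1)
    return set(range(lo, hi + 1))

def plan_scan(files, start: date, end: date):
    """Metadata-only pruning: keep files whose partition value matches."""
    wanted = months_in_range(start, end)
    return [f["path"] for f in files if f["month"] in wanted]

# Only the January 2022 file survives pruning for a January-only predicate.
january_files = plan_scan(files, date(2022, 1, 1), date(2022, 1, 31))
```

&lt;p&gt;No data file is opened during this step; the decision is made entirely from partition metadata, which is why pruning saves so much I/O on large tables.&lt;/p&gt;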
&lt;h2&gt;Advanced Use Cases and Best Practices&lt;/h2&gt;
&lt;h3&gt;Strategies for Choosing Partition Columns and Transformations&lt;/h3&gt;
&lt;p&gt;Selecting appropriate partition columns and transformations is crucial for maximizing query performance. Consider the following strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Analyze Query Patterns:&lt;/strong&gt; Choose partition columns based on the most common query filters. For example, partitioning by date for time-series data or by region for geographically distributed data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Balance Cardinality:&lt;/strong&gt; Avoid columns with either too high or too low cardinality. High cardinality columns may create too many partitions, while low cardinality columns may not provide sufficient granularity.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Best Practices for Managing and Evolving Partition Schemes in Iceberg&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Start Simple:&lt;/strong&gt; Begin with a straightforward partitioning scheme and evolve it as your data and query patterns change.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Monitor Performance:&lt;/strong&gt; Regularly monitor query performance and adjust partitioning schemes as needed. Use Iceberg&apos;s flexible partition evolution capabilities to modify schemes without downtime.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Document Partitioning:&lt;/strong&gt; Maintain clear documentation of your partitioning strategy and any changes made over time to ensure consistent data management practices.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg&apos;s advanced partitioning approach offers significant advantages over traditional partitioning methods. By automating partition management and providing flexible partition transformations, Iceberg simplifies data organization and enhances query performance. The ability to evolve partition schemes without disrupting existing queries ensures that your data infrastructure remains efficient and adaptable.&lt;/p&gt;
&lt;p&gt;Iceberg&apos;s partitioning capabilities empower data engineers and analysts to manage large datasets more effectively, ensuring that queries are executed swiftly and accurately. Embracing Iceberg&apos;s partitioning features can lead to more efficient data workflows and better overall performance in your data lake architecture.&lt;/p&gt;
&lt;h5&gt;GET HANDS-ON&lt;/h5&gt;
&lt;p&gt;Below is a list of exercises to help you get hands-on with Apache Iceberg and see all of this in action for yourself!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Intro to Apache Iceberg, Nessie and Dremio on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-json-csv-parquet-dremio&quot;&gt;JSON/CSV/Parquet to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-mongodb-dashboard&quot;&gt;From MongoDB to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-sqlserver-dashboard&quot;&gt;From SQLServer to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-postgres-to-dashboard&quot;&gt;From Postgres to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/dremio-experience&quot;&gt;Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-elastic&quot;&gt;Elasticsearch to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-mysql-dashboard&quot;&gt;MySQL to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-kafka-connect-dremio&quot;&gt;Apache Kafka to Apache Iceberg to Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;Apache Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-druid-dremio&quot;&gt;Apache Druid to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/end-to-end-de-tutorial&quot;&gt;Postgres to Apache Iceberg to Dashboard with Spark &amp;amp; Dremio&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>3 Reasons Data Engineers Should Embrace Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2024-5-3-reasons-data-engineers-should-embrace-apache-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-5-3-reasons-data-engineers-should-embrace-apache-iceberg/</guid><description>
Data engineers are constantly seeking ways to streamline workflows and enhance data management efficiency. [Apache Iceberg, a high-performance table ...</description><pubDate>Wed, 15 May 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Data engineers are constantly seeking ways to streamline workflows and enhance data management efficiency. &lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Apache Iceberg, a high-performance table format&lt;/a&gt; for huge analytic datasets, has emerged as a game-changer in the field. By offering powerful features such as hidden partitioning, seamless partition evolution, and extensive tool compatibility, Iceberg simplifies data engineering tasks and boosts productivity. In this blog, we will delve into three key reasons why data engineers should embrace Apache Iceberg and how it can make their lives easier.&lt;/p&gt;
&lt;h2&gt;1. Hidden Partitioning&lt;/h2&gt;
&lt;p&gt;Traditionally, with Hive tables, data engineers often needed to create additional columns, such as day, month, and year, derived from a timestamp column for partitioning. This not only added extra work at the ingestion stage but also increased the size of data files. Moreover, data analysts had to be educated on how to query these columns to take advantage of partitioning.&lt;/p&gt;
&lt;p&gt;Apache Iceberg revolutionizes this process with its &lt;a href=&quot;https://www.dremio.com/subsurface/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/&quot;&gt;hidden partitioning feature&lt;/a&gt;. Partitioning in Iceberg is a metadata operation, allowing you to express transforms like day, month, and year directly in your table&apos;s DDL. Instead of inflating data files, Iceberg tracks partition value ranges in the metadata, making the relationships between columns explicit. As a result, analysts do not need to update their queries to benefit from partitioning. This significantly reduces both the complexity of managing partitioned data and the overhead on data storage, leading to more efficient and streamlined data processing workflows.&lt;/p&gt;
&lt;h2&gt;2. Seamless Partition Evolution&lt;/h2&gt;
&lt;p&gt;As data needs change, so too must the ways in which we partition our tables. For example, you might start with year-based partitioning and later realize that month-based partitioning would better suit your queries and data access patterns. In the past, changing the partitioning scheme required rewriting the entire table and all its data, a time-consuming and resource-intensive process.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/subsurface/future-proof-partitioning-and-fewer-table-rewrites-with-apache-iceberg/&quot;&gt;Apache Iceberg offers a much more flexible approach with seamless partition evolution&lt;/a&gt;. In Iceberg, you can update a table&apos;s partitioning scheme without having to rewrite existing data. Simply modify the partitioning strategy with an ALTER TABLE statement, and all future data writes will use the new scheme. The metadata keeps track of which data files are associated with which partitioning scheme, allowing for smooth transitions and backward compatibility. This feature greatly simplifies the process of adapting to evolving data requirements, saving time and reducing operational complexity for data engineers.&lt;/p&gt;
&lt;h2&gt;3. Extensive Tool Compatibility&lt;/h2&gt;
&lt;p&gt;One of the standout features of &lt;a href=&quot;https://www.youtube.com/watch?v=hh7wU9H2jz8&amp;amp;pp=ygUYQXBhY2hlIEljZWJlcmcgZWNvc3lzdGVt&quot;&gt;Apache Iceberg is its vast ecosystem of tools for reading, writing, and managing Iceberg tables&lt;/a&gt;. Unlike other solutions that may confine you to a limited set of tools, Iceberg integrates seamlessly with a wide range of technologies, allowing you to choose the tools that best fit your workflow and preferences.&lt;/p&gt;
&lt;p&gt;While Iceberg works exceptionally well with staple technologies like Apache Flink and Apache Spark, its compatibility extends far beyond these. You can leverage tools such as Dremio, Upsolver, Fivetran, Airbyte, Kafka Connect, Puppygraph, and many more. This extensive compatibility ensures that you are not locked into a specific technology stack and can adopt the tools that offer the most value for your specific use cases. The flexibility and choice provided by Iceberg&apos;s ecosystem empower data engineers to build more efficient, scalable, and adaptable data pipelines.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is transforming the way data engineers manage and optimize large datasets. With features like hidden partitioning, seamless partition evolution, and extensive tool compatibility, Iceberg not only simplifies complex data engineering tasks but also enhances the overall efficiency and flexibility of data workflows. By embracing Apache Iceberg, data engineers can reduce operational overhead, streamline data processing, and leverage a robust ecosystem of tools to meet their evolving data needs. The adoption of Apache Iceberg is a strategic move towards building more scalable, adaptable, and performant data platforms.&lt;/p&gt;
&lt;h3&gt;Exercises to Get Hands-on with Apache Iceberg on Your Laptop&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Intro to Apache Iceberg, Nessie and Dremio on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-json-csv-parquet-dremio&quot;&gt;JSON/CSV/Parquet to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-mongodb-dashboard&quot;&gt;From MongoDB to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-sqlserver-dashboard&quot;&gt;From SQLServer to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-postgres-to-dashboard&quot;&gt;From Postgres to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/dremio-experience&quot;&gt;Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-elastic&quot;&gt;Elasticsearch to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-mysql-dashboard&quot;&gt;MySQL to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-kafka-connect-dremio&quot;&gt;Apache Kafka to Apache Iceberg to Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;Apache Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-druid-dremio&quot;&gt;Apache Druid to Apache Iceberg to BI Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/end-to-end-de-tutorial&quot;&gt;Postgres to Apache Iceberg to Dashboard with Spark &amp;amp; Dremio&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Understanding the Future of Apache Iceberg Catalogs</title><link>https://iceberglakehouse.com/posts/2024-4-understanding-the-future-of-apache-iceberg-catalogs/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-4-understanding-the-future-of-apache-iceberg-catalogs/</guid><description>
[Apache Iceberg](https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/) is revolutionizing the...</description><pubDate>Thu, 04 Apr 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Apache Iceberg&lt;/a&gt; is revolutionizing the data industry as an open-source table format that allows data lake storage layers to function as full-fledged &lt;a href=&quot;https://www.dremio.com/blog/why-lakehouse-why-now-what-is-a-data-lakehouse-and-how-to-get-started/&quot;&gt;data warehouses, a concept known as a data lakehouse&lt;/a&gt;. This transformation has led to the development of comprehensive &lt;a href=&quot;https://www.dremio.com/blog/what-is-a-data-lakehouse-platform/&quot;&gt;data lakehouse platforms&lt;/a&gt; and &lt;a href=&quot;https://www.dremio.com/blog/what-is-lakehouse-management-git-for-data-automated-apache-iceberg-table-maintenance-and-more/&quot;&gt;lakehouse management tools&lt;/a&gt;, creating a robust ecosystem for modular data warehousing. At the heart of these lakehouse systems is the catalog, which tracks tables so that various tools can identify and interact with them efficiently.&lt;/p&gt;
&lt;h2&gt;What is an Apache Iceberg Catalog&lt;/h2&gt;
&lt;p&gt;In a &lt;a href=&quot;https://amdatalakehouse.substack.com/p/a-deep-dive-into-the-concept-and&quot;&gt;recent article, I explored the workings of Apache Iceberg catalogs&lt;/a&gt; and the existing catalogs within the ecosystem. Essentially, at the most basic level, Apache Iceberg catalogs maintain a list of tables, each linked to the location of its latest &amp;quot;metadata.json&amp;quot; file.&lt;/p&gt;
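&lt;p&gt;In other words, a catalog is, at its core, an atomically updated mapping from table names to metadata locations. The toy class below illustrates the idea, including the compare-and-swap commit that real catalogs use to serialize concurrent writers; actual implementations add namespaces, authentication, and much more:&lt;/p&gt;

```python
class ToyCatalog:
    """Minimal illustration of an Iceberg catalog: a mapping from table name
    to the latest metadata.json location, with compare-and-swap commits so
    concurrent writers cannot silently overwrite each other's changes."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, metadata_location):
        self._tables[name] = metadata_location

    def load_table(self, name):
        """Readers start here: resolve a name to its current metadata file."""
        return self._tables[name]

    def commit(self, name, expected_location, new_location):
        """Atomic swap: succeeds only if no one else committed in between."""
        if self._tables.get(name) != expected_location:
            return False  # conflict: the caller must refresh and retry
        self._tables[name] = new_location
        return True

catalog = ToyCatalog()
catalog.create_table("sales", "s3://warehouse/sales/metadata/v1.metadata.json")
ok = catalog.commit("sales",
                    "s3://warehouse/sales/metadata/v1.metadata.json",
                    "s3://warehouse/sales/metadata/v2.metadata.json")
```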
&lt;h2&gt;The Status Quo of Catalog Production&lt;/h2&gt;
&lt;p&gt;Initially, Apache Iceberg&apos;s API was predominantly Java-based, utilizing a Catalog class. Each catalog implementation would inherit from this class to ensure compatibility within the Apache Iceberg Java ecosystem. While effective, this approach faced challenges as the Iceberg API expanded into other languages like Python, Go, and Rust. Each catalog implementation had to be rewritten for these languages, often leading to inconsistencies.&lt;/p&gt;
&lt;p&gt;Moreover, at the tooling level, the reliance on different Java classes for each catalog meant that tools had to develop and test support for each catalog individually. This reliance on client-side code interactions slowed down the adoption of new catalog implementations by tools.&lt;/p&gt;
&lt;h2&gt;The REST Catalog&lt;/h2&gt;
&lt;p&gt;To address this issue, the concept of the &amp;quot;REST Catalog&amp;quot; was introduced. Its goal is to provide a single, language-agnostic interface that allows tools to interact with any catalog in any language, eliminating the need for custom connectors. In &lt;a href=&quot;https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml&quot;&gt;the current iteration of the REST Catalog specification&lt;/a&gt;, there are numerous endpoints for standard catalog operations, such as retrieving and updating metadata references, among other functions. The objective is for all catalogs to eventually achieve compatibility with this interface, enabling tools to interact seamlessly with any Iceberg catalog, whether it&apos;s an in-house, vendor, or open-source implementation.&lt;/p&gt;
&lt;h2&gt;The Future of the REST Catalog&lt;/h2&gt;
&lt;p&gt;Recently, a &lt;a href=&quot;https://lists.apache.org/thread/pqljowgy26tr0vh9xfwsth3g5z5z824k&quot;&gt;new proposal has been put forward&lt;/a&gt; to initiate discussions on the next iteration of the REST catalog, aiming to shift many operations from the client side to the server side. This change would enable catalog implementations to optimize Apache Iceberg operations across various tools and become more adaptable. It would accommodate better extensibility enabling unique features like &lt;a href=&quot;https://projectnessie.org/&quot;&gt;Nessie&apos;s Git-like catalog versioning&lt;/a&gt;, among other distinctive capabilities that different implementations might wish to introduce. Once this new iteration is refined and adopted, it will not only establish a standardized interface for all catalogs but also create an environment where these catalogs can innovate within that standard framework, experiment, and introduce a range of new advantages to Apache Iceberg.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg&apos;s &lt;a href=&quot;https://iceberg.apache.org/spec/&quot;&gt;table specification&lt;/a&gt; and catalog interface are continually evolving to support a broad, open ecosystem for constructing modular and composable data systems. There are compelling reasons to adopt Apache Iceberg as the foundation of your data lakehouse. If you haven&apos;t yet experienced building an Apache Iceberg lakehouse, I recommend &lt;a href=&quot;https://amdatalakehouse.substack.com/p/end-to-end-basic-data-engineering&quot;&gt;reading this article for a practical exercise&lt;/a&gt; you can perform on your laptop.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>End-to-End Basic Data Engineering Tutorial (Spark, Dremio, Superset)</title><link>https://iceberglakehouse.com/posts/2024-4-end-to-end-data-engineering-tutorial-spark-dremio-superset/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-4-end-to-end-data-engineering-tutorial-spark-dremio-superset/</guid><description>
Data engineering aims to make data accessible and usable for data analytics and data science purposes. This involves several key aspects:

- Transfer...</description><pubDate>Mon, 01 Apr 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Data engineering aims to make data accessible and usable for data analytics and data science purposes. This involves several key aspects:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Transferring data from operational systems like databases to systems optimized for analytical access.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Modeling and optimizing data for improved accessibility and performance.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Governing data access to ensure that only authorized individuals can access specific data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Creating abstractions to simplify data access.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This tutorial focuses on the initial step of moving data between systems, introducing various systems commonly used in modern data platforms. Specifically, &lt;a href=&quot;https://bit.ly/dremio-blog-why-lakehouse&quot;&gt;we&apos;ll explore a &amp;quot;Data Lakehouse&amp;quot; architecture&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;What is a Data Lakehouse?&lt;/h2&gt;
&lt;p&gt;In many data systems, there are two primary hubs for data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Lake:&lt;/strong&gt; A storage system like Hadoop or Object Storage (ADLS/S3) that stores structured and unstructured data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Warehouses:&lt;/strong&gt; These systems store structured data optimized for analytical workloads, in contrast to databases that are designed for transactional tasks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Data engineers typically move data from operational systems to JSON/CSV/Parquet files in the data lake, and then transfer a subset of that data to the data warehouse. However, as data volumes increased, this two-step process became time-consuming and costly, emphasizing the need for faster data delivery.&lt;/p&gt;
&lt;p&gt;The evolution involved enhancing data lake capabilities to resemble data warehouse functionalities. This included implementing components like table formats to organize data files into tables and a catalog to track these tables. These enhancements enable &lt;a href=&quot;https://bit.ly/dremio-blog-lakehouse-platform&quot;&gt;data lakehouse platforms&lt;/a&gt; like &lt;a href=&quot;https://bit.ly/am-dremio-get-started-external-blog&quot;&gt;Dremio&lt;/a&gt; to process data on the data lake as efficiently as a data warehouse.&lt;/p&gt;
&lt;h2&gt;Summary of Exercises&lt;/h2&gt;
&lt;p&gt;In this exercise, we assume our operational applications use Postgres as a database. Our goal is to migrate this data to our data lakehouse, specifically into &lt;a href=&quot;https://bit.ly/am-iceberg-101&quot;&gt;Apache Iceberg tables&lt;/a&gt; stored in Minio as our object storage and tracked by a &lt;a href=&quot;https://bit.ly/am-nessie-101&quot;&gt;Nessie catalog&lt;/a&gt;. We&apos;ll use Apache Spark to move the data into the data lake and Dremio as the query engine powering our business intelligence (BI) dashboards through &lt;a href=&quot;https://www.dremio.com/blog/bi-dashboards-101-with-dremio-and-superset/&quot;&gt;Apache Superset&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Environment Setup&lt;/h2&gt;
&lt;p&gt;To set up our environment, you will need Docker Desktop installed on your machine. Then, in an empty folder, create a &lt;code&gt;docker-compose.yml&lt;/code&gt; file with the following contents:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;version: &amp;quot;3&amp;quot;

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      de-end-to-end:
    ports:
      - 19120:19120
  # Minio Storage Server
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=storage
      - MINIO_REGION_NAME=us-east-1
      - MINIO_REGION=us-east-1
    networks:
      de-end-to-end:
    ports:
      - 9001:9001
      - 9000:9000
    command: [&amp;quot;server&amp;quot;, &amp;quot;/data&amp;quot;, &amp;quot;--console-address&amp;quot;, &amp;quot;:9001&amp;quot;]
  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010
    container_name: dremio
    networks:
      de-end-to-end:
  # Spark
  spark:
    platform: linux/x86_64
    image: alexmerced/spark35notebook:latest
    ports: 
      - 8080:8080  # Master Web UI
      - 7077:7077  # Master Port
      - 8888:8888  # Notebook
    environment:
      - AWS_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=admin #minio username
      - AWS_SECRET_ACCESS_KEY=password #minio password

    container_name: spark
    networks:
      de-end-to-end:
  # Postgres
  postgres:
    image: postgres:latest
    container_name: postgres
    environment:
      POSTGRES_DB: mydb
      POSTGRES_USER: myuser
      POSTGRES_PASSWORD: mypassword
    ports:
      - &amp;quot;5435:5432&amp;quot;
    networks:
      de-end-to-end:
  #Superset
  superset:
    image: alexmerced/dremio-superset
    container_name: superset
    networks:
      de-end-to-end:
    ports:
      - 8088:8088
networks:
  de-end-to-end:
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Breakdown of the docker-compose file&lt;/h3&gt;
&lt;p&gt;This Docker Compose file defines a set of services that work together to create a data engineering environment. Let&apos;s break down each service and its purpose:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Nessie Catalog Server (nessie):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Image: &lt;code&gt;projectnessie/nessie:latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Purpose: This service sets up a Nessie catalog server using an in-memory store.&lt;/li&gt;
&lt;li&gt;Ports: Exposes port 19120 for external communication.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Minio Storage Server (minio):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Image: &lt;code&gt;minio/minio:latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Environment Variables:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;MINIO_ROOT_USER=admin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MINIO_ROOT_PASSWORD=password&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MINIO_DOMAIN=storage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MINIO_REGION_NAME=us-east-1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MINIO_REGION=us-east-1&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Purpose: Sets up a Minio storage server for object storage.&lt;/li&gt;
&lt;li&gt;Ports: Exposes ports 9001 and 9000 for external access and uses port 9001 for the Minio console.&lt;/li&gt;
&lt;li&gt;Command: Starts the server with the specified parameters.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dremio (dremio):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Platform: &lt;code&gt;linux/x86_64&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Image: &lt;code&gt;dremio/dremio-oss:latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Ports: Exposes ports 9047, 31010, and 32010 for Dremio communication.&lt;/li&gt;
&lt;li&gt;Purpose: Sets up Dremio, a data lakehouse platform, for data processing and analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Spark (spark):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Platform: &lt;code&gt;linux/x86_64&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Image: &lt;code&gt;alexmerced/spark35notebook:latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Ports: Exposes ports 8080, 7077, and 8888 for Spark services, including the web UI, master port, and notebook.&lt;/li&gt;
&lt;li&gt;Purpose: Sets up Apache Spark for distributed data processing and analytics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Postgres (postgres):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Image: &lt;code&gt;postgres:latest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Environment Variables:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;POSTGRES_DB=mydb&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;POSTGRES_USER=myuser&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;POSTGRES_PASSWORD=mypassword&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Ports: Maps host port 5435 to Postgres&apos;s default container port 5432 for external access.&lt;/li&gt;
&lt;li&gt;Purpose: Sets up a Postgres database with a specified database name, username, and password.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Superset (superset):&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Image: &lt;code&gt;alexmerced/dremio-superset&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Ports: Exposes port 8088 for Superset access.&lt;/li&gt;
&lt;li&gt;Purpose: Sets up Apache Superset, a data visualization and exploration platform, for creating BI dashboards.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Additionally, the file defines a network called &lt;code&gt;de-end-to-end&lt;/code&gt; that connects all the services together, allowing them to communicate with each other within the Docker environment.&lt;/p&gt;
&lt;p&gt;This Docker Compose file creates a comprehensive data engineering environment with services for data storage, processing, analytics, and visualization.&lt;/p&gt;
&lt;h2&gt;Populating the Postgres Database&lt;/h2&gt;
&lt;p&gt;The first step is to populate our Postgres database with some data to represent operational data.&lt;/p&gt;
&lt;h3&gt;1. Spin up the Postgres Service:&lt;/h3&gt;
&lt;p&gt;Open a terminal, navigate to the directory containing the Docker Compose file, and run the following command to start the Postgres service:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker-compose up postgres
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;2. Access the Postgres Shell:&lt;/h3&gt;
&lt;p&gt;After the Postgres service is running, you can access the Postgres shell using the following command in another terminal:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;docker exec -it postgres psql -U myuser mydb
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Enter the password when prompted (use &lt;code&gt;mypassword&lt;/code&gt; in this example).&lt;/p&gt;
&lt;h3&gt;3. Create a Table and Add Data:&lt;/h3&gt;
&lt;p&gt;Once you&apos;re in the Postgres shell, you can create a table and add data. Here&apos;s an example SQL script:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Create a table for a mock BI dashboard dataset
CREATE TABLE sales_data (
    id SERIAL PRIMARY KEY,
    product_name VARCHAR(255),
    category VARCHAR(50),
    sales_amount DECIMAL(10, 2),
    sales_date DATE
);

-- Insert sample data into the table
INSERT INTO sales_data (product_name, category, sales_amount, sales_date)
VALUES
    (&apos;Product A&apos;, &apos;Electronics&apos;, 1000.50, &apos;2024-03-01&apos;),
    (&apos;Product B&apos;, &apos;Clothing&apos;, 750.25, &apos;2024-03-02&apos;),
    (&apos;Product C&apos;, &apos;Home Goods&apos;, 1200.75, &apos;2024-03-03&apos;),
    (&apos;Product D&apos;, &apos;Electronics&apos;, 900.00, &apos;2024-03-04&apos;),
    (&apos;Product E&apos;, &apos;Clothing&apos;, 600.50, &apos;2024-03-05&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run the above SQL script in the Postgres shell to create the &lt;code&gt;sales_data&lt;/code&gt; table and populate it with sample data for a mock BI dashboard. Exit the Postgres shell with the command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;\q
&lt;/code&gt;&lt;/pre&gt;
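&lt;p&gt;As a sanity check, running &lt;code&gt;SELECT category, SUM(sales_amount) FROM sales_data GROUP BY category;&lt;/code&gt; in the Postgres shell should match the per-category totals computed below in plain Python (independent of the database) from the five sample rows:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-py&quot;&gt;from collections import defaultdict

# The five sample rows inserted above: (product, category, amount)
rows = [
    (&apos;Product A&apos;, &apos;Electronics&apos;, 1000.50),
    (&apos;Product B&apos;, &apos;Clothing&apos;, 750.25),
    (&apos;Product C&apos;, &apos;Home Goods&apos;, 1200.75),
    (&apos;Product D&apos;, &apos;Electronics&apos;, 900.00),
    (&apos;Product E&apos;, &apos;Clothing&apos;, 600.50),
]

totals = defaultdict(float)
for _, category, amount in rows:
    totals[category] += amount

for category, total in sorted(totals.items()):
    print(category, round(total, 2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This prints Clothing 1350.75, Electronics 1900.5, and Home Goods 1200.75, which is what the &lt;code&gt;GROUP BY&lt;/code&gt; query should return as well.&lt;/p&gt;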
&lt;h2&gt;Moving the Data to the Data Lake with Spark&lt;/h2&gt;
&lt;p&gt;Next, we need to move the data to our data lake, so we need to spin up the following services:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;minio: This will be our storage layer, an object storage service for holding all our files.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;nessie: This will be our Apache Iceberg catalog, tracking our tables and the location of each one&apos;s latest metadata file in our storage.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;spark: This will run Apache Spark, a data processing framework, alongside a Python notebook server where we can write code that sends processing instructions to Spark.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;1. Starting Up Our Data Lake&lt;/h3&gt;
&lt;p&gt;To run these services (along with Dremio, which we&apos;ll use shortly), open an available terminal and run the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker compose up spark nessie minio dremio
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Keep an eye on the terminal output: the URL for accessing the Python notebook server will appear there, and you will need it to reach the server running on &lt;code&gt;localhost:8888&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;spark   | [I 2024-04-01 15:02:50.052 ServerApp]     http://127.0.0.1:8888/lab?token=bdc8479a80be54e723eb636e1b62de141a553b75e984a9da
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Open the URL in your browser and you&apos;ll be able to create a new notebook, which we&apos;ll add some code to later on.&lt;/p&gt;
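&lt;p&gt;If you only need the token itself (for example, to paste into the login form at &lt;code&gt;localhost:8888&lt;/code&gt;), it can be pulled out of the printed URL with Python&apos;s standard library. A small sketch using the example URL above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-py&quot;&gt;from urllib.parse import urlparse, parse_qs

# The URL printed by the notebook server in the terminal output
url = &apos;http://127.0.0.1:8888/lab?token=bdc8479a80be54e723eb636e1b62de141a553b75e984a9da&apos;

# Parse the query string and grab the token parameter
token = parse_qs(urlparse(url).query)[&apos;token&apos;][0]
print(token)
&lt;/code&gt;&lt;/pre&gt;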
&lt;h3&gt;2. Creating a Bucket in Our Data Lake&lt;/h3&gt;
&lt;p&gt;Head over to &lt;code&gt;localhost:9001&lt;/code&gt; and enter the username &lt;code&gt;admin&lt;/code&gt; and the password &lt;code&gt;password&lt;/code&gt; to access the MinIO console, where you can create a new bucket called &amp;quot;warehouse&amp;quot;.&lt;/p&gt;
&lt;h3&gt;3. Running the PySpark Script&lt;/h3&gt;
&lt;p&gt;In the notebook you created earlier, add a cell with the following code and run it:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-py&quot;&gt;import pyspark
from pyspark.sql import SparkSession
import os


## DEFINE SENSITIVE VARIABLES
CATALOG_URI = &amp;quot;http://nessie:19120/api/v1&amp;quot; ## Nessie Server URI
WAREHOUSE = &amp;quot;s3://warehouse/&amp;quot; ## S3 Address to Write to
STORAGE_URI = &amp;quot;http://minio:9000&amp;quot;


conf = (
    pyspark.SparkConf()
        .setAppName(&apos;app_name&apos;)
  		#packages
        .set(&apos;spark.jars.packages&apos;, &apos;org.postgresql:postgresql:42.7.3,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.77.1,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8&apos;)
  		#SQL Extensions
        .set(&apos;spark.sql.extensions&apos;, &apos;org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions&apos;)
  		#Configuring Catalog
        .set(&apos;spark.sql.catalog.nessie&apos;, &apos;org.apache.iceberg.spark.SparkCatalog&apos;)
        .set(&apos;spark.sql.catalog.nessie.uri&apos;, CATALOG_URI)
        .set(&apos;spark.sql.catalog.nessie.ref&apos;, &apos;main&apos;)
        .set(&apos;spark.sql.catalog.nessie.authentication.type&apos;, &apos;NONE&apos;)
        .set(&apos;spark.sql.catalog.nessie.catalog-impl&apos;, &apos;org.apache.iceberg.nessie.NessieCatalog&apos;)
        .set(&apos;spark.sql.catalog.nessie.s3.endpoint&apos;, STORAGE_URI)
        .set(&apos;spark.sql.catalog.nessie.warehouse&apos;, WAREHOUSE)
        .set(&apos;spark.sql.catalog.nessie.io-impl&apos;, &apos;org.apache.iceberg.aws.s3.S3FileIO&apos;)

)

## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(&amp;quot;Spark Running&amp;quot;)

# Define the JDBC URL for the Postgres database
jdbc_url = &amp;quot;jdbc:postgresql://postgres:5432/mydb&amp;quot;
properties = {
    &amp;quot;user&amp;quot;: &amp;quot;myuser&amp;quot;,
    &amp;quot;password&amp;quot;: &amp;quot;mypassword&amp;quot;,
    &amp;quot;driver&amp;quot;: &amp;quot;org.postgresql.Driver&amp;quot;
}

# Load the table from Postgres
postgres_df = spark.read.jdbc(url=jdbc_url, table=&amp;quot;sales_data&amp;quot;, properties=properties)

# Write the DataFrame to an Iceberg table
postgres_df.writeTo(&amp;quot;nessie.sales_data&amp;quot;).createOrReplace()

# Show the contents of the Iceberg table
spark.read.table(&amp;quot;nessie.sales_data&amp;quot;).show()

# Stop the Spark session
spark.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;If you run into an &amp;quot;Unknown Host&amp;quot; issue using &lt;code&gt;http://minio:9000&lt;/code&gt;, there may be a problem with the DNS in your Docker network that matches the name &lt;code&gt;minio&lt;/code&gt; to the IP address of the container on the Docker network. In that case, replace &lt;code&gt;minio&lt;/code&gt; with the container&apos;s IP address. You can look up the IP address with &lt;code&gt;docker inspect minio&lt;/code&gt; (it appears in the network section of the output) and update the STORAGE_URI variable, for example &lt;code&gt;STORAGE_URI = &amp;quot;http://172.18.0.6:9000&amp;quot;&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
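&lt;p&gt;You can check name resolution from inside the notebook itself with a small Python probe (a troubleshooting sketch; the hostname &lt;code&gt;minio&lt;/code&gt; only resolves from containers attached to the same Docker network):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-py&quot;&gt;import socket

def resolve(hostname):
    # Return the IP address for hostname, or None if the DNS lookup fails
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

ip = resolve(&apos;minio&apos;)
if ip is None:
    print(&apos;minio does not resolve; fall back to the container IP from docker inspect&apos;)
else:
    print(&apos;minio resolves to&apos;, ip)
&lt;/code&gt;&lt;/pre&gt;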
&lt;h3&gt;Breakdown of the PySpark Code&lt;/h3&gt;
&lt;p&gt;This PySpark script demonstrates how to configure a Spark session to integrate with Apache Iceberg and Nessie, read data from a PostgreSQL database, and write it to an Iceberg table managed by Nessie.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Import necessary modules:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pyspark&lt;/code&gt;: The main PySpark library.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SparkSession&lt;/code&gt;: The entry point to programming Spark with the Dataset and DataFrame API.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Define sensitive variables:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CATALOG_URI&lt;/code&gt;: The URI for the Nessie server.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WAREHOUSE&lt;/code&gt;: The S3 bucket URI where the Iceberg tables will be stored.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;STORAGE_URI&lt;/code&gt;: The URI of the S3-compatible storage, in this case, a MinIO instance reachable at &lt;code&gt;http://minio:9000&lt;/code&gt; on the Docker network.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure Spark session:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Set the application name.&lt;/li&gt;
&lt;li&gt;Specify necessary packages (&lt;code&gt;spark.jars.packages&lt;/code&gt;) including PostgreSQL JDBC driver, Iceberg, Nessie, and AWS SDK.&lt;/li&gt;
&lt;li&gt;Enable required SQL extensions for Iceberg and Nessie (&lt;code&gt;spark.sql.extensions&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Configure Nessie catalog settings such as URI, reference branch, authentication type, and implementation class.&lt;/li&gt;
&lt;li&gt;Set the S3 endpoint for Nessie to communicate with the S3-compatible storage (MinIO).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Start the Spark session:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;SparkSession&lt;/code&gt; is initialized with the above configuration.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Database connection setup:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Define the JDBC URL for the PostgreSQL database.&lt;/li&gt;
&lt;li&gt;Set connection properties including user, password, and driver.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data ingestion from PostgreSQL:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read data from the &lt;code&gt;sales_data&lt;/code&gt; table in PostgreSQL into a DataFrame (&lt;code&gt;postgres_df&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write data to an Iceberg table:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Write the DataFrame to an Iceberg table named &lt;code&gt;sales_data&lt;/code&gt; in the Nessie catalog.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Read and display the Iceberg table:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Read the newly created Iceberg table from the Nessie catalog and display its contents.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stop the Spark session:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Terminate the Spark session to release resources.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Can This Be Easier?&lt;/h3&gt;
&lt;p&gt;Apache Spark, while a standard tool for the data engineer, can be tedious to configure and troubleshoot. We could alternatively use our data lakehouse platform, Dremio, to handle the ingestion of the data with simple SQL statements. To see examples of this, check out the following tutorials:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-postgres-to-dashboards-with-dremio-and-apache-iceberg/&quot;&gt;From Postgres -&amp;gt; Dremio -&amp;gt; Dashboards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-sqlserver-to-dashboards-with-dremio-and-apache-iceberg/&quot;&gt;From SQLServer -&amp;gt; Dremio -&amp;gt; Dashboards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/from-mongodb-to-dashboards-with-dremio-and-apache-iceberg/&quot;&gt;From MongoDB -&amp;gt; Dremio -&amp;gt; Dashboards&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/bi-dashboards-with-apache-iceberg-using-aws-glue-and-apache-superset/&quot;&gt;From AWS Glue -&amp;gt; Dremio -&amp;gt; Dashboard&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Connecting Our Data to Dremio&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/solutions/data-lakehouse/&quot;&gt;Dremio is a powerful data lakehouse platform&lt;/a&gt; that can connect to many data sources across cloud and on-prem environments and deliver the data anywhere you need it, like &lt;a href=&quot;https://www.dremio.com/blog/bi-dashboards-101-with-dremio-and-superset/&quot;&gt;BI Dashboards&lt;/a&gt; and &lt;a href=&quot;https://www.dremio.com/blog/connecting-to-dremio-using-apache-arrow-flight-in-python/&quot;&gt;Python notebooks&lt;/a&gt;. We will use Dremio to process the queries that power our BI dashboards.&lt;/p&gt;
&lt;p&gt;Now, head to &lt;code&gt;localhost:9047&lt;/code&gt; in your browser to set up your Dremio admin account. Once set up, click “Add a Source” and select “Nessie” as the source. Enter the following settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;General settings tab
&lt;ul&gt;
&lt;li&gt;Source Name: nessie&lt;/li&gt;
&lt;li&gt;Nessie Endpoint URL: http://nessie:19120/api/v2&lt;/li&gt;
&lt;li&gt;Auth Type: None&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Storage settings tab
&lt;ul&gt;
&lt;li&gt;AWS Root Path: warehouse&lt;/li&gt;
&lt;li&gt;AWS Access Key: admin&lt;/li&gt;
&lt;li&gt;AWS Secret Key: password&lt;/li&gt;
&lt;li&gt;Uncheck “Encrypt Connection” Box (since we aren’t using SSL)&lt;/li&gt;
&lt;li&gt;Connection Properties
&lt;ul&gt;
&lt;li&gt;Key: fs.s3a.path.style.access | Value: true&lt;/li&gt;
&lt;li&gt;Key: fs.s3a.endpoint | Value: minio:9000&lt;/li&gt;
&lt;li&gt;Key: dremio.s3.compat | Value: true&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Click on “Save,” and the source will be added to Dremio. You can then run full DDL and DML SQL against it. Dremio turns your data lake into a data warehouse—a data lakehouse!&lt;/p&gt;
&lt;p&gt;Now we can connect Superset and build BI dashboards over any data we have connected to Dremio, which can include not only our data lake but also many sources like Postgres, SQL Server, MongoDB, Elasticsearch, Snowflake, Hadoop, ADLS, S3, AWS Glue, Hive, and much more!&lt;/p&gt;
&lt;h2&gt;Building our BI Dashboard&lt;/h2&gt;
&lt;p&gt;Dremio can be used with most existing BI tools, with one-click integrations in the user interface for tools like Tableau and Power BI. We will use an open-source option in Superset for this exercise, but any BI tool would have a similar experience. Let&apos;s run the Superset service:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker compose up superset
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We need to initialize Superset, so open another terminal and run this command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;docker exec -it superset superset init
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This may take a few minutes to finish initializing, but once it is done you can head over to &lt;code&gt;localhost:8080&lt;/code&gt; and log in to Superset with the username “&lt;code&gt;admin&lt;/code&gt;” and password “&lt;code&gt;admin&lt;/code&gt;”. Once you are in, click on “Settings” and select “Database Connections”.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Add a New Database&lt;/li&gt;
&lt;li&gt;Select “Other”&lt;/li&gt;
&lt;li&gt;Use the following connection string (make sure to include Dremio username and password in URL):&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;dremio+flight://USERNAME:PASSWORD@dremio:32010/?UseEncryption=false
&lt;/code&gt;&lt;/pre&gt;
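&lt;p&gt;If your Dremio password contains characters like &lt;code&gt;@&lt;/code&gt; or &lt;code&gt;/&lt;/code&gt;, they must be percent-encoded before going into the connection string. A small sketch using Python&apos;s standard library (the credentials below are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-py&quot;&gt;from urllib.parse import quote_plus

def dremio_uri(username, password, host=&apos;dremio&apos;, port=32010):
    # Percent-encode the credentials so special characters survive in the URI
    return (&apos;dremio+flight://&apos; + quote_plus(username) + &apos;:&apos; +
            quote_plus(password) + &apos;@&apos; + host + &apos;:&apos; + str(port) +
            &apos;/?UseEncryption=false&apos;)

print(dremio_uri(&apos;admin&apos;, &apos;p@ss/word&apos;))
&lt;/code&gt;&lt;/pre&gt;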
&lt;ul&gt;
&lt;li&gt;Test connection&lt;/li&gt;
&lt;li&gt;Save connection&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next step is to add a dataset by clicking on the + icon in the upper right corner and selecting “create dataset”. From here, choose the table you want to add to Superset, which is, in this case, our sales_data table.&lt;/p&gt;
&lt;p&gt;We can then click the + to add charts based on the datasets we’ve added. Once we create the charts we want we can add them to a dashboard, and that’s it! You’ve now taken data from an operational database, ingested it into your data lake, and served a BI dashboard using the data.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In conclusion, this guide has journeyed through the critical steps of data engineering, from moving data between operational systems and analytical platforms to leveraging modern data architectures like the data lakehouse. By utilizing tools such as Apache Iceberg, Nessie, MinIO, Apache Spark, and Dremio, we&apos;ve demonstrated how to efficiently migrate data from a traditional database like Postgres into a scalable and manageable data lakehouse environment. Furthermore, the integration of Apache Superset for BI dashboarding illustrates the seamless end-to-end data workflow.&lt;/p&gt;
&lt;p&gt;Here are many other tutorials and resources to help you learn even more about the data engineering world.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PLsLAVBjQJO0p_4Nqz99tIjeoDYE97L0xY&amp;amp;si=gVaGFq4cDgIthTfz&quot;&gt;Video: Data 101 Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=CKTGkQbryX8&quot;&gt;Video: Using Dremio with Deepnote Collaborative Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=sglNHVg42ns&quot;&gt;Video: Using Dremio with Hex Collaborative Notebooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=KE0DkxF-GI8&quot;&gt;Video: Using Dremio Cloud with dbt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=p8UrOsnBg6Q&quot;&gt;Video: Using Dremio Software with dbt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;&quot;&gt;Blog: Running Graph Queries on your Apache Iceberg Tables with Puppygraph &amp;amp; Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=CLde-63N2bc&quot;&gt;Video: Branching and Merging with Nessie&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://youtube.com/playlist?list=PL-gIUf9e9CCsBMa0DN2_oVicpcUYXuSMT&amp;amp;si=ju-75Z-lOt95kYpD&quot;&gt;Video: Dremio Demonstrations Playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/developer-advocacy-dremio/quick-guides-from-dremio&quot;&gt;Reference: Dremio Quick Guides Repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>5 Reasons Dremio is the Ideal Apache Iceberg Lakehouse Platform</title><link>https://iceberglakehouse.com/posts/2024-3-5-reasons-dremio-is-the-ideal-iceberg-lakehouse-platform/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-3-5-reasons-dremio-is-the-ideal-iceberg-lakehouse-platform/</guid><description>
[The Apache Iceberg table format](https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/) has s...</description><pubDate>Sat, 09 Mar 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;The Apache Iceberg table format&lt;/a&gt; has seen an impressive expansion in its compatibility with a vast spectrum of data platforms and tools. Among these, &lt;a href=&quot;https://www.dremio.com/resources/topic/apache-iceberg/&quot;&gt;Dremio stands out as a pioneer&lt;/a&gt;, having &lt;a href=&quot;https://amdatalakehouse.substack.com/p/the-apache-iceberg-lakehouse-the&quot;&gt;embraced Apache Iceberg early on&lt;/a&gt;. In this article, we delve into the multitude of ways Dremio has seamlessly integrated Apache Iceberg, establishing itself as one of &lt;a href=&quot;https://www.dremio.com/solutions/data-lakehouse/&quot;&gt;the most formidable platforms for Iceberg lakehouses&lt;/a&gt; available today.&lt;/p&gt;
&lt;h2&gt;Reason 1: Dataset Promotion&lt;/h2&gt;
&lt;p&gt;Dremio&apos;s integration with Apache Iceberg began with its innovative dataset promotion feature, which significantly enhances how you interact with file-based datasets on your data lake. When you connect a data lake source containing a folder of Parquet data, Dremio allows you to elevate it to table status. This capability is not just about organization; it&apos;s about performance. Leveraging Dremio&apos;s high-performance scan capability, driven by its Apache Arrow-based query engine, dataset promotion reaches a new level of efficiency. Notably, &lt;a href=&quot;https://www.dremio.com/blog/the-origins-of-apache-arrow-its-fit-in-todays-data-landscape/&quot;&gt;Apache Arrow isn&apos;t just an adopted technology for Dremio; it&apos;s foundational, having originated as Dremio&apos;s in-memory processing format&lt;/a&gt;. When a Parquet table is promoted, Dremio discreetly crafts a layer of Apache Iceberg metadata. This layer is not merely a catalog; it&apos;s a powerful index that enables swift and efficient querying, epitomizing how Dremio&apos;s deep integration with Apache Iceberg makes it a powerhouse for managing Iceberg lakehouses.&lt;/p&gt;
&lt;h2&gt;Reason 2: Data Reflections&lt;/h2&gt;
&lt;p&gt;Dremio has a powerful feature known as data reflections, which effectively &lt;a href=&quot;https://www.dremio.com/blog/bi-dashboard-acceleration-cubes-extracts-and-dremios-reflections/&quot;&gt;eliminates the need for traditional materialized views, BI extracts, and cubes&lt;/a&gt;. This feature allows users to activate reflections on any table or view within Dremio, which then creates an optimized Apache Iceberg table within your data lake. Dremio maintains a keen awareness of the materialized data&apos;s relationship to the original table or view, allowing it to seamlessly substitute the optimized version during queries. This substitution process enhances query speed without burdening the end user with the complexity of juggling multiple namespaces across various materialized views.&lt;/p&gt;
&lt;p&gt;Data reflections are not a one-size-fits-all solution; they are highly customizable. Data engineers can easily, via SQL or point-and-click, specify which columns to track, define custom sorting and partitioning rules, and choose whether to reflect raw data or aggregated metrics, tailoring the reflection to provide optimal query performance. These reflections are not static; they&apos;re dynamically updated on a schedule that you control, with upcoming features allowing manual updates via a REST API. You can even create your own materializations externally and register them with Dremio so it can use them for acceleration.&lt;/p&gt;
&lt;p&gt;Moreover, Dremio&apos;s intelligent reflection recommender system advises on which reflections will maximize price performance based on your cluster&apos;s query patterns. This feature not only accelerates query performance but does so in a manner that is straightforward for data engineers to implement and for data analysts to utilize, simplifying the entire data interaction process while delivering rapid results.&lt;/p&gt;
&lt;h2&gt;Reason 3: First-Class Apache Iceberg Support&lt;/h2&gt;
&lt;p&gt;Dremio elevates the utility of Apache Iceberg, offering its benefits to all users, regardless of whether they operate an Iceberg Lakehouse. Through features like dataset promotion and data reflections, users engage with Apache Iceberg&apos;s advantages seamlessly, without needing to delve into its complexities. However, Dremio doesn&apos;t stop at seamless integration; it offers comprehensive capabilities to interact directly with Apache Iceberg tables in your data lakehouse.&lt;/p&gt;
&lt;p&gt;Dremio&apos;s full DDL (Data Definition Language) support allows users to create and modify Iceberg tables, while its complete DML (Data Manipulation Language) capabilities enable inserting, upserting, deleting, and updating data within Apache Iceberg tables. The platform supports a variety of options for your Apache Iceberg catalog, enhancing flexibility and integration.&lt;/p&gt;
&lt;p&gt;A standout feature of working directly with Iceberg tables in Dremio is the elimination of the need for setting a metadata refresh cadence, as required for promoted datasets. The catalog becomes the authoritative source for Apache Iceberg metadata, ensuring that users always access the most current metadata, especially when working with service catalogs like Nessie, AWS Glue, Hive, and others.&lt;/p&gt;
&lt;p&gt;This capability transforms Dremio from a platform solely focused on read-only analytics to a comprehensive solution capable of handling data ingestion, curation, and analytics. Users can import data from various sources into their Iceberg tables, join Iceberg tables with data from other databases and data warehouses, and build a &lt;a href=&quot;https://www.dremio.com/platform/unified-analytics/&quot;&gt;cohesive semantic layer—all within the Dremio platform&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Reason 4: Enhanced Catalog Versioning&lt;/h2&gt;
&lt;p&gt;Dremio not only supports a diverse array of Apache Iceberg catalog options but also continuously expands its offerings. A notable aspect of &lt;a href=&quot;https://www.dremio.com/blog/managing-data-as-code-with-dremio-arctic-easily-ensure-data-quality-in-your-data-lakehouse/&quot;&gt;Dremio&apos;s catalog capabilities&lt;/a&gt; is its integration with the open-source Nessie technology, alongside support for externally managed Nessie catalogs. &lt;a href=&quot;https://www.dremio.com/blog/what-is-nessie-catalog-versioning-and-git-for-data/&quot;&gt;Nessie introduces a robust versioning system for Apache Iceberg tables&lt;/a&gt;, enabling users to segregate work across different branches within the catalog. This feature is a game-changer, offering functionalities such as multi-table transactions, isolation for ingestion tasks—ideal for DataOps patterns, the creation of zero-copy environments, multi-table rollbacks, and tagging for simplified data replication.&lt;/p&gt;
&lt;p&gt;The integration of Nessie within Dremio&apos;s catalog doesn&apos;t just enhance Dremio&apos;s functionality; it extends its utility beyond its ecosystem. Since Dremio&apos;s catalog leverages Nessie, it&apos;s accessible for both reads and writes via other tools like Apache Spark and Apache Flink, broadening the scope of your Apache Iceberg workflows. This interoperability, empowered by catalog versioning, provides unparalleled flexibility and control, allowing you to streamline your data processes and collaborate more effectively across various platforms.&lt;/p&gt;
&lt;h2&gt;Reason 5: Streamlined Lakehouse Management&lt;/h2&gt;
&lt;p&gt;Dremio simplifies the &lt;a href=&quot;https://www.dremio.com/blog/what-is-lakehouse-management-git-for-data-automated-apache-iceberg-table-maintenance-and-more/&quot;&gt;intricacies of lakehouse management&lt;/a&gt; with its suite of built-in table optimization features. Users can effortlessly execute compaction or manage snapshot expiration across any supported Iceberg catalog, utilizing Dremio&apos;s intuitive OPTIMIZE and VACUUM SQL commands. The convenience doesn&apos;t stop there; for tables cataloged within Dremio&apos;s integrated system, the platform offers automation for table management tasks. This means that Dremio can be configured to routinely manage table optimization on a set schedule, ensuring your tables are always operating efficiently without the need for constant oversight. This automation not only streamlines operations but also ensures that your data is always primed for optimal performance, allowing you to focus on deriving insights rather than managing the underlying infrastructure.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;In conclusion, Dremio&apos;s adept integration with Apache Iceberg provides a robust, feature-rich platform that significantly enhances lakehouse architecture and management. From advanced dataset promotion and revolutionary data reflections to first-class Apache Iceberg support, enhanced catalog versioning, and streamlined lakehouse management, Dremio offers an array of tools that empower users to optimize their data operations efficiently. Whether you&apos;re looking to accelerate queries, manage complex data transformations, or ensure seamless data governance and versioning, Dremio&apos;s innovative features and user-friendly approach make it an ideal choice for anyone looking to leverage the full potential of Apache Iceberg lakehouses.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Try by Creating an Iceberg/Dremio Lakehouse on Your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html&quot;&gt;Download a Free Copy of Apache Iceberg the Definitive Guide&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>The Apache Iceberg Lakehouse - The Great Data Equalizer</title><link>https://iceberglakehouse.com/posts/2024-3-06-apache-iceberg-the-great-data-equalizer/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-3-06-apache-iceberg-the-great-data-equalizer/</guid><description>
&gt; [Get a Free Copy of &quot;Apache Iceberg: The Definitive Guide&quot;](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html)

&gt; [Build an ...</description><pubDate>Wed, 06 Mar 2024 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html&quot;&gt;Get a Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Build an Iceberg Lakehouse on Your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Apache Iceberg is an open-source table format&lt;/a&gt; designed for data lakehouse architectures, enabling the organization of data on data lakes in a manner similar to tables found in traditional databases and data warehouses. This innovative table format provides a crucial abstraction layer, allowing users to leverage database-like features on their data lakes. Among its key features are ACID transactions, which ensure data integrity and consistency, time travel capabilities that allow users to access historical data snapshots, and robust table evolution mechanisms for managing partitions and schema changes. By integrating these functionalities, &lt;a href=&quot;https://iceberg.apache.org/&quot;&gt;Apache Iceberg&lt;/a&gt; transforms data lakes into more structured and manageable environments, facilitating advanced data analytics and management tasks.&lt;/p&gt;
&lt;h2&gt;The Role of Catalogs in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;The catalog mechanism is a cornerstone in the functionality of Apache Iceberg tables, providing a crucial layer of organization and accessibility, even though the specifics of its implementation are beyond the Iceberg specification. A catalog in Iceberg serves as a registry for tables, tracking Iceberg tables to ensure they are discoverable by compatible tools, thereby facilitating seamless integration and usage. Moreover, catalogs are instrumental in maintaining a consistent view of the data, which is essential for the integrity and reliability of ACID transactions. By acting as the source of truth, catalogs enable concurrent transactions to reference them and &lt;a href=&quot;https://www.dremio.com/subsurface/the-life-of-a-write-query-for-apache-iceberg-tables/&quot;&gt;ascertain the status of other ongoing transactions, effectively determining if another transaction was committed while theirs was in progress&lt;/a&gt;. This mechanism is vital for maintaining data consistency and ensuring that the system provides ACID guarantees, thereby enhancing the robustness and reliability of data operations within the Iceberg ecosystem.&lt;/p&gt;
&lt;h3&gt;Challenges with Catalogs and Choosing the Right Catalog for You&lt;/h3&gt;
&lt;p&gt;When selecting a catalog for Apache Iceberg, several key factors should guide your decision: compatibility with your current tools, the additional functionalities or integrations the catalog offers, and its maintainability. It&apos;s crucial to choose only one catalog for managing your tables because, upon the completion of a transaction, only the active catalog is updated. Utilizing multiple catalogs could result in them referencing outdated table states, leading to consistency issues. If there is an absolute need to employ multiple catalogs, a feasible approach is to designate a single catalog for write operations while others are used solely for reading. However, this setup demands the implementation of custom systems to synchronize the read-only catalogs with the primary to ensure they reflect the most current table state, maintaining consistency to the greatest extent possible.&lt;/p&gt;
&lt;h3&gt;Service and File-System Catalogs&lt;/h3&gt;
&lt;p&gt;The distinction between a &lt;a href=&quot;https://www.youtube.com/watch?v=4hcfveg1t70&quot;&gt;service catalog and a file-system&lt;/a&gt; catalog in Apache Iceberg is fundamental to understanding their operational dynamics and use cases. Service catalogs, which constitute the majority, involve a running service that can be either self-managed or cloud-managed. These catalogs utilize a backing store to maintain all references to Iceberg tables, with locking mechanisms in place to enforce ACID guarantees. This setup ensures that when a table is modified, the catalog updates references accurately, preventing conflicting changes from being committed.&lt;/p&gt;
&lt;p&gt;On the other hand, the &amp;quot;Hadoop Catalog&amp;quot; represents a file-system catalog that is compatible with any storage system. Unlike service catalogs, the Hadoop Catalog does not rely on a backing store but instead uses a file named &amp;quot;version-hint.text&amp;quot; on the file system to track the latest version of the table. This file must be updated whenever the table changes. However, since not all storage systems provide the same level of atomicity and consistency in file replacement, this method can lead to potential inconsistencies, especially in environments with high concurrency. Therefore, while the Hadoop Catalog might be suitable for evaluating Iceberg&apos;s capabilities, it is generally not recommended for production use due to these potential consistency issues.&lt;/p&gt;
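&lt;p&gt;The version-hint mechanism is simple enough to sketch end to end: the latest table version lives in a small text file next to the metadata files, and every commit rewrites it. This is a simplified illustration of the idea, not the actual Hadoop catalog code.&lt;/p&gt;

```python
import os
import tempfile

# A file-system catalog needs no running service: a small "version-hint.text"
# file in the table's metadata directory records the latest version.
warehouse = tempfile.mkdtemp()
table_dir = os.path.join(warehouse, "db", "events", "metadata")
os.makedirs(table_dir)

def commit_version(version):
    with open(os.path.join(table_dir, f"v{version}.metadata.json"), "w") as f:
        f.write("{}")
    # This file replacement is NOT atomic on every storage system, which is
    # precisely why concurrent writers can leave the table inconsistent.
    with open(os.path.join(table_dir, "version-hint.text"), "w") as f:
        f.write(str(version))

def current_version():
    with open(os.path.join(table_dir, "version-hint.text")) as f:
        return int(f.read())

commit_version(1)
commit_version(2)
assert current_version() == 2
```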
&lt;h2&gt;A tour of Apache Iceberg Catalogs&lt;/h2&gt;
&lt;p&gt;In the following section, we will delve into the diverse range of catalogs currently available within the Apache Iceberg ecosystem. Each catalog offers unique features, integrations, and compatibility with different data storage systems and processing engines. Understanding the nuances of these catalogs is crucial for architects and developers to make informed decisions that align with their specific data infrastructure needs and operational goals. We&apos;ll explore a variety of catalogs, including those that are widely adopted in the industry as well as emerging options, highlighting their respective advantages, use cases, and limitations.&lt;/p&gt;
&lt;h3&gt;Nessie&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://projectnessie.org/&quot;&gt;Nessie is an innovative open-source catalog&lt;/a&gt; that extends beyond the traditional catalog capabilities in the Apache Iceberg ecosystem, introducing &lt;a href=&quot;https://www.dremio.com/blog/what-is-nessie-catalog-versioning-and-git-for-data/&quot;&gt;git-like features to data management&lt;/a&gt;. This catalog not only tracks table metadata but also allows users to capture commits at a holistic level, enabling advanced operations such as multi-table transactions, rollbacks, branching, and tagging. These features provide a new layer of flexibility and control over data changes, resembling version control systems in software development.&lt;/p&gt;
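&lt;p&gt;Nessie&apos;s git-like model can be illustrated with a toy catalog where a commit captures the state of many tables at once and branches are just named pointers to commits. The sketch below is a conceptual simulation, not Nessie&apos;s actual API.&lt;/p&gt;

```python
import itertools

class ToyNessie:
    """Git-like catalog sketch: a commit records the metadata pointers of
    every table, and branches are named pointers into the commit history."""
    def __init__(self):
        self._ids = itertools.count(1)
        self.commits = {0: {}}          # commit id to {table: pointer}
        self.branches = {"main": 0}

    def create_branch(self, name, source="main"):
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        parent = self.commits[self.branches[branch]]
        new_state = dict(parent)
        new_state.update(changes)       # multi-table: all changes land together
        cid = next(self._ids)
        self.commits[cid] = new_state
        self.branches[branch] = cid

    def table(self, branch, name):
        return self.commits[self.branches[branch]].get(name)

nessie = ToyNessie()
nessie.commit("main", {"orders": "v1"})
nessie.create_branch("etl")
nessie.commit("etl", {"orders": "v2", "customers": "v1"})  # isolated work
assert nessie.table("main", "customers") is None           # main is untouched
nessie.branches["main"] = nessie.branches["etl"]           # publish the branch
assert nessie.table("main", "orders") == "v2"
```

&lt;p&gt;Publishing the branch exposes changes to two tables in one atomic step, which is the multi-table-transaction property that table-level versioning alone cannot provide.&lt;/p&gt;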
&lt;p&gt;Nessie can be either self-managed or cloud-managed, with the latter option available through the Dremio Lakehouse Platform. Dremio integrates Nessie into its &lt;a href=&quot;https://www.dremio.com/blog/managing-data-as-code-with-dremio-arctic-easily-ensure-data-quality-in-your-data-lakehouse/&quot;&gt;Dremio Cloud product&lt;/a&gt;, offering a seamless experience that includes automated table optimization alongside Nessie&apos;s robust cataloging capabilities. This integration underscores Nessie&apos;s versatility and its potential to enhance data governance and management in modern data architectures.&lt;/p&gt;
&lt;h3&gt;Hive&lt;/h3&gt;
&lt;p&gt;The Hive catalog offers a seamless integration pathway for organizations already utilizing Hive in their data architectures, allowing them to leverage their existing Hive metastore as an Apache Iceberg catalog. This integration facilitates a smooth transition to Iceberg&apos;s advanced features while maintaining compatibility with the existing Hive ecosystem. By using the Hive catalog, users can avoid the redundancy of maintaining separate metadata stores, streamlining their data management processes. However, for those not currently using Hive, adopting the Hive catalog would necessitate setting up and running a Hive metastore service. This requirement introduces an additional layer of infrastructure that might be better served using an option with additional features.&lt;/p&gt;
&lt;h3&gt;REST&lt;/h3&gt;
&lt;p&gt;The REST catalog represents a unique approach in the Apache Iceberg ecosystem, serving not as a standalone catalog but as a universal interface that can be adapted for any catalog type. It is based on an OpenAPI specification that can be implemented in any programming language, provided that the specified endpoints are adhered to. This flexibility allows for the creation of custom catalogs tailored to specific use cases, enabling developers and organizations to contribute new catalog types to the community without the need to develop bespoke support for a myriad of engines and tools.&lt;/p&gt;
&lt;p&gt;Among the catalogs utilizing this specification are Tabular, which offers a &amp;quot;headless warehouse&amp;quot; solution; Unity Catalog from Databricks, which primarily manages Delta Lake tables but also provides Iceberg table access through its UniForm feature; and Gravitino, an emerging open-source catalog project. The community&apos;s vision is that all catalogs will eventually interface through this REST specification, simplifying tool integration by requiring only a single interface. However, it&apos;s important to note that the specification is still evolving, with version 3 under discussion and development, which is anticipated to introduce additional endpoints for greater extensibility and the incorporation of custom behaviors.&lt;/p&gt;
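&lt;p&gt;The appeal of the REST interface is that, from a client&apos;s perspective, every compliant catalog looks the same: a base URL plus a fixed set of HTTP endpoints. The path shape below follows the general pattern of the v1 specification but should be treated as illustrative; consult the spec itself for the authoritative endpoint list.&lt;/p&gt;

```python
# Illustrative only: a client composes endpoint URLs from a base URL, an
# optional prefix, a namespace, and a table name. The same code works against
# any catalog that implements the REST specification.
def table_endpoint(base_url, prefix, namespace, table):
    # Loading metadata and committing updates both target this resource,
    # differing only in the HTTP method used.
    return f"{base_url}/v1/{prefix}/namespaces/{namespace}/tables/{table}"

url = table_endpoint("https://catalog.example.com", "prod", "db", "orders")
assert url == "https://catalog.example.com/v1/prod/namespaces/db/tables/orders"
```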
&lt;h3&gt;AWS Glue&lt;/h3&gt;
&lt;p&gt;The AWS Glue catalog is an integral component of the AWS ecosystem, providing a fully managed catalog service that integrates seamlessly with other AWS tools such as Redshift and AWS Athena. As a native AWS service, it offers a streamlined experience for users deeply embedded in the AWS infrastructure, ensuring compatibility and optimized performance across AWS services. A notable feature of the AWS Glue catalog is its support for auto-compaction of tables through the AWS Lake Formation service, enhancing data management and optimization. While the AWS Glue catalog is an excellent choice for those committed to the AWS platform, organizations operating in on-premises environments or across multiple cloud providers might benefit from considering a cloud-agnostic catalog to ensure flexibility and avoid vendor lock-in.&lt;/p&gt;
&lt;h3&gt;Snowflake Catalog&lt;/h3&gt;
&lt;p&gt;The Snowflake Iceberg catalog offers a unique integration, allowing Snowflake to manage Apache Iceberg tables that are stored externally on your storage system. This integration aligns with Snowflake&apos;s robust data management capabilities while offering the cost benefits of utilizing external storage. However, there are limitations to consider: all table creation, insertions, updates, and deletions must be conducted within Snowflake, as the Snowflake SDK currently only supports reading operations from Spark. While this setup allows users to leverage some cost savings by storing tables outside of Snowflake, it comes with a trade-off in terms of flexibility. Users do not have access to the full range of open-ecosystem tools for managing Snowflake-managed Iceberg tables, which could be a significant consideration for organizations that rely on a diverse set of data tools and platforms.&lt;/p&gt;
&lt;h3&gt;LakeFS Catalog&lt;/h3&gt;
&lt;p&gt;Initially, there was a compatibility issue between LakeFS, a file-versioning solution, and Apache Iceberg due to a fundamental difference in their design: Iceberg relies on absolute paths in its metadata to reference files, whereas LakeFS uses relative paths to manage different file versions. To bridge this gap, LakeFS introduced its own custom catalog, allowing it to integrate with Apache Iceberg.&lt;/p&gt;
&lt;p&gt;However, as with any custom catalog, there&apos;s a dependency on engine support, which, at the time of writing, appears to be limited to Apache Spark. While versioning is a powerful feature for data management, users looking to leverage versioning for their Iceberg tables might find the built-in table versioning features of Iceberg or the catalog-level versioning offered by Nessie to be more universally compatible and supported options, especially when considering broader ecosystem integration beyond LakeFS&apos;s Iceberg catalog.&lt;/p&gt;
&lt;h3&gt;Other&lt;/h3&gt;
&lt;p&gt;As the Apache Iceberg community evolves, there&apos;s a noticeable shift away from certain catalogs, primarily due to concerns like maintenance challenges or inconsistent implementations. Two such catalogs are the JDBC catalog and the DynamoDB catalog. The JDBC catalog, which enabled any JDBC-compatible database to function as an Iceberg catalog, is seeing reduced usage. This decline is likely due to the complexities and variances in how different databases implement JDBC, potentially leading to inconsistencies in catalog behavior. Similarly, the DynamoDB catalog, initially used in the early stages of AWS support for Iceberg, is also falling out of favor. The community&apos;s pivot away from these catalogs underscores a broader trend towards more robust, consistently supported, and feature-rich catalog options that align with the evolving needs and standards of Iceberg users and developers.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg stands as a transformative force in the data lakehouse landscape, offering a structured and efficient way to manage data on data lakes with features traditionally reserved for databases and data warehouses. The journey through the diverse world of Apache Iceberg catalogs highlights the importance of these components in ensuring data accessibility, consistency, and robust transactional support. From the integration-friendly Hive catalog to the innovative Nessie catalog that brings git-like versioning to data, each catalog serves a unique purpose and caters to different architectural needs and preferences.&lt;/p&gt;
&lt;p&gt;As we&apos;ve explored, choosing the right catalog is crucial, balancing factors like compatibility, functionality, and the specific context of your data ecosystem. Whether you&apos;re deeply embedded in the AWS infrastructure, leveraging the AWS Glue catalog, or exploring the versioning capabilities of LakeFS or Nessie, the decision should align with your strategic objectives and operational requirements.&lt;/p&gt;
&lt;p&gt;The evolution away from certain catalogs, like the JDBC and DynamoDB catalogs, underscores the community&apos;s drive towards more reliable, feature-rich, and consistent catalog implementations. This shift is a testament to the ongoing maturation of the Iceberg ecosystem and its users&apos; commitment to adopting practices and tools that enhance data reliability, scalability, and manageability.&lt;/p&gt;
&lt;p&gt;As Apache Iceberg continues to evolve, so too will its ecosystem of catalogs, each adapting to the emerging needs of data professionals seeking to harness the full potential of their data lakehouses. Embracing these tools and understanding their nuances will empower organizations to build more resilient, flexible, and efficient data architectures, paving the way for advanced analytics and data-driven decision-making.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html&quot;&gt;Get a Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Build an Iceberg Lakehouse on Your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
</content:encoded><author>Alex Merced</author></item><item><title>10 Reasons to Make Apache Iceberg and Dremio Part of Your Data Lakehouse Strategy</title><link>https://iceberglakehouse.com/posts/2024-3-10-reasons-to-make-dremio-part-of-your-data-lakehouse-strategy/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-3-10-reasons-to-make-dremio-part-of-your-data-lakehouse-strategy/</guid><description>
&gt; [Get a Free Copy of &quot;Apache Iceberg: The Definitive Guide&quot;](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html)

&gt; [Build an ...</description><pubDate>Fri, 01 Mar 2024 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html&quot;&gt;Get a Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Build an Iceberg Lakehouse on Your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Apache Iceberg&lt;/a&gt; is disrupting the data landscape, offering a new paradigm where data is not confined to the storage system of a chosen data warehouse vendor. Instead, it resides in your own storage, accessible by multiple tools. A &lt;a href=&quot;https://www.dremio.com/blog/why-lakehouse-why-now-what-is-a-data-lakehouse-and-how-to-get-started/&quot;&gt;data lakehouse, which is essentially a modular data warehouse&lt;/a&gt; built on your data lake as the storage layer, offers limitless configuration possibilities. Among the various options for constructing an Apache Iceberg lakehouse, the &lt;a href=&quot;https://www.dremio.com/solutions/data-lakehouse/&quot;&gt;Dremio Data Lakehouse Platform&lt;/a&gt; stands out as one of the most straightforward, rapid, and &lt;a href=&quot;https://www.dremio.com/blog/using-dremio-to-reduce-your-snowflake-data-warehouse-costs/&quot;&gt;cost-effective&lt;/a&gt; choices. This platform has gained popularity for &lt;a href=&quot;https://www.dremio.com/solutions/hadoop-migration/&quot;&gt;on-premises migrations&lt;/a&gt;, &lt;a href=&quot;https://www.dremio.com/solutions/data-mesh/&quot;&gt;implementing data mesh strategies&lt;/a&gt;, &lt;a href=&quot;https://www.dremio.com/blog/bi-dashboard-acceleration-cubes-extracts-and-dremios-reflections/&quot;&gt;enhancing BI dashboards&lt;/a&gt;, and more. In this article, we will explore 10 reasons why the combination of Apache Iceberg and the Dremio platform is exceptionally powerful. 
We will delve into five reasons to choose Apache Iceberg over other table formats and five reasons to opt for the Dremio platform when considering &lt;a href=&quot;https://www.dremio.com/platform/unified-analytics/&quot;&gt;Semantic Layers&lt;/a&gt;, &lt;a href=&quot;https://www.dremio.com/platform/sql-query-engine/&quot;&gt;Query Engines&lt;/a&gt;, and &lt;a href=&quot;https://www.dremio.com/platform/lakehouse-management/&quot;&gt;Lakehouse Management&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;5 Reasons to Choose Apache Iceberg Over Other Table Formats&lt;/h2&gt;
&lt;p&gt;Apache Iceberg is not the only table format available; &lt;a href=&quot;https://www.dremio.com/blog/exploring-the-architecture-of-apache-iceberg-delta-lake-and-apache-hudi/&quot;&gt;Delta Lake and Apache Hudi are also key players in this domain&lt;/a&gt;. All three formats provide a core set of features, enabling database-like tables on your data lake with capabilities such as ACID transactions, time-travel, and schema evolution. However, there are several unique aspects that make Apache Iceberg a noteworthy option to consider.&lt;/p&gt;
&lt;h3&gt;1. Partition Evolution&lt;/h3&gt;
&lt;p&gt;Apache Iceberg distinguishes itself with a feature known as &lt;a href=&quot;https://www.dremio.com/subsurface/future-proof-partitioning-and-fewer-table-rewrites-with-apache-iceberg/&quot;&gt;partition evolution, which allows users to modify their partitioning scheme at any time without the need to rewrite the entire table&lt;/a&gt;. This capability is unique to Iceberg and carries significant implications, particularly for tables at the petabyte scale where altering partitioning can be a complex and costly process. Partition evolution facilitates the optimization of data management, as it enables users to easily revert any changes to the partitioning scheme by simply rolling back to a previous snapshot of the table. This flexibility is a considerable advantage in managing large-scale data efficiently.&lt;/p&gt;
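&lt;p&gt;The reason partition evolution avoids rewrites is that every data file remembers which partition spec it was written under. A toy model of the metadata makes this clear; this is a conceptual sketch, not Iceberg&apos;s actual metadata structures.&lt;/p&gt;

```python
# Partition evolution sketch: each data file carries the id of the partition
# spec it was written under, so changing the current spec is a metadata-only
# operation and never rewrites existing files.
table = {
    "specs": {0: "day(ts)"},
    "current_spec": 0,
    "files": [{"path": "f1.parquet", "spec_id": 0}],
}

def evolve_partition_spec(table, new_spec):
    spec_id = max(table["specs"]) + 1
    table["specs"][spec_id] = new_spec
    table["current_spec"] = spec_id     # old files are left untouched

def append_file(table, path):
    table["files"].append({"path": path, "spec_id": table["current_spec"]})

evolve_partition_spec(table, "hour(ts)")
append_file(table, "f2.parquet")
assert table["files"][0]["spec_id"] == 0   # old file keeps its old layout
assert table["files"][1]["spec_id"] == 1   # new data uses the new scheme
```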
&lt;h3&gt;2. Hidden Partitioning&lt;/h3&gt;
&lt;p&gt;Apache Iceberg introduces a unique feature called &lt;a href=&quot;https://www.dremio.com/subsurface/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/&quot;&gt;hidden partitioning&lt;/a&gt;, which significantly simplifies the workflows for both data engineers and data analysts. In traditional partitioning approaches, data engineers often need to create additional partitioning columns derived from existing ones, which not only increases storage requirements but also complicates the data ingestion process. Additionally, data analysts must be cognizant of these extra columns; failing to filter by the partition column could lead to a full table scan, undermining efficiency.&lt;/p&gt;
&lt;p&gt;However, with Apache Iceberg&apos;s hidden partitioning, the system can partition tables based on the transformed value of a column, with this transformation tracked in the metadata, eliminating the need for physical partitioning columns in the data files. This means that analysts can apply filters directly on the original columns and still benefit from the optimized performance of partitioning. This feature streamlines operations for both data engineers and analysts, making the process more efficient and less prone to error.&lt;/p&gt;
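&lt;p&gt;A small simulation shows why analysts never need to know the partition column exists: the table partitions on a transform of a column (here, months since the epoch, in the spirit of Iceberg&apos;s month transform), and a filter on the raw column is mapped through the same transform to prune partitions.&lt;/p&gt;

```python
import datetime

# Hidden partitioning sketch: the table is partitioned on months(ts), a
# transform of the ts column, tracked only in metadata.
def months(ts):
    return (ts.year - 1970) * 12 + (ts.month - 1)

partitions = {}   # partition value to list of rows
for day in (datetime.date(2024, 1, 5), datetime.date(2024, 1, 20),
            datetime.date(2024, 2, 2)):
    partitions.setdefault(months(day), []).append(day)

# The analyst filters on ts itself; the engine maps the filter through the
# transform and scans only the matching partition.
wanted = datetime.date(2024, 2, 2)
scanned = partitions[months(wanted)]
assert scanned == [datetime.date(2024, 2, 2)]   # one partition scanned
assert len(partitions) == 2                     # January was pruned entirely
```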
&lt;h3&gt;3. Versioning&lt;/h3&gt;
&lt;p&gt;Versioning is an invaluable feature that facilitates isolating changes, executing rollbacks, simultaneously publishing numerous changes across different objects, and creating zero-copy environments for experimentation and development. While each table format records a single chain of changes, allowing for rollbacks, Apache Iceberg uniquely incorporates branching, tagging, and merging as integral aspects of its core table format. Furthermore, Apache Iceberg stands out as the sole format presently compatible with &lt;a href=&quot;https://www.dremio.com/blog/what-is-nessie-catalog-versioning-and-git-for-data/&quot;&gt;Nessie, an open-source project that extends versioning capabilities to include commits, branches, tags, and merges at the multi-table catalog level&lt;/a&gt;, thereby unlocking a plethora of new possibilities.&lt;/p&gt;
&lt;p&gt;These advanced versioning features in Apache Iceberg are accessible through ergonomic SQL interfaces, making them user-friendly and easily integrated into data workflows. In contrast, other formats typically rely on file-level versioning, which necessitates the use of command-line interfaces (CLIs) and imperative programming for management, making them less approachable and more cumbersome to use. This distinction underscores Apache Iceberg&apos;s advanced capabilities and its potential to significantly enhance data management practices.&lt;/p&gt;
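&lt;p&gt;Table-level versioning boils down to a simple structure: snapshots form a parent-linked chain, and named references (branches and tags) point into it, so a rollback is just moving a pointer. The sketch below is a toy model of that idea, not Iceberg&apos;s snapshot format.&lt;/p&gt;

```python
# Versioning sketch: snapshots chain to their parents; refs name positions
# in the chain. Branches move as commits land; tags are frozen.
snapshots = {}
refs = {"main": None}

def commit(branch, snapshot_id, data):
    snapshots[snapshot_id] = {"parent": refs[branch], "data": data}
    refs[branch] = snapshot_id

commit("main", 101, "rows v1")
commit("main", 102, "rows v2")
refs["audit-2024"] = 102          # tag: a named, frozen snapshot
commit("main", 103, "bad rows")

# Rollback: move the branch ref back to the previous snapshot.
refs["main"] = snapshots[refs["main"]]["parent"]
assert refs["main"] == 102
assert snapshots[refs["audit-2024"]]["data"] == "rows v2"
```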
&lt;h3&gt;4. Lakehouse Management&lt;/h3&gt;
&lt;p&gt;Apache Iceberg is attracting a growing roster of vendors eager to assist in &lt;a href=&quot;https://www.dremio.com/blog/what-is-lakehouse-management-git-for-data-automated-apache-iceberg-table-maintenance-and-more/&quot;&gt;managing tables, offering services such as compaction, sorting, snapshot cleanup, and more&lt;/a&gt;. This support makes using Iceberg tables as straightforward as utilizing tables in a traditional database or data warehouse. In contrast, other table formats typically rely on a single tool or vendor for data management, which can lead to vendor lock-in. With Iceberg, however, there is a diverse array of vendors, including Dremio, Tabular, Upsolver, AWS, and Snowflake, each providing varying levels of table management features. This variety gives users the flexibility to choose a vendor that best fits their needs, enhancing Iceberg&apos;s appeal as a versatile and user-friendly data management solution.&lt;/p&gt;
&lt;h3&gt;5. Open Culture&lt;/h3&gt;
&lt;p&gt;One of the most persuasive arguments for adopting Apache Iceberg is its dynamic open-source culture, which permeates its development and ecosystem. Development discussions take place on publicly accessible mailing lists and emails, enabling anyone to participate in and influence the format&apos;s evolution. The ecosystem is expanding daily, with an increasing number of tools offering both read and write support, reflecting the growing enthusiasm among vendors. This open environment provides vendors with the confidence to invest their resources in supporting Iceberg, knowing they are not at the mercy of a single vendor who could unpredictably alter or restrict access to the format. This level of transparency and inclusivity not only fosters innovation and collaboration but also ensures a level of stability and predictability that is highly valued in the tech industry.&lt;/p&gt;
&lt;h2&gt;Dremio&lt;/h2&gt;
&lt;p&gt;Dremio is a comprehensive &lt;a href=&quot;https://www.dremio.com/blog/what-is-a-data-lakehouse-platform/&quot;&gt;data lakehouse platform that consolidates numerous functionalities, typically offered by different vendors, into a single solution&lt;/a&gt;. It unifies data analytics through data virtualization and a semantic layer, streamlining the integration and interpretation of data from diverse sources. Dremio&apos;s robust SQL query engine is capable of federating queries across various data sources, offering transparent acceleration to enhance performance. Additionally, Dremio&apos;s suite of lakehouse management features includes a Nessie-powered data catalog, which ensures data is versioned and easily transportable, alongside automated table maintenance capabilities. This integration of multiple key features into one platform simplifies the data management process, making Dremio a powerful and efficient tool for organizations looking to harness the full potential of their data lakehouse.&lt;/p&gt;
&lt;h3&gt;5. Apache Arrow&lt;/h3&gt;
&lt;p&gt;One of the key reasons Dremio&apos;s SQL query engine outperforms other distributed query engines and data warehouses is its core reliance on &lt;a href=&quot;https://www.dremio.com/blog/the-origins-of-apache-arrow-its-fit-in-todays-data-landscape/&quot;&gt;Apache Arrow, an in-memory data format increasingly recognized as the de facto standard for analytical processing&lt;/a&gt;. Apache Arrow facilitates the swift and efficient loading of data from various sources into a unified format optimized for speedy processing. Moreover, it introduces a transport protocol known as Apache Arrow Flight, which significantly reduces serialization/deserialization bottlenecks often encountered when transferring data over the network within a distributed system or between different systems. This integration of Apache Arrow at the heart of Dremio&apos;s architecture enhances its query performance, making it a powerful tool for data analytics.&lt;/p&gt;
&lt;h3&gt;4. Columnar Cloud Cache&lt;/h3&gt;
&lt;p&gt;One common bottleneck in querying a data lake based on object storage is the latency experienced when retrieving a large number of objects from storage. Additionally, each individual request can incur a cost, contributing to the overall storage access expenses. Dremio addresses these challenges with its &lt;a href=&quot;https://www.dremio.com/blog/how-dremio-delivers-fast-queries-on-object-storage-apache-arrow-reflections-and-the-columnar-cloud-cache/&quot;&gt;C3 (Columnar Cloud Cache) feature, which caches frequently accessed data on the NVMe memory of nodes within the Dremio cluster&lt;/a&gt;. This caching mechanism enables rapid access to data during subsequent query executions that require the same information. As a result, the more queries that are run, the more efficient Dremio becomes. This not only enhances query performance over time but also reduces costs, making Dremio an increasingly cost-effective and faster solution as usage grows. This anti-fragile nature of Dremio, where it strengthens and improves with stress or demand, is a significant advantage for organizations looking to optimize their data querying capabilities.&lt;/p&gt;
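&lt;p&gt;The effect of a cache like C3 can be shown with a toy model: hot objects are kept on fast local storage, so repeated queries skip the object store and its per-request cost. This sketch only illustrates the caching principle, not C3&apos;s actual implementation.&lt;/p&gt;

```python
# Cache sketch: count how many reads actually reach the object store.
object_store_reads = 0

def fetch_from_object_store(key):
    global object_store_reads
    object_store_reads += 1          # the slow, per-request-billed path
    return f"bytes of {key}"

cache = {}
def read(key):
    if key not in cache:
        cache[key] = fetch_from_object_store(key)
    return cache[key]                # subsequent reads hit local storage

for _ in range(3):
    read("s3://lake/orders/part-0.parquet")

assert object_store_reads == 1       # two of the three reads were cache hits
```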
&lt;h3&gt;3. Reflections&lt;/h3&gt;
&lt;p&gt;Other engines often rely on materialized views and BI extracts to accelerate queries, which can require significant manual effort to maintain. This process creates a broader array of objects that data analysts must track and understand when to use. Moreover, many platforms cannot offer this acceleration across all their compatible data sources.&lt;/p&gt;
&lt;p&gt;In contrast, Dremio introduces a unique feature called Reflections, which simplifies query acceleration without adding to the management workload of engineers or expanding the number of namespaces analysts need to be aware of. Reflections can be applied to any table or view within Dremio, allowing for the materialization of rows or the aggregation of calculations on the dataset.&lt;/p&gt;
&lt;p&gt;For data engineers, Dremio automates the management of these materializations, treating them as Iceberg tables that can be intelligently substituted when a query that would benefit from them is detected. Data analysts, on the other hand, continue to query tables and build dashboards as usual, without needing to navigate additional namespaces. They will, however, experience noticeable performance improvements immediately, without any extra effort. This streamlined approach not only enhances efficiency but also significantly reduces the complexity typically associated with optimizing query performance.&lt;/p&gt;
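&lt;p&gt;The substitution idea behind Reflections can be sketched in a few lines: the engine maintains a precomputed aggregation, and when a query is covered by it, the answer comes from the materialization instead of a table scan. This toy model illustrates the concept only; Dremio&apos;s matching logic is far more general.&lt;/p&gt;

```python
# Reflection-style substitution sketch.
sales = [("eu", 10), ("eu", 5), ("us", 7)]

# An "aggregation reflection": totals per region, kept up to date by the
# engine rather than by analysts.
reflection = {}
for region, amount in sales:
    reflection[region] = reflection.get(region, 0) + amount

def total_for(region):
    if region in reflection:
        return reflection[region]    # substituted: no scan of the base table
    # Fallback: compute from the base table when no reflection covers it.
    return sum(a for r, a in sales if r == region)

assert total_for("eu") == 15
assert total_for("us") == 7
```

&lt;p&gt;The key point is that the analyst&apos;s query is unchanged in both branches; only the execution path differs.&lt;/p&gt;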
&lt;h3&gt;2. Semantic Layer&lt;/h3&gt;
&lt;p&gt;Many query engines and data warehouses lack the capability to offer an organized, user-friendly interface for end users, a feature known as a semantic layer. This layer is crucial for providing logical, intuitive views for understanding and discovering data. Without this feature, organizations often find themselves needing to integrate services from additional vendors, which can introduce a complex web of dependencies and potential conflicts to manage.&lt;/p&gt;
&lt;p&gt;Dremio stands out by incorporating an easy-to-use semantic layer within its lakehouse platform. This feature allows users to organize and document data from all sources into a single, coherent layer, facilitating data discovery. Beyond organization, Dremio enables robust data governance through role-based, column-based, and row-based access controls, ensuring users can only access the data they are permitted to view.&lt;/p&gt;
&lt;p&gt;This semantic layer enhances collaboration across data teams, offering a unified access point that supports the implementation of data-centric architectures like data mesh. By streamlining data access and collaboration, Dremio not only makes data more discoverable and understandable but also ensures a secure and controlled data environment, aligning with best practices in data management and governance.&lt;/p&gt;
&lt;h3&gt;1. Hybrid Architecture&lt;/h3&gt;
&lt;p&gt;Many contemporary data tools focus predominantly on cloud-based data, sidelining the vast reserves of on-premise data that cannot leverage these modern solutions. Dremio, however, stands out by offering the capability to access on-premise data sources in addition to cloud data. This flexibility allows Dremio to unify on-premise and cloud data sources, facilitating seamless migrations between different systems. With Dremio, organizations can enhance their on-premise data by integrating it with the wealth of data available in cloud data marketplaces, all without the need for data movement. This approach not only broadens the scope of data resources available to businesses but also enables a more integrated and comprehensive data strategy, accommodating the needs of organizations with diverse data environments.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg and Dremio are spearheading a transformative shift in data management and analysis. Apache Iceberg&apos;s innovative features, such as partition evolution, hidden partitioning, advanced versioning, and an open-source culture, set it apart in the realm of data table formats, offering flexibility, efficiency, and a collaborative development environment. On the other hand, Dremio&apos;s data lakehouse platform leverages these strengths and further enhances the data management experience with its integrated SQL query engine, semantic layer, and unique features like Reflections and the C3 Columnar Cloud Cache.&lt;/p&gt;
&lt;p&gt;By providing a unified platform that addresses the challenges of both on-premise and cloud data, Dremio eliminates the complexity and fragmentation often associated with data analytics. Its ability to streamline data processing, ensure robust data governance, and facilitate seamless integration across diverse data sources makes it an invaluable asset for organizations aiming to leverage their data for insightful analytics and informed decision-making.&lt;/p&gt;
&lt;p&gt;Together, Apache Iceberg and Dremio not only offer a robust foundation for data management but also embody the future of data analytics, where accessibility, efficiency, and collaboration are key. Whether you&apos;re a data engineer looking to optimize data storage and retrieval or a data analyst seeking intuitive and powerful data exploration tools, this combination presents a compelling solution in the ever-evolving landscape of data technology.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html&quot;&gt;Get a Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Build an Iceberg Lakehouse on Your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
</content:encoded><author>Alex Merced</author></item><item><title>A deep dive into the concept and world of Apache Iceberg Catalogs</title><link>https://iceberglakehouse.com/posts/2024-3-deep-dive-into-apache-iceberg-catalogs/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-3-deep-dive-into-apache-iceberg-catalogs/</guid><description>
&gt; [Get a Free Copy of &quot;Apache Iceberg: The Definitive Guide&quot;](https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html)

&gt; [Build an ...</description><pubDate>Fri, 01 Mar 2024 09:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html&quot;&gt;Get a Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Build an Iceberg Lakehouse on Your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Apache Iceberg is an open-source table format&lt;/a&gt; designed for data lakehouse architectures, enabling the organization of data on data lakes in a manner similar to tables found in traditional databases and data warehouses. This innovative table format provides a crucial abstraction layer, allowing users to leverage database-like features on their data lakes. Among its key features are ACID transactions, which ensure data integrity and consistency, time travel capabilities that allow users to access historical data snapshots, and robust table evolution mechanisms for managing partitions and schema changes. By integrating these functionalities, &lt;a href=&quot;https://iceberg.apache.org/&quot;&gt;Apache Iceberg&lt;/a&gt; transforms data lakes into more structured and manageable environments, facilitating advanced data analytics and management tasks.&lt;/p&gt;
&lt;h2&gt;The Role of Catalogs in Apache Iceberg&lt;/h2&gt;
&lt;p&gt;The catalog mechanism is a cornerstone in the functionality of Apache Iceberg tables, providing a crucial layer of organization and accessibility, even though the specifics of its implementation are beyond the Iceberg specification. A catalog in Iceberg serves as a registry for tables, tracking Iceberg tables to ensure they are discoverable by compatible tools, thereby facilitating seamless integration and usage. Moreover, catalogs are instrumental in maintaining a consistent view of the data, which is essential for the integrity and reliability of ACID transactions. By acting as the source of truth, catalogs enable concurrent transactions to reference them and &lt;a href=&quot;https://www.dremio.com/subsurface/the-life-of-a-write-query-for-apache-iceberg-tables/&quot;&gt;ascertain the status of other ongoing transactions, effectively determining if another transaction was committed while theirs was in progress&lt;/a&gt;. This mechanism is vital for maintaining data consistency and ensuring that the system provides ACID guarantees, thereby enhancing the robustness and reliability of data operations within the Iceberg ecosystem.&lt;/p&gt;
&lt;h3&gt;Challenges with Catalogs and Choosing the Right Catalog for You&lt;/h3&gt;
&lt;p&gt;When selecting a catalog for Apache Iceberg, several key factors should guide your decision: compatibility with your current tools, the additional functionalities or integrations the catalog offers, and its maintainability. It&apos;s crucial to choose only one catalog for managing your tables because, upon the completion of a transaction, only the active catalog is updated. Utilizing multiple catalogs could result in them referencing outdated table states, leading to consistency issues. If there is an absolute need to employ multiple catalogs, a feasible approach is to designate a single catalog for write operations while others are used solely for reading. However, this setup demands the implementation of custom systems to synchronize the read-only catalogs with the primary to ensure they reflect the most current table state, maintaining consistency to the greatest extent possible.&lt;/p&gt;
&lt;h3&gt;Service and File-System Catalogs&lt;/h3&gt;
&lt;p&gt;The distinction between a &lt;a href=&quot;https://www.youtube.com/watch?v=4hcfveg1t70&quot;&gt;service catalog and a file-system&lt;/a&gt; catalog in Apache Iceberg is fundamental to understanding their operational dynamics and use cases. Service catalogs, which constitute the majority, involve a running service that can be either self-managed or cloud-managed. These catalogs utilize a backing store to maintain all references to Iceberg tables, with locking mechanisms in place to enforce ACID guarantees. This setup ensures that when a table is modified, the catalog updates references accurately, preventing conflicting changes from being committed.&lt;/p&gt;
&lt;p&gt;On the other hand, the &amp;quot;Hadoop Catalog&amp;quot; represents a file-system catalog that is compatible with any storage system. Unlike service catalogs, the Hadoop Catalog does not rely on a backing store but instead uses a file named &amp;quot;version-hint.text&amp;quot; on the file system to track the latest version of the table. This file must be updated whenever the table changes. However, since not all storage systems provide the same level of atomicity and consistency in file replacement, this method can lead to potential inconsistencies, especially in environments with high concurrency. Therefore, while the Hadoop Catalog might be suitable for evaluating Iceberg&apos;s capabilities, it is generally not recommended for production use due to these potential consistency issues.&lt;/p&gt;
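&lt;p&gt;To make the distinction concrete, the following Python sketch builds the Spark session properties that register each kind of catalog. The catalog name &amp;quot;demo&amp;quot;, the warehouse path, and the metastore endpoint are placeholders; the property names follow Iceberg&apos;s documented Spark catalog configuration.&lt;/p&gt;

```python
def catalog_properties(name, catalog_type, warehouse):
    """Build the Spark session properties that register an Iceberg catalog."""
    prefix = "spark.sql.catalog." + name
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        prefix + ".type": catalog_type,
        prefix + ".warehouse": warehouse,
    }

# A file-system ("hadoop") catalog needs only a warehouse path; the
# version-hint.text file under each table's metadata directory tracks
# the current table version.
hadoop = catalog_properties("demo", "hadoop", "s3://my-bucket/warehouse")

# A service catalog (here Hive) additionally points at the running service.
hive = catalog_properties("demo", "hive", "s3://my-bucket/warehouse")
hive["spark.sql.catalog.demo.uri"] = "thrift://metastore:9083"
```

&lt;p&gt;The only structural difference is the extra service endpoint; the consistency guarantees, however, differ substantially, as described above.&lt;/p&gt;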
&lt;h2&gt;A tour of Apache Iceberg Catalogs&lt;/h2&gt;
&lt;p&gt;In the following section, we will delve into the diverse range of catalogs currently available within the Apache Iceberg ecosystem. Each catalog offers unique features, integrations, and compatibility with different data storage systems and processing engines. Understanding the nuances of these catalogs is crucial for architects and developers to make informed decisions that align with their specific data infrastructure needs and operational goals. We&apos;ll explore a variety of catalogs, including those widely adopted in the industry as well as emerging options, highlighting their respective advantages, use cases, and limitations.&lt;/p&gt;
&lt;h3&gt;Nessie&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://projectnessie.org/&quot;&gt;Nessie is an innovative open-source catalog&lt;/a&gt; that extends beyond the traditional catalog capabilities in the Apache Iceberg ecosystem, introducing &lt;a href=&quot;https://www.dremio.com/blog/what-is-nessie-catalog-versioning-and-git-for-data/&quot;&gt;git-like features to data management&lt;/a&gt;. This catalog not only tracks table metadata but also allows users to capture commits at a holistic level, enabling advanced operations such as multi-table transactions, rollbacks, branching, and tagging. These features provide a new layer of flexibility and control over data changes, resembling version control systems in software development.&lt;/p&gt;
&lt;p&gt;Nessie can be either self-managed or cloud-managed, with the latter option available through the Dremio Lakehouse Platform. Dremio integrates Nessie into its &lt;a href=&quot;https://www.dremio.com/blog/managing-data-as-code-with-dremio-arctic-easily-ensure-data-quality-in-your-data-lakehouse/&quot;&gt;Dremio Cloud product&lt;/a&gt;, offering a seamless experience that includes automated table optimization alongside Nessie&apos;s robust cataloging capabilities. This integration underscores Nessie&apos;s versatility and its potential to enhance data governance and management in modern data architectures.&lt;/p&gt;
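&lt;p&gt;As a rough sketch (the endpoint, warehouse path, and branch name are placeholders), registering a self-managed Nessie server as an Iceberg catalog in Spark looks like the following property set, with the &amp;quot;ref&amp;quot; property selecting which branch reads and writes target; the property names follow the Nessie documentation at the time of writing.&lt;/p&gt;

```python
# Spark session properties for a Nessie-backed Iceberg catalog (sketch).
nessie_props = {
    "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
    # Nessie plugs in as a custom catalog implementation rather than a
    # built-in "type".
    "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
    "spark.sql.catalog.nessie.uri": "http://localhost:19120/api/v1",
    "spark.sql.catalog.nessie.ref": "main",  # branch the session works against
    "spark.sql.catalog.nessie.warehouse": "s3://my-bucket/warehouse",
}
```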
&lt;h3&gt;Hive&lt;/h3&gt;
&lt;p&gt;The Hive catalog offers a seamless integration pathway for organizations already utilizing Hive in their data architectures, allowing them to leverage their existing Hive metastore as an Apache Iceberg catalog. This integration facilitates a smooth transition to Iceberg&apos;s advanced features while maintaining compatibility with the existing Hive ecosystem. By using the Hive catalog, users can avoid the redundancy of maintaining separate metadata stores, streamlining their data management processes. However, for those not currently using Hive, adopting the Hive catalog would necessitate setting up and running a Hive metastore service. This requirement introduces an additional layer of infrastructure; organizations in that position might be better served by a catalog option that offers additional features.&lt;/p&gt;
&lt;h3&gt;REST&lt;/h3&gt;
&lt;p&gt;The REST catalog represents a unique approach in the Apache Iceberg ecosystem, serving not as a standalone catalog but as a universal interface that can be adapted for any catalog type. It is based on an OpenAPI specification that can be implemented in any programming language, provided that the specified endpoints are adhered to. This flexibility allows for the creation of custom catalogs tailored to specific use cases, enabling developers and organizations to contribute new catalog types to the community without the need to develop bespoke support for a myriad of engines and tools.&lt;/p&gt;
&lt;p&gt;Among the catalogs utilizing this specification are Tabular, which offers a &amp;quot;headless warehouse&amp;quot; solution; Unity Catalog from Databricks, which primarily manages Delta Lake tables but also provides Iceberg table access through its UniForm feature; and Gravitino, an emerging open-source catalog project. The community&apos;s vision is that all catalogs will eventually interface through this REST specification, simplifying tool integration by requiring only a single interface. However, it&apos;s important to note that the specification is still evolving, with version 3 under discussion and development, which is anticipated to introduce additional endpoints for greater extensibility and the incorporation of custom behaviors.&lt;/p&gt;
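&lt;p&gt;Part of the REST catalog&apos;s appeal is that connecting to it requires little more than an endpoint. A minimal sketch of the Spark properties follows; the catalog name and URI are hypothetical, and the property names follow Iceberg&apos;s documented Spark configuration.&lt;/p&gt;

```python
# Spark session properties for a REST-specification Iceberg catalog (sketch).
rest_props = {
    "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lake.type": "rest",
    # Whatever service implements the REST spec sits behind this endpoint.
    "spark.sql.catalog.lake.uri": "https://catalog.example.com/api",
}
```

&lt;p&gt;Because the interface is standardized, swapping one REST-compatible catalog for another is, in principle, just a change of URI rather than a change of client library.&lt;/p&gt;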
&lt;h3&gt;AWS Glue&lt;/h3&gt;
&lt;p&gt;The AWS Glue catalog is an integral component of the AWS ecosystem, providing a fully managed catalog service that integrates seamlessly with other AWS tools such as Redshift and AWS Athena. As a native AWS service, it offers a streamlined experience for users deeply embedded in the AWS infrastructure, ensuring compatibility and optimized performance across AWS services. A notable feature of the AWS Glue catalog is its support for auto-compaction of tables through the AWS Lake Formation service, enhancing data management and optimization. While the AWS Glue catalog is an excellent choice for those committed to the AWS platform, organizations operating in on-premises environments or across multiple cloud providers might benefit from considering a cloud-agnostic catalog to ensure flexibility and avoid vendor lock-in.&lt;/p&gt;
&lt;h3&gt;Snowflake Catalog&lt;/h3&gt;
&lt;p&gt;The Snowflake Iceberg catalog offers a unique integration, allowing Snowflake to manage Apache Iceberg tables that are stored externally on your storage system. This integration aligns with Snowflake&apos;s robust data management capabilities while offering the cost benefits of utilizing external storage. However, there are limitations to consider: all table creation, insertions, updates, and deletions must be conducted within Snowflake, as the Snowflake SDK currently only supports reading operations from Spark. While this setup allows users to leverage some cost savings by storing tables outside of Snowflake, it comes with a trade-off in terms of flexibility. Users do not have access to the full range of open-ecosystem tools for managing Snowflake-managed Iceberg tables, which could be a significant consideration for organizations that rely on a diverse set of data tools and platforms.&lt;/p&gt;
&lt;h3&gt;LakeFS Catalog&lt;/h3&gt;
&lt;p&gt;Initially, there was a compatibility issue between LakeFS, a file-versioning solution, and Apache Iceberg due to a fundamental difference in their design: Iceberg relies on absolute paths in its metadata to reference files, whereas LakeFS uses relative paths to manage different file versions. To bridge this gap, LakeFS introduced its own custom catalog, allowing it to integrate with Apache Iceberg.&lt;/p&gt;
&lt;p&gt;However, as with any custom catalog, there&apos;s a dependency on engine support, which, at the time of writing, appears to be limited to Apache Spark. While versioning is a powerful feature for data management, users looking to leverage versioning for their Iceberg tables might find the built-in table versioning features of Iceberg or the catalog-level versioning offered by Nessie to be more universally compatible and supported options, especially when considering broader ecosystem integration beyond LakeFS&apos;s Iceberg catalog.&lt;/p&gt;
&lt;h3&gt;Other&lt;/h3&gt;
&lt;p&gt;As the Apache Iceberg community evolves, there&apos;s a noticeable shift away from certain catalogs, primarily due to concerns like maintenance challenges or inconsistent implementations. Two such catalogs are the JDBC catalog and the DynamoDB catalog. The JDBC catalog, which enabled any JDBC-compatible database to function as an Iceberg catalog, is seeing reduced usage. This decline is likely due to the complexities and variances in how different databases implement JDBC, potentially leading to inconsistencies in catalog behavior. Similarly, the DynamoDB catalog, initially used in the early stages of AWS support for Iceberg, is also falling out of favor. The community&apos;s pivot away from these catalogs underscores a broader trend towards more robust, consistently supported, and feature-rich catalog options that align with the evolving needs and standards of Iceberg users and developers.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Apache Iceberg stands as a transformative force in the data lakehouse landscape, offering a structured and efficient way to manage data on data lakes with features traditionally reserved for databases and data warehouses. The journey through the diverse world of Apache Iceberg catalogs highlights the importance of these components in ensuring data accessibility, consistency, and robust transactional support. From the integration-friendly Hive catalog to the innovative Nessie catalog that brings git-like versioning to data, each catalog serves a unique purpose and caters to different architectural needs and preferences.&lt;/p&gt;
&lt;p&gt;As we&apos;ve explored, choosing the right catalog is crucial, balancing factors like compatibility, functionality, and the specific context of your data ecosystem. Whether you&apos;re deeply embedded in the AWS infrastructure, leveraging the AWS Glue catalog, or exploring the versioning capabilities of LakeFS or Nessie, the decision should align with your strategic objectives and operational requirements.&lt;/p&gt;
&lt;p&gt;The evolution away from certain catalogs, like the JDBC and DynamoDB catalogs, underscores the community&apos;s drive towards more reliable, feature-rich, and consistent catalog implementations. This shift is a testament to the ongoing maturation of the Iceberg ecosystem and its users&apos; commitment to adopting practices and tools that enhance data reliability, scalability, and manageability.&lt;/p&gt;
&lt;p&gt;As Apache Iceberg continues to evolve, so too will its ecosystem of catalogs, each adapting to the emerging needs of data professionals seeking to harness the full potential of their data lakehouses. Embracing these tools and understanding their nuances will empower organizations to build more resilient, flexible, and efficient data architectures, paving the way for advanced analytics and data-driven decision-making.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html&quot;&gt;Get a Free Copy of &amp;quot;Apache Iceberg: The Definitive Guide&amp;quot;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Build an Iceberg Lakehouse on Your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
</content:encoded><author>Alex Merced</author></item><item><title>What is the Data Lakehouse and the Role of Apache Iceberg, Nessie and Dremio?</title><link>https://iceberglakehouse.com/posts/2024-2-what-a-data-lakehouse-and-role-of-dremio-iceberg-nessie/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-2-what-a-data-lakehouse-and-role-of-dremio-iceberg-nessie/</guid><description>
Organizations are constantly seeking more efficient, scalable, and flexible solutions to manage their ever-growing data assets. This quest has led to...</description><pubDate>Wed, 21 Feb 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Organizations are constantly seeking more efficient, scalable, and flexible solutions to manage their ever-growing data assets. This quest has led to the development of the &lt;a href=&quot;https://www.dremio.com/blog/why-lakehouse-why-now-what-is-a-data-lakehouse-and-how-to-get-started/&quot;&gt;data lakehouse&lt;/a&gt;, a novel architecture that promises to revolutionize the way businesses store, access, and analyze data. By combining the strengths of data lakes and data warehouses, data lakehouses offer a unified platform that addresses the limitations of its predecessors. This blog post delves into the essence of a data lakehouse, explores the significance of &lt;a href=&quot;https://www.dremio.com/blog/exploring-the-architecture-of-apache-iceberg-delta-lake-and-apache-hudi/&quot;&gt;table formats&lt;/a&gt;, and introduces &lt;a href=&quot;https://www.dremio.com/blog/open-source-and-the-data-lakehouse-apache-arrow-apache-iceberg-nessie-and-dremio/&quot;&gt;Apache Iceberg and Nessie—two cutting-edge technologies&lt;/a&gt; that are shaping the future of data management.&lt;/p&gt;
&lt;h2&gt;What is a Data Lakehouse?&lt;/h2&gt;
&lt;p&gt;A data lakehouse is an innovative architecture that merges the expansive storage capabilities of data lakes with the structured querying and transactional features of data warehouses. This hybrid model is designed to support both the vast data volumes typical of big data initiatives and the sophisticated analytics usually reserved for structured data environments. Data lakehouses aim to provide a single, coherent platform for all types of data analysis, from real-time analytics to machine learning, without sacrificing performance, scalability, or cost-effectiveness.&lt;/p&gt;
&lt;h3&gt;The data lakehouse architecture is distinguished by several key characteristics:&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Unified Data Management&lt;/strong&gt;: It eliminates the traditional boundaries between data lakes and warehouses, offering a consolidated view of all data assets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cost-Efficiency&lt;/strong&gt;: By leveraging low-cost storage solutions and optimizing query execution, data lakehouses reduce the overall expense of data storage and analysis.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: The architecture effortlessly scales to accommodate growing data volumes and complex analytical workloads, ensuring that performance remains consistent as demands increase.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Real-Time Analytics and AI&lt;/strong&gt;: Data lakehouses enable direct analytics on raw and structured data, supporting advanced use cases like real-time decision-making and artificial intelligence applications.&lt;/p&gt;
&lt;h2&gt;Understanding Table Formats and their Importance&lt;/h2&gt;
&lt;p&gt;The concept of a table format is fundamental in data storage and analysis, referring to the method by which data is structured and organized within a database or storage system. An effective table format is critical for optimizing data accessibility, query performance, and storage efficiency. In the data lakehouse world, table formats ensure that data can be easily read, written, and processed by various analytical tools and applications.&lt;/p&gt;
&lt;h3&gt;Table formats play several vital roles in data management:&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Efficient Data Retrieval&lt;/strong&gt;: They enable quick and efficient data access by organizing data in a way that aligns with common query patterns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Integrity and Reliability&lt;/strong&gt;: Proper table formats help maintain data accuracy and consistency, ensuring that data remains reliable and trustworthy over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;: They support the ability to evolve data schema over time without disrupting existing data, allowing for the addition of new fields and the modification of existing structures as business requirements change.&lt;/p&gt;
&lt;p&gt;In the context of data lakehouses, &lt;a href=&quot;https://www.dremio.com/blog/comparison-of-data-lake-table-formats-apache-iceberg-apache-hudi-and-delta-lake/&quot;&gt;selecting the right table format is even more critical&lt;/a&gt;, as it directly impacts the system&apos;s ability to deliver on the promise of high-performance analytics on diverse datasets.&lt;/p&gt;
&lt;h2&gt;Apache Iceberg - The Standard in Lakehouse Tables&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://iceberg.apache.org/&quot;&gt;Apache Iceberg&lt;/a&gt; emerges as a groundbreaking table format designed to enhance the capabilities of data lakehouses by addressing several limitations of traditional table formats used in big data environments. As an open-source table format, Iceberg is engineered to improve data reliability, performance, and scalability for analytical workloads.&lt;/p&gt;
&lt;h3&gt;Key Features and Advantages of Apache Iceberg:&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Iceberg supports adding, renaming, and deleting columns while maintaining backward compatibility and forward compatibility. This feature ensures that data consumers can access the data even as the schema evolves, without the need for complex migration processes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hidden Partitioning&lt;/strong&gt;: Unlike traditional methods that require explicit partition paths in the directory structure, &lt;a href=&quot;https://www.dremio.com/blog/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/&quot;&gt;Iceberg handles partitioning behind the scenes&lt;/a&gt;. This approach simplifies data management and enhances query performance without the need for users to manage partition metadata manually.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Snapshot Isolation&lt;/strong&gt;: Iceberg provides snapshot isolation for data operations, enabling consistent and repeatable reads. This feature allows for concurrent reads and writes without impacting the integrity of the data being analyzed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Incremental Updates&lt;/strong&gt;: It supports atomic and incremental updates, deletions, and upserts, facilitating more dynamic and fine-grained data management strategies. This capability is crucial for real-time analytics and data science applications that require frequent updates to datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compatibility and Integration&lt;/strong&gt;: Apache Iceberg is designed to be compatible with popular compute engines like Apache Spark, Dremio, and Flink. This ensures that organizations can adopt Iceberg without having to replace their existing data processing tools and infrastructure.&lt;/p&gt;
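&lt;p&gt;Two of the features above, hidden partitioning and schema evolution, can be illustrated with a few Spark SQL statements. This is a hedged sketch: the table and column names are hypothetical, and the syntax follows Iceberg&apos;s documented Spark DDL extensions.&lt;/p&gt;

```python
# Iceberg Spark DDL statements illustrating hidden partitioning and
# schema evolution (collected as strings for illustration).
statements = [
    # Hidden partitioning: partition by a transform of a column; queries
    # filter on ts directly, never on a manually managed partition path.
    "CREATE TABLE demo.orders (id BIGINT, ts TIMESTAMP) "
    "USING iceberg PARTITIONED BY (days(ts))",
    # Schema evolution: add and rename columns without rewriting data files.
    "ALTER TABLE demo.orders ADD COLUMN amount DECIMAL(10, 2)",
    "ALTER TABLE demo.orders RENAME COLUMN amount TO total",
]
```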
&lt;p&gt;By addressing these critical areas, Apache Iceberg significantly enhances the efficiency and reliability of data lakehouse architectures, making it a cornerstone technology for modern data management.&lt;/p&gt;
&lt;h2&gt;Nessie - Data Lakehouse Catalog with Git-like Versioning&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://projectnessie.org/&quot;&gt;Project Nessie&lt;/a&gt; introduces the concept of version control to the world of data lakehouses, akin to what Git has done for source code. Nessie enables data engineers and data scientists to manage and maintain versions of their data lakehouse catalog, providing a robust framework for data experimentation, rollback, and governance.&lt;/p&gt;
&lt;h3&gt;Understanding Nessie&apos;s Role and Benefits:&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Data Versioning&lt;/strong&gt;: Nessie allows for versioning of entire data lakehouses, enabling users to track changes, experiment with datasets, and roll back to previous states if necessary. This capability is especially valuable in complex analytical environments where changes to data models and structures are frequent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Branching and Merging&lt;/strong&gt;: Similar to Git, Nessie supports branching and merging of data changes. This feature facilitates parallel development and experimentation with data, allowing teams to work on different versions of datasets simultaneously without interfering with each other&apos;s work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Simplified Data Governance&lt;/strong&gt;: With Nessie, organizations can enforce better governance and compliance practices by maintaining a clear history of data changes, access, and usage. This transparency is critical for adhering to regulatory requirements and conducting audits.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Enhanced Collaboration&lt;/strong&gt;: By enabling branching and merging, Nessie fosters a collaborative environment where data practitioners can share, review, and integrate data changes more efficiently. This collaborative workflow is crucial for accelerating innovation and ensuring data quality in analytics projects.&lt;/p&gt;
&lt;p&gt;Nessie&apos;s introduction of version control for data fundamentally changes how organizations approach data management, offering a more flexible, secure, and collaborative environment for handling complex data landscapes.&lt;/p&gt;
&lt;h2&gt;Use Cases for Catalog Versioning&lt;/h2&gt;
&lt;p&gt;The Nessie approach to data lakehouse catalogs opens up a range of possibilities for how data is accessed, manipulated, and maintained over time. Here are several key use cases that highlight the importance and utility of catalog versioning:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A/B Testing for Data-Driven Decisions&lt;/strong&gt;: Catalog versioning enables organizations to maintain multiple versions of data simultaneously, allowing for A/B testing of different analytical models or business strategies. By comparing outcomes across different data sets, businesses can make more informed decisions based on empirical evidence.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Rollback and Recovery&lt;/strong&gt;: In the event of erroneous data manipulation or accidental deletion, catalog versioning allows for quick rollback to a previous state, ensuring data integrity and continuity of business operations. This capability is critical for minimizing the impact of mistakes and maintaining trust in data systems.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Collaborative Data Science and Engineering Workflows&lt;/strong&gt;: By enabling branching and merging of data changes, catalog versioning supports collaborative workflows among data scientists and engineers. Teams can work on different aspects of a project in isolation, then merge their changes into a unified dataset, thereby accelerating innovation and ensuring consistency.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compliance and Audit Trails&lt;/strong&gt;: Catalog versioning provides a comprehensive audit trail of changes to data, including who made changes, what changes were made, and when they were made. This level of transparency is invaluable for compliance with regulatory requirements and for conducting internal audits.&lt;/p&gt;
&lt;h2&gt;Dremio - The Data Lakehouse Platform&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/what-is-a-data-lakehouse-platform/&quot;&gt;Dremio is a platform&lt;/a&gt; designed to realize the full potential of the data lakehouse architecture. By integrating cutting-edge technologies like Apache Iceberg and Nessie, Dremio offers a seamless and efficient solution for managing, querying, and analyzing data at scale.&lt;/p&gt;
&lt;h3&gt;Key Features and Capabilities of Dremio:&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Data Virtualization&lt;/strong&gt;: Dremio&apos;s data virtualization capabilities allow users to access and query data across various sources as if it were stored in a single location. This eliminates the need for data movement and transformation, thereby reducing complexity and increasing efficiency.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;SQL-Based Data Access&lt;/strong&gt;: Dremio provides a SQL layer that enables analysts and data scientists to perform complex queries on data stored in a lakehouse using familiar SQL syntax. This feature democratizes data access, allowing a broader range of users to derive insights from big data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;High-Performance Analytics&lt;/strong&gt;: Leveraging Apache Arrow, &lt;a href=&quot;https://www.dremio.com/blog/how-dremio-delivers-fast-queries-on-object-storage-apache-arrow-reflections-and-the-columnar-cloud-cache/&quot;&gt;Dremio optimizes query performance, enabling lightning-fast analytics on large datasets&lt;/a&gt;. This capability is essential for real-time analytics and interactive data exploration.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Scalability and Flexibility&lt;/strong&gt;: Dremio is designed to scale horizontally, supporting the growth of data volumes and concurrent users without sacrificing performance. Its flexible architecture allows organizations to adapt to changing data needs and technologies over time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unified Data Governance and Security&lt;/strong&gt;: Dremio incorporates robust security features, including data access controls and encryption, to ensure data privacy and compliance. Its integration with Nessie further enhances governance by providing version control and auditability of data assets.&lt;/p&gt;
&lt;p&gt;By converging the benefits of Apache Iceberg, Nessie, and data virtualization into one easy-to-use platform, Dremio not only simplifies the data management landscape but also empowers organizations to harness the full value of their data in the lakehouse model. This integration paves the way for advanced analytics, machine learning, and data-driven decision-making, positioning Dremio at the forefront of the data lakehouse movement.&lt;/p&gt;
&lt;h2&gt;Integrating Apache Iceberg, Nessie, and Data Virtualization with Dremio&lt;/h2&gt;
&lt;p&gt;The integration of Apache Iceberg and Project Nessie with Dremio&apos;s data virtualization capabilities represents a significant advancement in data lakehouse technology. This combination addresses the complex challenges of data management at scale, providing a cohesive solution that enhances performance, flexibility, and governance. Here&apos;s how Dremio brings these technologies together to offer a powerful data lakehouse platform:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Leveraging Apache Iceberg for Scalable Data Management&lt;/strong&gt;: Dremio enables full DDL, DML, and table optimization for Apache Iceberg tables on any data lake source, such as Hadoop, Hive, Glue, S3, Nessie, ADLS, GCP, and more. Not only do you have full-featured freedom to work with Iceberg tables; Iceberg tables also fuel Dremio&apos;s unique query acceleration feature, &lt;a href=&quot;https://www.dremio.com/blog/bi-dashboard-acceleration-cubes-extracts-and-dremios-reflections/&quot;&gt;Reflections, which eliminates the need for materialized views, cubes, and BI extracts&lt;/a&gt;. In addition, any Iceberg table cataloged in Dremio&apos;s Nessie-based integrated catalog can have automatic table optimization enabled so it &amp;quot;just works&amp;quot;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Version Control and Governance with Nessie&lt;/strong&gt;: Incorporating Nessie into the data lakehouse architecture, Dremio introduces robust version control capabilities akin to Git for data. This allows for branching, committing, and merging of data changes, enabling collaborative workflows and easier management of data evolution. Nessie&apos;s versioning also supports data governance by providing an immutable audit log of changes, essential for compliance and data quality assurance. Nessie is the backbone of Dremio Cloud&apos;s integrated catalog that also provides an easy-to-use UI for managing commits, tags and branches.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data Virtualization for Seamless Data Access&lt;/strong&gt;: Dremio&apos;s data virtualization layer abstracts the complexity of data storage and format, presenting users with a unified view of data across the lakehouse. This enables seamless access and query capabilities across diverse data sources and formats, without the need for data movement or transformation. Data virtualization simplifies analytics, reduces time to insight, and democratizes data access across the organization.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Unified Analytics and AI Platform&lt;/strong&gt;: By integrating these technologies, Dremio transforms the data lakehouse into a comprehensive platform for analytics and AI. Users can perform complex SQL queries and bring the data to their desired environments and use cases with JDBC/ODBC, a REST API, or Apache Arrow Flight. This enables the data to fuel machine learning models and analytics directly on diverse and large-scale datasets. The platform&apos;s optimization for performance ensures that these operations are fast and efficient, catering to the needs of data-driven businesses.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The advent of the data lakehouse, powered by technologies like Apache Iceberg, Project Nessie, and Dremio&apos;s Data Lakehouse Platform, marks a new era in data management and analytics. This integrated approach addresses longstanding challenges in data scalability, performance, and governance, offering a versatile and powerful platform for organizations to leverage their data assets fully.&lt;/p&gt;
&lt;p&gt;In embracing these technologies, businesses can position themselves at the forefront of the data revolution, unlocking new opportunities for innovation, efficiency, and competitive advantage. The journey towards a more integrated, intelligent, and intuitive data ecosystem is just beginning, and the potential for transformation is boundless.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Build a Prototype Lakehouse on your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://bit.ly/am-dremio-get-started-external-blog&quot;&gt;Deploy Dremio into Production for Free&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;Apache Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Partitioning Practices in Apache Hive and Apache Iceberg</title><link>https://iceberglakehouse.com/posts/2024-2-partitioning-in-apache-hive-and-apache-iceberg/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-2-partitioning-in-apache-hive-and-apache-iceberg/</guid><description>
# Partitioning Practices in Apache Hive and Apache Iceberg

## Introduction
The efficiency of query execution is paramount. One of the key strategies...</description><pubDate>Mon, 12 Feb 2024 09:00:00 GMT</pubDate><content:encoded>&lt;h1&gt;Partitioning Practices in Apache Hive and Apache Iceberg&lt;/h1&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;The efficiency of query execution is paramount. One of the key strategies to optimize this efficiency is through the use of partitioning. Partitioning is a technique that can significantly speed up query performance by organizing data in a manner that aligns with how queries are executed. In this blog, we delve into the concept of partitioning, explore traditional partitioning practices and their associated bottlenecks, and compare the partitioning implementations in Apache Hive and Apache Iceberg to highlight the evolution of partitioning strategies.&lt;/p&gt;
&lt;h2&gt;What is Partitioning?&lt;/h2&gt;
&lt;p&gt;Partitioning is a data organization technique used in database and data management systems to improve query performance. By grouping similar rows together when writing data, partitioning ensures that queries access only the relevant slices of data, thereby reducing the amount of data scanned and speeding up query execution. For instance, consider a database table containing log entries. Queries against this table often search for entries within a specific time range. If the table is partitioned by the date of the event time, the database can quickly locate and access only the data relevant to the query&apos;s time range, skipping over unrelated data. This method is especially effective in big data environments where tables can contain billions of rows, making data retrieval efficiency critical.&lt;/p&gt;
&lt;h2&gt;Traditional Partitioning Practices and Bottlenecks&lt;/h2&gt;
&lt;p&gt;Traditionally, partitioning has been manually managed by database administrators and data engineers, who had to explicitly define partition columns and ensure that data was loaded into the correct partitions. This approach, while effective in some scenarios, introduces several bottlenecks and challenges:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Manual Partition Management&lt;/strong&gt;: The need to manually define and maintain partitions can be time-consuming and error-prone, especially in dynamic environments where data volume and access patterns change frequently.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explicit Partition Columns&lt;/strong&gt;: Traditional partitioning requires that partitions be represented as explicit columns in tables, complicating data insertion and querying processes. For example, inserting data into a partitioned table often requires specifying the partition key, and queries must include the partition column to avoid scanning the entire table.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inefficient Queries&lt;/strong&gt;: Lack of understanding of the table&apos;s physical layout can lead to inefficient queries. Users may inadvertently write queries that scan more data than necessary, leading to slower performance and increased computational costs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inflexibility&lt;/strong&gt;: Once a partitioning scheme is implemented, changing it can be difficult and disruptive. Altering the partitioning strategy often requires extensive data migration and can break existing queries, making the system less adaptable to evolving data and access patterns.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These traditional practices, while foundational, highlight the need for more advanced partitioning strategies that can address these challenges, as seen in newer systems like Apache Iceberg.&lt;/p&gt;
&lt;h2&gt;Partitioning in Apache Hive&lt;/h2&gt;
&lt;p&gt;Apache Hive is a data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Hive&apos;s approach to partitioning is straightforward but comes with its own set of challenges. In Hive, partitions are treated as explicit columns within a table. This model requires that data be inserted into specific partitions, often necessitating additional steps during data loading.&lt;/p&gt;
&lt;p&gt;For example, when inserting log data into a partitioned table, the insertion query must specify the partition key, as shown below:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO logs PARTITION (event_date)
  SELECT level, message, event_time, format_time(event_time, &apos;YYYY-MM-dd&apos;)
  FROM unstructured_log_source;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Queries against partitioned tables must also include the partition column to avoid scanning the entire table. This explicit handling of partitions ensures data is stored and accessed efficiently, but it places the burden of partition management on the user.&lt;/p&gt;
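&lt;p&gt;For example, a query against the table above would need to filter on the explicit partition column to benefit from partition pruning (table and column names here are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Filtering on the explicit event_date partition column lets Hive
-- prune partitions; filtering only on event_time would force a
-- full table scan.
SELECT level, message
FROM logs
WHERE event_date = &apos;2018-12-01&apos;;
&lt;/code&gt;&lt;/pre&gt;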
&lt;h3&gt;Problems with Hive Partitioning&lt;/h3&gt;
&lt;p&gt;The explicit partitioning model in Hive introduces several problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Manual Partition Specification:&lt;/strong&gt; Users must manually specify partition values during data insertion, increasing the complexity of data loading operations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silently Incorrect Results:&lt;/strong&gt; Incorrectly formatted partition values or incorrect column references can lead to silently incorrect query results, as there is no inherent validation of partition values against the data they represent.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inflexibility and Query Performance:&lt;/strong&gt; Hive&apos;s reliance on explicit partition columns can lead to inefficient queries if users are not intimately familiar with the table&apos;s partitioning scheme. Additionally, changing a table&apos;s partitioning strategy can require significant effort and potentially disrupt existing queries.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Apache Iceberg&apos;s Approach to Partitioning&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/apache-iceberg-101-your-guide-to-learning-apache-iceberg-concepts-and-practices/&quot;&gt;Apache Iceberg, a newer table format designed for big data&lt;/a&gt;, introduces several innovations in partitioning that address the limitations found in systems like Apache Hive. Iceberg implements &lt;a href=&quot;https://www.dremio.com/subsurface/fewer-accidental-full-table-scans-brought-to-you-by-apache-icebergs-hidden-partitioning/&quot;&gt;hidden partitioning, where the partitioning scheme is managed internally&lt;/a&gt;, and partition columns are not required to be specified by users during data insertion or querying.&lt;/p&gt;
&lt;p&gt;Iceberg handles partitioning transparently by automatically determining the appropriate partition for each row based on the table&apos;s partitioning configuration. For example, Iceberg can partition a logs table by event_time without requiring the event_time to be explicitly specified as a partition column in queries:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT level, message FROM logs
WHERE event_time BETWEEN &apos;2018-12-01 10:00:00&apos; AND &apos;2018-12-01 12:00:00&apos;;
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Key Features of Iceberg Partitioning&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Hidden Partitioning:&lt;/strong&gt; Iceberg automates the creation of partition values based on the configured partitioning schema, eliminating the need for users to manually manage partition columns.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automatic Partition Skipping:&lt;/strong&gt; By tracking partition metadata, Iceberg efficiently skips irrelevant partitions during query execution, significantly improving query performance without requiring additional user input.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Partition Evolution:&lt;/strong&gt; &lt;a href=&quot;https://www.dremio.com/subsurface/future-proof-partitioning-and-fewer-table-rewrites-with-apache-iceberg/&quot;&gt;Iceberg&apos;s partitioning scheme can be evolved over time without affecting existing data or queries&lt;/a&gt;, allowing for the dynamic optimization of data layout as access patterns change.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These features make Apache Iceberg an attractive option for managing large-scale data lakes, providing flexibility, ease of use, and performance improvements over traditional partitioning methods.&lt;/p&gt;
&lt;h2&gt;Key Differences and Advantages of Iceberg&apos;s Partitioning&lt;/h2&gt;
&lt;p&gt;Apache Iceberg&apos;s partitioning mechanism offers several key differences and advantages over Apache Hive&apos;s traditional partitioning approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hidden Partitioning vs. Explicit Partitions&lt;/strong&gt;: Unlike Hive, where partitions must be explicitly defined and managed by the user, Iceberg abstracts partition details away from the user. This hidden partitioning simplifies data ingestion and querying by removing the need for users to understand or specify partition columns.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automatic Partition Value Generation&lt;/strong&gt;: Iceberg automatically generates partition values based on the data being inserted, ensuring that data is correctly and efficiently organized without manual intervention. This contrasts with Hive, where users must manually specify partition values, leading to potential errors and inefficiencies.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Partition Evolution&lt;/strong&gt;: Iceberg supports changing the partitioning scheme of a table without needing to rewrite the data or disrupt existing queries. This flexibility allows Iceberg tables to adapt to changing access patterns and data volumes over time, a feature not readily supported in Hive.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improved Query Performance&lt;/strong&gt;: By automatically skipping irrelevant partitions and utilizing more granular partitioning strategies (e.g., partitioning by day or hour rather than just by date), Iceberg can offer superior query performance, especially for large datasets.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Partition Transforms and Evolution in Iceberg&lt;/h2&gt;
&lt;p&gt;Iceberg introduces the concept of partition transforms, which allow for sophisticated partitioning strategies beyond simple column-based partitioning. These transforms include partitioning by identity (direct mapping), year, month, day, hour, and even bucketing, which groups data into a fixed number of buckets based on hashing. Such flexibility enables more efficient data organization and faster query performance by closely aligning the partitioning scheme with the query patterns.&lt;/p&gt;
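&lt;p&gt;As an illustrative sketch of how transforms are declared (using Spark SQL DDL; the table and column names are hypothetical), a single partition spec can combine several transforms:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Partition by the day of event_time and hash device_id into 16 buckets;
-- neither transform appears as a column users must populate or filter on.
CREATE TABLE logs (
  level STRING,
  message STRING,
  event_time TIMESTAMP,
  device_id BIGINT
) USING iceberg
PARTITIONED BY (day(event_time), bucket(16, device_id));
&lt;/code&gt;&lt;/pre&gt;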
&lt;h3&gt;Partition Evolution&lt;/h3&gt;
&lt;p&gt;One of the standout features of Iceberg is its support for evolving a table&apos;s partitioning scheme. As the needs of an organization change, so too can the way its data is partitioned, without the costly and complex process of data migration. Iceberg supports adding, dropping, and modifying partitions as part of its schema evolution capabilities. This process is seamless to end-users, who continue to query the table as if nothing has changed, benefiting from improved performance and efficiency.&lt;/p&gt;
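&lt;p&gt;In Spark SQL, for instance, evolving the spec is a metadata-only operation (shown as a sketch; the table name is illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Move from daily to hourly partitioning; existing files keep their
-- old layout, and only newly written data uses the new spec.
ALTER TABLE logs ADD PARTITION FIELD hour(event_time);
ALTER TABLE logs DROP PARTITION FIELD day(event_time);
&lt;/code&gt;&lt;/pre&gt;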
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The evolution of partitioning practices from traditional models like Apache Hive to advanced systems like Apache Iceberg represents a significant step forward in data management and analytics. Iceberg&apos;s approach to partitioning, with features like hidden partitioning, automatic partition value generation, and the ability to evolve partition schemes, offers a level of flexibility, efficiency, and ease of use that is well-suited to the demands of modern big data ecosystems. As organizations continue to seek ways to efficiently manage and analyze vast amounts of data, the innovations provided by Apache Iceberg are likely to play a critical role in shaping the future of data storage and access.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Build a Data Lakehouse with Dremio/Iceberg on your laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-get-started-partner-blog&quot;&gt;Learn more about Dremio and Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;Iceberg Lakehouse Engineering Video Playlist&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;For more details on Apache Hive and its partitioning features, visit the official &lt;a href=&quot;https://hive.apache.org/&quot;&gt;Apache Hive documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;To learn more about Apache Iceberg and its advanced partitioning capabilities, refer to the &lt;a href=&quot;https://iceberg.apache.org/&quot;&gt;Apache Iceberg documentation&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Embracing the Future of Data Management - Why Choose Lakehouse, Iceberg, and Dremio?</title><link>https://iceberglakehouse.com/posts/2024-1-why-choose-lakehouse-iceberg-dremio/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-1-why-choose-lakehouse-iceberg-dremio/</guid><description>
Data is not just an asset but the cornerstone of business strategy. The way we manage, store, and process this invaluable resource has evolved dramat...</description><pubDate>Thu, 25 Jan 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Data is not just an asset but the cornerstone of business strategy. The way we manage, store, and process this invaluable resource has evolved dramatically. The traditional boundaries of data warehouses and lakes are blurring, giving rise to a new, more integrated approach: the &lt;strong&gt;Data Lakehouse&lt;/strong&gt;. This innovative architecture combines the expansive storage capabilities of data lakes with the structured management and processing power of data warehouses, offering an unparalleled solution for modern data needs.&lt;/p&gt;
&lt;p&gt;When it comes to &lt;a href=&quot;https://www.dremio.com/solutions/data-lakehouse/&quot;&gt;Data Lakehouses&lt;/a&gt;, technologies like &lt;a href=&quot;https://bit.ly/am-iceberg-101&quot;&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;&lt;/a&gt; and &lt;a href=&quot;https://bit.ly/am-dremio-get-started-external-blog&quot;&gt;&lt;strong&gt;Dremio&lt;/strong&gt;&lt;/a&gt; have emerged as frontrunners, each bringing unique strengths to the table. &lt;a href=&quot;https://bit.ly/am-iceberg-101&quot;&gt;Apache Iceberg&lt;/a&gt;, an open table format, is gaining traction for its robustness and flexibility in handling large-scale data across different platforms. Meanwhile, &lt;a href=&quot;https://bit.ly/am-dremio-get-started-external-blog&quot;&gt;Dremio&lt;/a&gt; stands out as a comprehensive solution, integrating seamlessly with Iceberg to provide advanced data virtualization, query engine capabilities, and a robust semantic layer.&lt;/p&gt;
&lt;p&gt;In this blog, we&apos;ll dive deep into why these technologies are not just buzzwords but essential tools in the arsenal of any data-driven organization. We&apos;ll explore the synergies between Data Lakehouses, Apache Iceberg, and Dremio, and how they collectively pave the way for a more agile, efficient, and future-proof data management strategy.&lt;/p&gt;
&lt;h2&gt;The Rise of the Data Lakehouse&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Data Lakehouse&lt;/strong&gt;: A &lt;a href=&quot;https://www.dremio.com/resources/guides/what-is-a-data-lakehouse/&quot;&gt;data lakehouse&lt;/a&gt; is a pattern of using formats, tools, and platforms to build the type of performance and accessibility normally associated with data warehouses on top of your data lake storage, reducing the need to duplicate data while providing the scalability and flexibility of the data lake.&lt;/p&gt;
&lt;h3&gt;Why Data Lakehouses?&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Single Storage, Multiple Tools:&lt;/strong&gt; Data Lakehouses eliminate the traditional silos between data lakes and warehouses. They offer a single copy of your data for all types of data workloads across multiple tools - from machine learning and data science to BI and analytics.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Flexibility and Scalability:&lt;/strong&gt; With a Data Lakehouse, businesses gain the flexibility to handle diverse data formats and the scalability to manage growing data volumes. This adaptability is crucial in an era where data variety and volume are exploding and the time in which data must be delivered is shrinking.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enhanced Performance:&lt;/strong&gt; By bringing together the best of lakes and warehouses, Data Lakehouses offer improved performance for both ad-hoc analytics and complex data transformations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost-Effectiveness:&lt;/strong&gt; Reducing the need for multiple systems and integrating storage and processing capabilities, Data Lakehouses can lead to significant cost savings.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The Concept of &amp;quot;Shifting Left&amp;quot; in Data Warehousing&lt;/h3&gt;
&lt;p&gt;The idea of &amp;quot;Shifting Left&amp;quot; in data warehousing refers to performing data quality, governance, and processing earlier in the data lifecycle. This approach, which is inherent to Data Lakehouses, ensures higher data quality and more efficient data processing. It allows organizations to leverage the benefits of flexibility, scalability, performance, and cost savings right from the early stages of data handling.&lt;/p&gt;
&lt;p&gt;Data Lakehouses are not just a technological advancement; they are a strategic evolution in data management, aligning with the dynamic needs of modern enterprises. They stand at the forefront of the big data revolution, redefining how organizations store, process, and extract value from their data.&lt;/p&gt;
&lt;h2&gt;The Role of Apache Iceberg in the Data Lakehouse&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-101&quot;&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt;&lt;/a&gt; is an open table format that has been gaining widespread recognition for its ability to manage large-scale data across various platforms. But what makes Apache Iceberg a critical component in modern data architectures, particularly in Data Lakehouses?&lt;/p&gt;
&lt;h3&gt;Key Features of Apache Iceberg&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Broad Tool Compatibility:&lt;/strong&gt; Apache Iceberg excels in its compatibility with a myriad of tools for both reading and writing data. This versatility ensures that Iceberg tables can be easily integrated into existing data pipelines.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Platform Agnostic:&lt;/strong&gt; Whether you&apos;re operating on-premises, in the cloud, or in a hybrid environment, Apache Iceberg&apos;s platform-agnostic nature makes it a universally adaptable solution for data table management.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Robustness in Data Management:&lt;/strong&gt; Apache Iceberg provides superior handling of large datasets, including features like schema evolution, hidden partitioning, and efficient upserts, making it an ideal choice for complex data operations.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Community-Driven Development&lt;/h3&gt;
&lt;p&gt;One of Apache Iceberg&apos;s most significant strengths lies in its community-driven approach. With transparent discussions, public email lists, regular meetings, and a dedicated Slack channel, Iceberg fosters an open and collaborative development environment. This transparency ensures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reliability:&lt;/strong&gt; No unexpected changes to data formats that businesses rely on. All changes and new features are discussed and broadcasted well in advance of their release to allow enterprises to plan accordingly.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accessibility:&lt;/strong&gt; Features and improvements are not locked behind proprietary systems but are available to the entire community.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Innovation:&lt;/strong&gt; Ongoing contributions from a diverse set of developers and organizations drive continuous innovation and improvement.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Apache Iceberg&apos;s tool compatibility and community-driven nature make it an invaluable asset in implementing data lakehouses.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;VIDEO PLAYLIST: Apache Iceberg Lakehouse Engineering&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Dremio: A Comprehensive Data Lakehouse Solution&lt;/h2&gt;
&lt;p&gt;While the concept of a Data Lakehouse is revolutionary, its true potential is unlocked when paired with the right technology. This is where &lt;a href=&quot;https://bit.ly/am-dremio-get-started-external-blog&quot;&gt;&lt;strong&gt;Dremio&lt;/strong&gt;&lt;/a&gt; enters the picture as a standout platform in the Apache Iceberg ecosystem. Dremio is a comprehensive solution that enhances the capabilities of Data Lakehouses and Apache Iceberg tables. Let&apos;s delve into why Dremio is an integral part of this modern data architecture.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;TUTORIAL: Build a Prototype Data Lakehouse on your Laptop&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Dremio&apos;s Standout Features&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Seamless Integration with Iceberg:&lt;/strong&gt; The Dremio Lakehouse Platform includes a high-performance &lt;a href=&quot;https://www.dremio.com/blog/connecting-to-dremio-using-apache-arrow-flight-in-python/&quot;&gt;Apache Arrow based&lt;/a&gt; query engine that not only makes querying Apache Iceberg tables fast, but allows you to &lt;a href=&quot;https://www.dremio.com/blog/overcoming-data-silos-how-dremio-unifies-disparate-data-sources-for-seamless-analytics/&quot;&gt;unify your Iceberg&lt;/a&gt; tables with data from different databases, data lakes and data warehouses with its &lt;a href=&quot;https://docs.dremio.com/cloud/sonar/data-sources/&quot;&gt;data virtualization/data federation&lt;/a&gt; features.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;User-Friendly Semantic Layer:&lt;/strong&gt; Dremio provides a &lt;a href=&quot;https://www.dremio.com/resources/guides/what-is-a-semantic-layer/&quot;&gt;manageable and governed semantic layer&lt;/a&gt;. This feature &lt;a href=&quot;https://www.dremio.com/blog/virtual-data-marts-101-the-benefits-and-how-to/&quot;&gt;simplifies the provision of tables and logical views&lt;/a&gt;, making &lt;a href=&quot;https://docs.dremio.com/cloud/security/access-control/&quot;&gt;data access control&lt;/a&gt; more straightforward and effective.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Efficient Data Ingestion and Management:&lt;/strong&gt; With capabilities like &lt;a href=&quot;https://docs.dremio.com/cloud/reference/sql/commands/merge/&quot;&gt;MERGE INTO&lt;/a&gt; for upserts and &lt;a href=&quot;https://docs.dremio.com/cloud/reference/sql/commands/copy-into-table/&quot;&gt;COPY INTO&lt;/a&gt; for loading various file formats, Dremio streamlines the process of data ingestion. To make things even more turn-key you can use orchestration &lt;a href=&quot;https://www.dremio.com/blog/using-dbt-to-manage-your-dremio-semantic-layer/&quot;&gt;tools like dbt to orchestrate your transformations&lt;/a&gt; and curation of the semantic layer in Dremio.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Automated Iceberg Table Management:&lt;/strong&gt; &lt;a href=&quot;https://docs.dremio.com/cloud/arctic/automatic-optimization&quot;&gt;Dremio automates the optimization and cleanup of Iceberg tables&lt;/a&gt; as part of its data lakehouse management features, allowing you to focus on analyzing your data instead of managing it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Versioning and Transaction Isolation:&lt;/strong&gt; The platform offers &lt;a href=&quot;https://docs.dremio.com/cloud/arctic/data-branching/&quot;&gt;catalog-level versioning and git-like semantics&lt;/a&gt;, essential for maintaining data consistency and reliability in complex environments.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optimized Data Representations:&lt;/strong&gt; Dremio utilizes &lt;a href=&quot;https://docs.dremio.com/cloud/sonar/reflections/&quot;&gt;reflections which are flexible, Apache Iceberg-based data representations to optimize data storage and access&lt;/a&gt;, significantly &lt;a href=&quot;https://www.dremio.com/blog/bi-dashboard-acceleration-cubes-extracts-and-dremios-reflections/&quot;&gt;speeding up BI dashboards and queries&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Diverse Data Delivery Interfaces:&lt;/strong&gt; Catering to various user needs, Dremio supports multiple interfaces, including JDBC/ODBC, REST API, and Apache Arrow Flight, ensuring flexible data access and delivery.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
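&lt;p&gt;As a brief sketch of the ingestion commands mentioned above (the table, column, and storage-location names are illustrative, not from a real deployment):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Bulk-load Parquet files from a configured source, then upsert changes
COPY INTO sales FROM &apos;@s3_source/sales/&apos; FILE_FORMAT &apos;parquet&apos;;

MERGE INTO sales AS t
USING sales_updates AS u
ON t.order_id = u.order_id
WHEN MATCHED THEN UPDATE SET amount = u.amount
WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (u.order_id, u.amount);
&lt;/code&gt;&lt;/pre&gt;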
&lt;h3&gt;Embracing Open Source and Open Architecture&lt;/h3&gt;
&lt;p&gt;Dremio&apos;s commitment to open source and open architecture is a key factor in its appeal. This approach ensures that your data remains within your control and storage, aligning with modern principles of Data Virtualization and Semantic Layers. Dremio is the open lakehouse platform, embodying the essence of flexibility, scalability, and control in data management.&lt;/p&gt;
&lt;p&gt;Dremio acts as the bridge connecting the vast capabilities of Data Lakehouses and the structured efficiency of Apache Iceberg. Its comprehensive set of features makes it an indispensable tool for businesses looking to harness the full potential of an Apache Iceberg-based Data Lakehouse.&lt;/p&gt;
&lt;h2&gt;Paving the Way for Open Data Lakehouses&lt;/h2&gt;
&lt;p&gt;As we&apos;ve explored throughout this blog, the combination of Data Lakehouses, Apache Iceberg, and Dremio represents a significant leap forward in the world of data management. This trio brings together the best aspects of flexibility, scalability, and efficiency, addressing the complex data challenges faced by modern businesses.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Lakehouses&lt;/strong&gt; offer a singular, scalable platform, blending the strengths of data lakes and warehouses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; stands out for its robust table format, broad compatibility, and community-driven innovation, making it an ideal choice for diverse and large-scale data operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Dremio&lt;/strong&gt; shines as a comprehensive solution that not only complements Iceberg but also brings additional capabilities like efficient data ingestion, federated data access, automated lakehouse table management, and a versatile semantic layer.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Whether you are just starting on your data journey or looking to enhance your existing infrastructure, considering implementing an &lt;a href=&quot;https://bit.ly/am-dremio-get-started-external-blog&quot;&gt;Open Data Lakehouse with Dremio&lt;/a&gt; could be the key to unlocking a new realm of possibilities in data access and analytics.&lt;/p&gt;
&lt;p&gt;Remember, the future of data is not just about storing vast amounts of information; it&apos;s about managing, processing, and utilizing that data in the most efficient, reliable, and scalable way possible. And with Data Lakehouses, Apache Iceberg, and Dremio, you&apos;re well-equipped to navigate this future.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Create a Prototype Dremio Lakehouse on your Laptop with this tutorial&lt;/a&gt;&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Open Lakehouse Engineering/Apache Iceberg Lakehouse Engineering - A Directory of Resources</title><link>https://iceberglakehouse.com/posts/2024-1-open-lakehouse-engineering-resources/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-1-open-lakehouse-engineering-resources/</guid><description>
The concept of the **Open Lakehouse** has emerged as a beacon of flexibility and innovation. An Open Lakehouse represents a specialized form data lak...</description><pubDate>Fri, 19 Jan 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The concept of the &lt;strong&gt;Open Lakehouse&lt;/strong&gt; has emerged as a beacon of flexibility and innovation. An Open Lakehouse represents a specialized form of data lakehouse (bringing data warehouse-like functionality and performance to data on a data lake), uniquely characterized by its commitment to open standards and technologies. At the core of this paradigm are tools like Apache Iceberg, Nessie, and Apache Arrow, which collectively empower organizations to build highly efficient, scalable, and interoperable data ecosystems.&lt;/p&gt;
&lt;p&gt;Unlike conventional data lakehouses which may have high levels of coupling between the storage formats, governance, optimization and more of their data with one vendor with few alternatives, an Open Lakehouse prioritizes the avoidance of vendor lock-in, ensuring that organizations maintain full control over their data infrastructure. This approach not only fosters a more adaptable and resilient data environment but also encourages a collaborative, community-driven development ethos that is instrumental in driving the field forward.&lt;/p&gt;
&lt;p&gt;A key platform enabling open lakehouses is Dremio, a cutting-edge lakehouse platform that epitomizes the Open Lakehouse philosophy. Dremio seamlessly integrates various data sources, leveraging the power of open-source technologies to unify data management and analytics. This integration allows for an unprecedented level of flexibility and efficiency, making Dremio an indispensable tool for organizations looking to harness the full potential of their data. Dremio maximizes decentralization by combining the right features for data virtualization (decentralized data), the data lakehouse (decentralized access to a single copy of a dataset), and data mesh (decentralized data curation).&lt;/p&gt;
&lt;p&gt;This directory serves as a comprehensive resource for anyone looking to dive into the world of Open Lakehouse Engineering. Whether you&apos;re a seasoned data professional or just starting out, the following resources will guide you through the intricacies of building and managing an Open Lakehouse, ensuring you&apos;re well-equipped to leverage these exciting technologies to their fullest extent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you are new to the data space, I recommend starting with &lt;a href=&quot;https://bit.ly/am-intro-to-data&quot;&gt;this playlist&lt;/a&gt;, which covers lakehouse engineering, modeling, big data concepts, and more&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Getting Started with Open Lakehouses&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=G_dbypufGXc&quot;&gt;No Code Setup of a Data Lakehouse on your Laptop with Dremio &amp;amp; Minio using Docker Desktop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-lakehouse-engineering&quot;&gt;Video Playlist: Apache Iceberg Lakehouse Engineering&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dremio-lakehouse-laptop&quot;&gt;Blog: Creating an Iceberg Lakehouse on your Laptop with Dremio/Minio/Nessie&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-iceberg-101&quot;&gt;Blog: Apache Iceberg 101 - Comprehensive List of Resources&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-bi-dashboards-acceleration&quot;&gt;Blog: BI Dashboard Acceleration: Cubes, Extracts, and Dremio’s Reflections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-5-use-cases-dremio&quot;&gt;Blog: 5 Use Cases for the Dremio Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Hands-on Articles&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-spark-dremio-lakehouse&quot;&gt;Blog: Creating an Iceberg Lakehouse with Spark, Minio, Dremio, Nessie&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-dbt-internal&quot;&gt;Blog: Using dbt to Manage Your Dremio Semantic Layer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-arrow-python-dremio&quot;&gt;Blog: Connecting to Dremio Using Apache Arrow Flight in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/exploring-the-architecture-of-apache-iceberg-delta-lake-and-apache-hudi/&quot;&gt;Blog: Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/how-to-create-a-lakehouse-with-airbyte-s3-apache-iceberg-and-dremio/&quot;&gt;Blog: How to Create a Lakehouse with Airbyte, S3, Apache Iceberg, and Dremio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/using-flink-with-apache-iceberg-and-nessie/&quot;&gt;Blog: Using Flink with Apache Iceberg and Nessie&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/3-ways-to-use-python-with-apache-iceberg/&quot;&gt;Blog: 3 Ways to Use Python with Apache Iceberg&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/using-duckdb-with-your-dremio-data-lakehouse/&quot;&gt;Blog: Using DuckDB with Your Dremio Data Lakehouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/3-ways-to-convert-a-delta-lake-table-into-an-apache-iceberg-table/&quot;&gt;Blog: 3 Ways to Convert a Delta Lake Table Into an Apache Iceberg Table&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/getting-started-with-project-nessie-apache-iceberg-and-apache-spark-using-docker/&quot;&gt;Blog: Getting Started with Project Nessie, Apache Iceberg, and Apache Spark Using Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=604i8vaukZs&quot;&gt;Video: Apache Superset &amp;amp; Dremio: How to Run Superset from Docker and Connect to Dremio Cloud&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conceptual Content&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://bit.ly/am-virtual-data-marts&quot;&gt;Blog: Virtual Data Marts 101 - The Benefits and How-To&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.datalakehouse.help&quot;&gt;Docs: Data Lakehouse Terms and Concepts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/the-who-what-and-why-of-data-products/&quot;&gt;Blog: The Who, What, and Why of Data Products&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/why-use-dremio-to-implement-a-data-mesh/&quot;&gt;Blog: Why Use Dremio to Implement a Data Mesh?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/overcoming-data-silos-how-dremio-unifies-disparate-data-sources-for-seamless-analytics/&quot;&gt;Blog: Overcoming Data Silos - How Dremio Unifies Disparate Data Sources for Seamless Analytics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=ccNxVQfkSOg&quot;&gt;Video: Where Data Lakehouse and DataOps/Data-as-Code Converge (Project Nessie &amp;amp; Dremio Arctic)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=bvXj4ANMy10&quot;&gt;Video: From Data Lake to Data Lakehouse (What, Why and How of Apache Iceberg/Dremio/Nessie Lakehouses)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Nessie - An Alternative to Hive &amp; JDBC for Self-Managed Apache Iceberg Catalogs</title><link>https://iceberglakehouse.com/posts/2024-1-nessie-an-alternative-to-hive/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-1-nessie-an-alternative-to-hive/</guid><description>
Unlike traditional table formats, Apache Iceberg provides a comprehensive solution for handling big data&apos;s complexity, volume, and diversity. It&apos;s de...</description><pubDate>Mon, 08 Jan 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Unlike traditional table formats, Apache Iceberg provides a comprehensive solution for handling big data&apos;s complexity, volume, and diversity. It&apos;s designed to improve data processing in various analytics engines like Apache Spark, Apache Flink, and others. One of Iceberg&apos;s key features is its ability to maintain massive datasets efficiently while ensuring reliable data snapshots, schema evolution, and hidden partitioning.&lt;/p&gt;
&lt;p&gt;However, the utility of Apache Iceberg is greatly enhanced by the use of a catalog. A catalog in the context of Iceberg tables is essentially a metadata management tool that tracks table locations, schema versions, and other critical information. The primary role of a catalog is to simplify the portability of tables between different computing tools and environments. It acts as a centralized repository for all table-related metadata, making it easier for users to manage, access, and evolve their data structures without losing consistency or integrity.&lt;/p&gt;
&lt;p&gt;The need for a catalog becomes particularly crucial in complex data environments. As organizations increasingly migrate their workloads across various cloud platforms and processing tools, maintaining consistency and accessibility of data becomes a significant challenge. A robust catalog system ensures that tables are easily portable across different environments without losing their essential characteristics. This portability is critical for businesses that require flexibility in their data processing and analytics operations, enabling them to leverage the best tools for each specific task without being hindered by compatibility issues.&lt;/p&gt;
&lt;h2&gt;Limitations of Traditional Self-Managed Catalog Options&lt;/h2&gt;
&lt;p&gt;Traditionally, data engineers have relied on self-managed catalog options like Hive Metastore and JDBC catalogs (MySQL, PostgreSQL, etc.). While these systems have been instrumental in the evolution of data management, they come with their own set of challenges, particularly when integrated with Apache Iceberg tables.&lt;/p&gt;
&lt;p&gt;Firstly, configuring and deploying these traditional catalogs can be a tedious and complex process. They often require significant effort to set up and maintain, especially in dynamic and scalable cloud environments. This complexity can lead to increased operational overhead and a greater potential for misconfiguration, which can be detrimental to fast-paced data operations.&lt;/p&gt;
&lt;p&gt;Moreover, Hive and JDBC catalogs may not fully leverage the latest features offered by Apache Iceberg. Many new features are now only being introduced through the &amp;quot;REST catalog&amp;quot; OpenAPI specification, which has no open-source or self-managed implementation. Right now, the only option that is open-source and self-managed that adds new functionality to Apache Iceberg tables is &lt;a href=&quot;https://www.projectnessie.org&quot;&gt;Nessie&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;The Rising Need for Self-Managed Infrastructure&lt;/h2&gt;
&lt;p&gt;Despite the challenges, the need for self-managed catalog infrastructure is more relevant than ever, driven primarily by regulatory and security reasons. Many organizations operate under strict data governance and compliance requirements. These regulations often mandate specific data handling, storage, and processing protocols, which can be challenging to adhere to with third-party managed services.&lt;/p&gt;
&lt;p&gt;Self-managed catalogs offer greater control over data, allowing organizations to implement customized security measures, comply with specific regulations, and maintain data sovereignty. This control is crucial for businesses handling sensitive information or operating in heavily regulated industries like finance, healthcare, and government.&lt;/p&gt;
&lt;h2&gt;Nessie: Bridging the Gap in Catalog Management&lt;/h2&gt;
&lt;p&gt;Project Nessie is an innovative open-source technology that revolutionizes the way Apache Iceberg catalogs are managed. Designed to enable new possibilities over traditional systems like Hive Metastore and JDBC catalogs, Nessie introduces a new paradigm in catalog management, one that is more aligned with the modern requirements of big data analytics.&lt;/p&gt;
&lt;h2&gt;Why Nessie Stands Out&lt;/h2&gt;
&lt;p&gt;Nessie is not just another catalog; it&apos;s a solution crafted with the complexities and challenges of modern data architectures in mind. Here&apos;s why Nessie is rapidly becoming the go-to choice for managing Apache Iceberg catalogs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Open-Source and Self-Managed&lt;/strong&gt;: Nessie is an open-source project, making it accessible and modifiable according to specific organizational needs. This aspect is particularly appealing for teams looking to implement a self-managed infrastructure, providing them the flexibility to tailor the catalog to their regulatory and security requirements. That said, cloud-managed Nessie-based catalogs are offered by &lt;a href=&quot;https://www.dremio.com/get-started&quot;&gt;Dremio&lt;/a&gt;, which also include automated table maintenance and cleanup.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Compatibility with Leading Data Tools&lt;/strong&gt;: One of Nessie&apos;s significant advantages is its compatibility with a wide range of data processing tools. It seamlessly integrates with popular engines like Dremio, Apache Spark, Apache Flink, Presto and Trino. This compatibility ensures that organizations can continue using their preferred tools without worrying about catalog compatibility issues.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Enabling Advanced Features&lt;/strong&gt;: Nessie is not just about maintaining the status quo; it&apos;s about pushing the boundaries. It enables features that are not fully supported by traditional catalogs, such as catalog versioning, zero-copy clones, catalog-level rollbacks, and multi-table transactions. These features bring a new level of efficiency and capability to data management, allowing for more complex and sophisticated data operations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Simplifying Data Operations&lt;/strong&gt;: With Nessie, the complexity of managing large-scale data across different environments is greatly reduced. Its ability to handle versioning and rollback at the catalog level simplifies data governance and compliance, a crucial aspect for many organizations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Future-Proofing Data Management&lt;/strong&gt;: As data architectures continue to evolve, Nessie offers a future-proof solution. Its design is inherently scalable and adaptable, ready to meet the growing demands of big data analytics and the ever-changing landscape of data management technologies.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
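&lt;p&gt;As a rough sketch of what these capabilities look like in practice, the following Dremio-flavored SQL captures a zero-copy tag of the catalog, isolates a change spanning two tables on its own branch, and merges both changes atomically. The table, branch, and tag names here are illustrative, not taken from any specific tutorial:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Zero-copy snapshot of the entire catalog before a risky change
CREATE TAG before_archive_job;

-- Isolate a multi-table change on its own branch
CREATE BRANCH archive_job;
USE BRANCH archive_job;
UPDATE orders SET status = &apos;archived&apos; WHERE orderDate &amp;lt; &apos;2023-01-01&apos;;
DELETE FROM order_items WHERE orderId IN (SELECT id FROM orders WHERE status = &apos;archived&apos;);

-- Both table changes land on main in a single atomic merge (a multi-table transaction)
USE BRANCH main;
MERGE BRANCH archive_job INTO main;
&lt;/code&gt;&lt;/pre&gt;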
&lt;h2&gt;Getting Hands-On with Nessie&lt;/h2&gt;
&lt;p&gt;Here are several articles for getting hands-on with Nessie locally on your laptop so you can see its power in action:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;BLOG: Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://dev.to/alexmercedcoder/data-engineering-create-a-apache-iceberg-based-data-lakehouse-on-your-laptop-41a8&quot;&gt;BLOG: Data Engineering: Create a Apache Iceberg based Data Lakehouse on your Laptop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=JIrjkEWhgNE&quot;&gt;Video: Setting up a Dremio/Nessie Lakehouse on your Laptop for Evaluation in less than 10 minutes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=SIriNcVIGJQ&amp;amp;list=PLsLAVBjQJO0p0Yq1fLkoHvt2lEJj5pcYe&quot;&gt;Video: Playlist - Apache Iceberg Lakehouse Engineering&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content:encoded><author>Alex Merced</author></item><item><title>Apache Iceberg, Git-Like Catalog Versioning and Data Lakehouse Management - Pillars of a Robust Data Lakehouse Platform</title><link>https://iceberglakehouse.com/posts/2024-1-apache-iceberg-git-life-catalog-versioning/</link><guid isPermaLink="true">https://iceberglakehouse.com/posts/2024-1-apache-iceberg-git-life-catalog-versioning/</guid><description>
Managing vast amounts of data efficiently and effectively is crucial for any organization aiming to leverage its data for strategic decisions. The ke...</description><pubDate>Wed, 03 Jan 2024 09:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Managing vast amounts of data efficiently and effectively is crucial for any organization aiming to leverage its data for strategic decisions. The key to unlocking this potential lies in advanced data management practices, particularly in versioning and catalog management. This is where the combined power of Dremio’s Lakehouse Management features and Project Nessie&apos;s catalog-level versioning comes into play.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/intro-to-dremio-nessie-and-apache-iceberg-on-your-laptop/&quot;&gt;Blog: Try Dremio and Nessie on your laptop&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Imagine managing your data with the same flexibility and ease as code versioning in Git. That&apos;s the revolutionary idea behind Project Nessie. It brings Git-like semantics to data, enabling data teams to handle versioning at the catalog level with unprecedented ease. This approach to data versioning not only enhances data reliability and reproducibility but also opens up new possibilities for data experimentation and rollback, without the risk of data corruption or loss.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.dremio.com/blog/bi-dashboard-acceleration-cubes-extracts-and-dremios-reflections/&quot;&gt;Blog: BI Dashboard Acceleration with Dremio&apos;s Reflections&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://docs.dremio.com/cloud/arctic/&quot;&gt;Dremio’s Lakehouse Management features&lt;/a&gt; build upon Nessie&apos;s capabilities, offering a user-friendly interface that simplifies monitoring and managing the data catalog. The seamless integration with Project Nessie means that Dremio users can enjoy all the benefits of catalog versioning while leveraging a platform that is intuitive and easy to navigate.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=mDwpsg8btto&quot;&gt;Video: ZeroETL &amp;amp; Virtual Data Marts - Cutting Edge Data Lakehouse Engineering&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;One of the standout features of Dremio&apos;s Lakehouse Management is its &lt;a href=&quot;https://docs.dremio.com/cloud/arctic/automatic-optimization&quot;&gt;automated maintenance and cleanup of Apache Iceberg tables&lt;/a&gt;. This automation not only reduces the manual workload for data teams but also ensures that the data lakehouse remains efficient, organized, and free from redundant or obsolete data.&lt;/p&gt;
&lt;h3&gt;Catalog Versioning on the Dremio Lakehouse Platform&lt;/h3&gt;
&lt;p&gt;To truly appreciate the impact of these advancements in data management, let’s dive into a practical example. This example can be run in any Dremio environment with a self-managed Nessie catalog or an Arctic catalog from Dremio Cloud. After the full HR snippet below, we&apos;ll walk through the same branch-and-merge workflow step by step using a second, simplified sales-data scenario.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Creating the main employee data table in the default branch
CREATE TABLE HR_EmployeeData (
    employeeId INT,
    employeeName VARCHAR,
    department VARCHAR,
    salary FLOAT,
    startDate DATE
);

-- Creating a staging table for incoming employee data updates in the default branch
CREATE TABLE HR_StagingEmployeeData (
    employeeId INT,
    employeeName VARCHAR,
    department VARCHAR,
    salary FLOAT,
    startDate DATE
);

-- Inserting sample employee data into the staging table
INSERT INTO HR_StagingEmployeeData (employeeId, employeeName, department, salary, startDate) VALUES
(1, &apos;John Doe&apos;, &apos;Finance&apos;, 55000, &apos;2021-01-01&apos;),
(2, &apos;Jane Smith&apos;, &apos;Marketing&apos;, -48000, &apos;2022-01-02&apos;),  -- Negative salary (problematic)
(3, &apos;Alice Johnson&apos;, &apos;IT&apos;, 62000, &apos;2025-02-15&apos;);       -- Future start date (problematic)

-- Creating a new branch for data integration
CREATE BRANCH HR_dataIntegration_010224;

-- Switching to the dataIntegration branch
USE BRANCH HR_dataIntegration_010224;

-- Merging staging data into the EmployeeData table on the dataIntegration branch
MERGE INTO HR_EmployeeData AS target
USING HR_StagingEmployeeData AS source
ON target.employeeId = source.employeeId
WHEN MATCHED THEN
    UPDATE SET employeeName = source.employeeName, department = source.department, salary = source.salary, startDate = source.startDate
WHEN NOT MATCHED THEN
    INSERT (employeeId, employeeName, department, salary, startDate) VALUES (source.employeeId, source.employeeName, source.department, source.salary, source.startDate);

-- Performing data quality checks on the dataIntegration branch
-- Check for non-negative salaries
SELECT COUNT(*) AS InvalidSalaryCount
FROM HR_EmployeeData
WHERE salary &amp;lt; 0;

-- Check for valid start dates (not in the future)
SELECT COUNT(*) AS InvalidStartDateCount
FROM HR_EmployeeData
WHERE startDate &amp;gt; CURRENT_DATE;

-- QUERY MAIN BRANCH
SELECT * FROM HR_EmployeeData AT BRANCH main;

-- QUERY INGESTION BRANCH
SELECT * FROM HR_EmployeeData AT BRANCH HR_dataIntegration_010224;

-- Assuming checks have passed, switch back to the main branch and merge changes from dataIntegration
USE BRANCH main;
MERGE BRANCH HR_dataIntegration_010224 INTO main;

-- QUERY MAIN BRANCH
SELECT * FROM HR_EmployeeData AT BRANCH main;

-- QUERY INGESTION BRANCH
SELECT * FROM HR_EmployeeData AT BRANCH HR_dataIntegration_010224;

-- The checks for data quality (negative salaries and future start dates) are simplified for this example.
-- In a real-world scenario, more sophisticated validation logic and error handling would be required.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The walkthrough below applies the same workflow to a simplified sales-data scenario. We start by establishing two tables within our Dremio environment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;DACSalesData&lt;/strong&gt;: This is the main table where we store consolidated sales data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;DACStagingSalesData&lt;/strong&gt;: This staging table is used to manage incoming sales data before it&apos;s confirmed for integration into the main table.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE TABLE DACSalesData (id INT, productId INT, saleAmount FLOAT, saleDate DATE);
CREATE TABLE DACStagingSalesData (id INT, productId INT, saleAmount FLOAT, saleDate DATE);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These tables represent a typical data setup in a lakehouse, where data is ingested, staged, and then integrated.&lt;/p&gt;
&lt;p&gt;We simulate real-world data entries by inserting sample sales records into the &lt;code&gt;DACStagingSalesData&lt;/code&gt; table. This data includes various scenarios like standard sales, negative amounts (perhaps due to refunds or errors), and future-dated sales (possibly indicating scheduled transactions or data entry errors).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;INSERT INTO DACStagingSalesData (id, productId, saleAmount, saleDate) VALUES
(1, 101, 150.0, &apos;2022-01-01&apos;),
(2, 102, -50.0, &apos;2022-01-02&apos;),
(3, 103, 200.0, &apos;2025-01-03&apos;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s where Nessie’s branching model plays a pivotal role. We create a new branch called dataIntegration_010224 for integrating our staging data. This branch acts as a sandbox where we can safely test and validate our data before it affects the main dataset.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;CREATE BRANCH dataIntegration_010224;
USE BRANCH dataIntegration_010224;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This branching mechanism is akin to Git workflows, providing a safe space for data manipulation without impacting the main data branch.&lt;/p&gt;
&lt;p&gt;We use the MERGE INTO statement to integrate data from the staging table into the main sales data table.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;MERGE INTO DACSalesData AS target
USING DACStagingSalesData AS source
ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET productId = source.productId, saleAmount = source.saleAmount, saleDate = source.saleDate
WHEN NOT MATCHED THEN
    INSERT (id, productId, saleAmount, saleDate) VALUES (source.id, source.productId, source.saleAmount, source.saleDate);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before finalizing the integration, we perform critical data quality checks. We scrutinize the data for negative sales amounts and future-dated records, ensuring the integrity and accuracy of our sales data.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;SELECT COUNT(*) AS InvalidAmountCount FROM DACSalesData WHERE saleAmount &amp;lt; 0;
SELECT COUNT(*) AS InvalidDateCount FROM DACSalesData WHERE saleDate &amp;gt; CURRENT_DATE;
&lt;/code&gt;&lt;/pre&gt;
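&lt;p&gt;If either check returns a nonzero count, the integration branch can simply be discarded, and the main branch never sees the bad rows. The DROP BRANCH statement below is a sketch of Dremio&apos;s branch-management SQL; check your environment&apos;s documentation for the exact form:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Failure path: abandon the integration branch; main is untouched
USE BRANCH main;
DROP BRANCH dataIntegration_010224;
&lt;/code&gt;&lt;/pre&gt;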
&lt;p&gt;Upon successful validation, we switch back to the main branch and merge our verified data from the dataIntegration_010224 branch. This process highlights the strength of Nessie&apos;s versioning system, ensuring that our main dataset remains pristine and error-free.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;USE BRANCH main;
MERGE BRANCH dataIntegration_010224 INTO main;
&lt;/code&gt;&lt;/pre&gt;
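&lt;p&gt;A natural extension of this workflow is a safety net around the merge itself: tag the main branch before merging, so the entire catalog can be rolled back if a problem surfaces later. The tag name and the ALTER BRANCH syntax below are an illustrative sketch of catalog-level rollback, not verbatim from Dremio&apos;s documentation:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-sql&quot;&gt;-- Before merging, capture a zero-copy snapshot of main
USE BRANCH main;
CREATE TAG before_integration_010224;
MERGE BRANCH dataIntegration_010224 INTO main;

-- If a problem is later discovered, point main back at the tagged state
ALTER BRANCH main ASSIGN TAG before_integration_010224;
&lt;/code&gt;&lt;/pre&gt;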
&lt;p&gt;Through this example, we&apos;ve seen how Dremio and Project Nessie provide an efficient, reliable, and intuitive platform for managing and versioning data in a lakehouse architecture. The combination of Dremio&apos;s user-friendly interface and Nessie&apos;s robust versioning capabilities, including branching and merging, empowers data teams to handle complex data workflows with ease. This not only enhances data integrity but also accelerates the decision-making process, making it an invaluable asset in today&apos;s data-centric landscape.&lt;/p&gt;
</content:encoded><author>Alex Merced</author></item></channel></rss>