Data scientists today face a perfect storm: an explosion of inconsistent, unstructured, multimodal data scattered across silos, and mounting pressure to turn it into accessible, AI-ready insights. The challenge isn't just the diversity of the data; it's the need for scalable, automated processes to prepare, analyze, and use it effectively.
Many organizations fall into predictable traps when updating their data pipelines for AI. The most common: treating data preparation as a series of one-off tasks rather than designing for repeatability and scale. For example, hardcoding product categories in advance can make a system brittle and hard to adapt to new products. A more flexible approach is to infer categories dynamically from unstructured content, like product descriptions, using a foundation model, allowing the system to evolve with the business.
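In a platform like BigQuery, that dynamic categorization pattern might look like the following rough sketch. The remote model, table, and column names are hypothetical, and the option names follow BigQuery ML's documented conventions, which may vary by version:

```sql
-- Rough sketch: infer product categories from free-text descriptions with a
-- foundation model, instead of maintaining a hardcoded category list.
-- `mydataset.category_model` is a hypothetical remote (Gemini-backed) model;
-- `mydataset.products` is a hypothetical source table.
SELECT *
FROM ML.GENERATE_TEXT(
  MODEL `mydataset.category_model`,
  (
    SELECT
      product_id,
      CONCAT('Return a single product category for this description: ',
             description) AS prompt
    FROM `mydataset.products`
  ),
  STRUCT(0.2 AS temperature, TRUE AS flatten_json_output)
);
```

Because the categories come from the model at query time, new product types flow through without schema or code changes.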
Forward-looking teams are rethinking pipelines with adaptability in mind. Market leaders use AI-powered analytics to extract insights from this diverse data, transforming customer experiences and operational efficiency. The shift demands a tailored, priority-based approach to data processing and analytics that embraces the diverse nature of modern data, while optimizing for different computational needs across the AI/ML lifecycle.
Tooling for unstructured and multimodal data projects
Different data types benefit from specialized approaches. For example:
- Text analysis leverages contextual understanding and embedding capabilities to extract meaning;
- Video processing pipelines employ computer vision models for classification;
- Time-series analysis relies on forecasting engines.
Platforms must match workloads to optimal processing methods while maintaining data access, governance, and resource efficiency.
Consider text analytics on customer support data. Initial processing might use lightweight natural language processing (NLP) for classification. Deeper analysis could employ large language models (LLMs) for sentiment detection, while production deployment might require specialized vector databases for semantic search. Each stage requires different computational resources, yet all must work together seamlessly in production.
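As a sketch of the later stages, the same warehouse can store ticket embeddings and serve semantic search over them. The model, table, and column names below are hypothetical, and the signatures follow BigQuery's documented ML.GENERATE_EMBEDDING and VECTOR_SEARCH conventions, which may differ across releases:

```sql
-- Embed support tickets once and store the vectors alongside the source rows.
-- `mydataset.embedding_model` is a hypothetical text-embedding remote model.
CREATE OR REPLACE TABLE `mydataset.ticket_embeddings` AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `mydataset.embedding_model`,
  (SELECT ticket_id, ticket_text AS content FROM `mydataset.support_tickets`)
);

-- Semantic search: embed the query phrase inline and find the closest tickets.
SELECT
  base.ticket_id,
  base.content,
  distance
FROM VECTOR_SEARCH(
  TABLE `mydataset.ticket_embeddings`,   -- vectors stored in the warehouse
  'ml_generate_embedding_result',        -- column holding the embeddings
  (
    SELECT ml_generate_embedding_result
    FROM ML.GENERATE_EMBEDDING(
      MODEL `mydataset.embedding_model`,
      (SELECT 'refund still not processed after cancellation' AS content)
    )
  ),
  top_k => 5
);
```

Each step runs in the same environment, even though the classification, embedding, and retrieval stages have very different resource profiles.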
Representative AI Workloads
| AI Workload Type | Storage | Network | Compute | Scaling Characteristics |
|---|---|---|---|---|
| Real-time NLP classification | In-memory data stores; Vector databases for embedding storage | Low latency (<100 ms); Moderate bandwidth | GPU-accelerated inference; High-memory CPU for preprocessing and feature extraction | Horizontal scaling for concurrent requests; Memory scales with vocabulary |
| Textual data analysis | Document-oriented databases and vector databases for embeddings; Columnar storage for metadata | Batch-oriented, high-throughput networking for large-scale data ingestion and analysis | GPU or TPU clusters for model training; Distributed CPU for ETL and data preparation | Storage grows linearly with dataset size; Compute costs scale with token count and model complexity |
| Media analysis | Scalable object storage for raw media; Caching layer for frequently accessed datasets | Very high bandwidth; Streaming support | Large GPU clusters for training; Inference-optimized GPUs | Storage costs increase rapidly with media data; Batch processing helps manage compute scaling |
| Temporal forecasting, anomaly detection | Time-partitioned tables; Hot/cold storage tiering for efficient data management | Predictable bandwidth; Time-window batching | Often CPU-bound; Memory scales with time window size | Partitioning by time ranges enables efficient scaling; Compute requirements grow with prediction window |
The different data types and processing stages call for different technology choices. Each workload needs its own infrastructure, scaling methods, and optimization strategies. This variety shapes today’s best practices for handling AI-bound data:
- Use in-platform AI assistants to generate SQL, explain code, and understand data structures. This can dramatically speed up initial prep and exploration phases. Combine this with automated metadata and profiling tools to reveal data quality issues before manual intervention is required.
- Execute all data cleaning, transformation, and feature engineering directly within your core data platform using its query language (see the sketch after this list). This eliminates data movement bottlenecks and the overhead of juggling separate preparation tools.
- Automate data preparation workflows with version-controlled pipelines inside your data environment to ensure reproducibility and free yourself to focus on modeling rather than scripting.
- Take advantage of serverless, auto-scaling compute platforms so your queries, transformations, and feature engineering tasks run efficiently for any data volume. Serverless platforms allow you to focus on transformation logic rather than infrastructure.
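As a minimal illustration of the in-platform approach from the second bullet, cleaning and feature engineering can live in a SQL view that is checked into source control rather than in an external script. The table and column names here are hypothetical:

```sql
-- Hypothetical example: cleaning and feature engineering kept inside the
-- warehouse as a reusable, version-controllable view.
CREATE OR REPLACE VIEW `mydataset.support_ticket_features` AS
SELECT
  ticket_id,
  LOWER(TRIM(ticket_text)) AS ticket_text_clean,                -- normalize text
  TIMESTAMP_DIFF(closed_at, opened_at, HOUR) AS hours_to_close, -- engineered feature
  IFNULL(priority, 'unspecified') AS priority                   -- fill missing values
FROM `mydataset.support_tickets`
WHERE ticket_text IS NOT NULL;
```

Keeping definitions like this under version control provides the reproducibility the third practice calls for, while a serverless engine handles the scaling described in the fourth.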
These best practices apply to structured and unstructured data alike. Contemporary platforms can expose images, audio, and text through structured interfaces, allowing summarization and other analytics via familiar query languages. Some can transform AI outputs into structured tables that can be queried and joined like traditional datasets.
By treating unstructured sources as first-class analytics citizens, you can integrate them more cleanly into workflows without building external pipelines.
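In BigQuery, for example, one way to expose raw media through a structured interface is an object table over files in Cloud Storage. This is a sketch only; the connection name and bucket path are placeholders:

```sql
-- Hypothetical example: an object table exposes media files in Cloud Storage
-- as rows that can be queried, governed, and joined like any other table.
CREATE EXTERNAL TABLE `mydataset.product_media`
WITH CONNECTION `us.my_connection`           -- placeholder connection
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://my-bucket/product-media/*']  -- placeholder bucket path
);

-- File metadata such as uri, content_type, and size becomes queryable.
SELECT uri, content_type, size
FROM `mydataset.product_media`
WHERE content_type LIKE 'image/%';
```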
Today’s architecture for tomorrow’s challenges
Effective modern data architecture centers on a single data platform that supports diverse processing frameworks, eliminating the inefficiencies of moving data between tools. Increasingly, this includes direct support for unstructured data through familiar languages like SQL, allowing teams to treat sources such as customer support transcripts as queryable tables that can be joined with structured sources like sales records, without building separate pipelines.
As foundational AI models become more accessible, data platforms are embedding summarization, classification, and transcription directly into workflows, enabling teams to extract insights from unstructured data without leaving the analytics environment. Some, like Google Cloud BigQuery, have introduced rich SQL primitives, such as AI.GENERATE_TABLE(), to convert outputs from multimodal datasets into structured, queryable tables without requiring bespoke pipelines.
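A sketch of what that can look like, assuming a Gemini-backed remote model and a hypothetical transcripts table (exact option names may differ across releases):

```sql
-- Sketch: turn free-form support transcripts into a structured, queryable
-- table with AI.GENERATE_TABLE. Model and table names are hypothetical.
SELECT *
FROM AI.GENERATE_TABLE(
  MODEL `mydataset.gemini_model`,
  (
    SELECT CONCAT('Summarize this support call and extract the product ',
                  'and customer sentiment: ', transcript) AS prompt
    FROM `mydataset.support_transcripts`
  ),
  STRUCT('summary STRING, product STRING, sentiment STRING' AS output_schema)
);
```

The resulting columns can then be joined directly with structured sources such as sales records, without a bespoke pipeline in between.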
AI and multimodal data are reshaping analytics. Success requires architectural flexibility: matching tools to tasks in a unified foundation. As AI becomes more embedded in operations, that flexibility becomes critical to maintaining velocity and efficiency.
Learn more about these capabilities and start working with multimodal data in BigQuery.