Building an analytics architecture for unstructured data and multimodal AI

Data scientists today face a perfect storm: an explosion of inconsistent, unstructured, multimodal data scattered across silos – and mounting pressure to turn it into accessible, AI-ready insights. The challenge isn't just handling diverse data types; it's building scalable, automated processes to prepare, analyze, and use that data effectively.

Many organizations fall into predictable traps when updating their data pipelines for AI. The most common: treating data preparation as a series of one-off tasks rather than designing for repeatability and scale. For example, hardcoding product categories in advance can make a system brittle and hard to adapt to new products. A more flexible approach is to infer categories dynamically from unstructured content, like product descriptions, using a foundation model, allowing the system to evolve with the business.
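
To make the dynamic approach concrete, here is a minimal BigQuery SQL sketch that asks a foundation model to infer a category from each product description using ML.GENERATE_TEXT; the project, dataset, model, and column names are hypothetical.

```sql
-- Sketch: infer a category per product from its description, rather than
-- hardcoding a category list. All names here are hypothetical.
SELECT
  product_id,
  ml_generate_text_llm_result AS inferred_category
FROM ML.GENERATE_TEXT(
  MODEL `proj.catalog.gemini_model`,  -- remote model over a foundation-model endpoint
  (
    SELECT
      product_id,
      CONCAT('In one or two words, name the product category for: ',
             description) AS prompt
    FROM `proj.catalog.products`
  ),
  STRUCT(0.0 AS temperature, TRUE AS flatten_json_output)
);
```

Because categories are derived at query time from the descriptions themselves, new product lines flow through without a schema or code change.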

Forward-looking teams are rethinking pipelines with adaptability in mind. Market leaders use AI-powered analytics to extract insights from this diverse data, transforming customer experiences and operational efficiency. The shift demands a tailored, priority-based approach to data processing and analytics that embraces the diverse nature of modern data, while optimizing for different computational needs across the AI/ML lifecycle.

Tooling for unstructured and multimodal data projects

Different data types benefit from specialized approaches. For example:

  • Text analysis leverages contextual understanding and embedding capabilities to extract meaning;
  • Video processing pipelines employ computer vision models for classification;
  • Time-series data relies on forecasting engines (see the forecasting sketch below).
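
For the time-series case, a minimal sketch of in-warehouse forecasting with BigQuery ML's ARIMA_PLUS model; the project, dataset, and column names are hypothetical.

```sql
-- Sketch: train a forecasting model directly over warehouse tables.
CREATE OR REPLACE MODEL `proj.analytics.daily_sales_forecast`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'order_date',
  time_series_data_col = 'daily_revenue'
) AS
SELECT order_date, SUM(order_total) AS daily_revenue
FROM `proj.sales.orders`
GROUP BY order_date;

-- Forecast the next 30 days with 90% prediction intervals.
SELECT forecast_timestamp, forecast_value,
       prediction_interval_lower_bound,
       prediction_interval_upper_bound
FROM ML.FORECAST(MODEL `proj.analytics.daily_sales_forecast`,
                 STRUCT(30 AS horizon, 0.9 AS confidence_level));
```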

Platforms must match workloads to optimal processing methods while maintaining data access, governance, and resource efficiency.

Consider text analytics on customer support data. Initial processing might use lightweight natural language processing (NLP) for classification. Deeper analysis could employ large language models (LLMs) for sentiment detection, while production deployment might require specialized vector databases for semantic search. Each stage requires different computational resources, yet all must work together seamlessly in production.
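
As a sketch of what the production stage might look like in BigQuery, the following embeds support tickets with ML.GENERATE_EMBEDDING and serves semantic search with VECTOR_SEARCH; all project, dataset, model, and column names are hypothetical.

```sql
-- Sketch: embed tickets once, then serve semantic search over them.
CREATE OR REPLACE TABLE `proj.support.ticket_embeddings` AS
SELECT
  ticket_id,
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `proj.support.text_embedding_model`,
  (SELECT ticket_id, ticket_text AS content FROM `proj.support.tickets`),
  STRUCT(TRUE AS flatten_json_output)
);

-- Retrieve the five tickets semantically closest to an ad hoc query.
SELECT base.ticket_id, distance
FROM VECTOR_SEARCH(
  TABLE `proj.support.ticket_embeddings`, 'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL `proj.support.text_embedding_model`,
      (SELECT 'refund delayed after cancellation' AS content),
      STRUCT(TRUE AS flatten_json_output))
  ),
  top_k => 5);
```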

Representative AI workloads

Real-time NLP classification
  • Storage: In-memory data stores; vector databases for embedding storage
  • Network: Low latency (<100 ms); moderate bandwidth
  • Compute: GPU-accelerated inference; high-memory CPUs for preprocessing and feature extraction
  • Scaling: Horizontal scaling for concurrent requests; memory scales with vocabulary size

Textual data analysis
  • Storage: Document-oriented and vector databases for embeddings; columnar storage for metadata
  • Network: Batch-oriented, high-throughput networking for large-scale data ingestion and analysis
  • Compute: GPU or TPU clusters for model training; distributed CPUs for ETL and data preparation
  • Scaling: Storage grows linearly with dataset size; compute costs scale with token count and model complexity

Media analysis
  • Storage: Scalable object storage for raw media; caching layer for frequently accessed datasets
  • Network: Very high bandwidth; streaming support
  • Compute: Large GPU clusters for training; inference-optimized GPUs
  • Scaling: Storage costs increase rapidly with media volume; batch processing helps manage compute scaling

Temporal forecasting and anomaly detection
  • Storage: Time-partitioned tables; hot/cold storage tiering for efficient data management
  • Network: Predictable bandwidth; time-window batching
  • Compute: Often CPU-bound; memory scales with time-window size
  • Scaling: Partitioning by time ranges enables efficient scaling; compute requirements grow with the prediction window

Note: Comparative resource requirements for representative AI workloads across storage, network, compute, and scaling. Source: Google Cloud

The different data types and processing stages call for different technology choices. Each workload needs its own infrastructure, scaling methods, and optimization strategies. This variety shapes today’s best practices for handling AI-bound data:

  • Use in-platform AI assistants to generate SQL, explain code, and understand data structures. This can dramatically speed up initial prep and exploration phases. Combine this with automated metadata and profiling tools to reveal data quality issues before manual intervention is required.
  • Execute all data cleaning, transformation, and feature engineering directly within your core data platform using its query language (see the sketch after this list). This eliminates data movement bottlenecks and the overhead of juggling separate preparation tools.
  • Automate data preparation workflows with version-controlled pipelines inside your data environment to ensure reproducibility and free you to focus on modeling rather than scripting.
  • Take advantage of serverless, auto-scaling compute platforms so your queries, transformations, and feature engineering tasks run efficiently at any data volume. Serverless platforms let you focus on transformation logic rather than infrastructure.
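
As a sketch of the in-platform approach from the second bullet, the following derives customer features entirely in SQL, with no data leaving the warehouse; the tables and columns are hypothetical.

```sql
-- Sketch: cleaning and feature engineering in one in-platform query.
CREATE OR REPLACE TABLE `proj.analytics.customer_features` AS
SELECT
  customer_id,
  COUNT(*) AS order_count,
  AVG(order_total) AS avg_order_value,
  DATE_DIFF(CURRENT_DATE(), MAX(order_date), DAY) AS days_since_last_order
FROM `proj.sales.orders`
WHERE order_total IS NOT NULL  -- drop malformed rows at the source
GROUP BY customer_id;
```

Checked into version control and scheduled inside the platform, a statement like this becomes a reproducible pipeline step rather than a one-off script.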

These best practices apply to structured and unstructured data alike. Contemporary platforms can expose images, audio, and text through structured interfaces, allowing summarization and other analytics via familiar query languages. Some can transform AI outputs into structured tables that can be queried and joined like traditional datasets.

By treating unstructured sources as first-class analytics citizens, you can integrate them more cleanly into workflows without building external pipelines. 

Today’s architecture for tomorrow’s challenges

Effective modern data architecture operates within a central data platform that supports diverse processing frameworks, eliminating the inefficiencies of moving data between tools. Increasingly, this includes direct support for unstructured data through familiar languages like SQL, allowing teams to treat outputs like customer support transcripts as queryable tables that can be joined with structured sources like sales records, without building separate pipelines.

As foundational AI models become more accessible, data platforms are embedding summarization, classification, and transcription directly into workflows, enabling teams to extract insights from unstructured data without leaving the analytics environment. Some, like Google Cloud BigQuery, have introduced rich SQL primitives, such as AI.GENERATE_TABLE(), to convert outputs from multimodal datasets into structured, queryable tables without requiring bespoke pipelines.
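
For example, here is a minimal sketch of that pattern: extracting structured facts from support transcripts with AI.GENERATE_TABLE() and joining them with sales data. The dataset, model, and column names are hypothetical.

```sql
-- Sketch: turn raw call transcripts into a structured, joinable table.
CREATE OR REPLACE TABLE `proj.support.call_facts` AS
SELECT call_id, product, issue, sentiment
FROM AI.GENERATE_TABLE(
  MODEL `proj.support.gemini_model`,
  (
    SELECT
      call_id,
      CONCAT('Extract the product, the issue, and the customer sentiment ',
             'from this transcript: ', transcript) AS prompt
    FROM `proj.support.call_transcripts`
  ),
  STRUCT('product STRING, issue STRING, sentiment STRING' AS output_schema)
);

-- The extracted facts now join like any other table.
SELECT f.product, f.sentiment, SUM(s.order_total) AS revenue
FROM `proj.support.call_facts` AS f
JOIN `proj.sales.orders` AS s USING (product)
GROUP BY f.product, f.sentiment;
```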

AI and multimodal data are reshaping analytics. Success requires architectural flexibility: matching tools to tasks on a unified foundation. As AI becomes more embedded in operations, that flexibility becomes critical to maintaining velocity and efficiency.

Learn more about these capabilities and start working with multimodal data in BigQuery.
