Overview

Your production AI systems are failing right now, and your monitoring stack cannot see it. Every dashboard is green. Latency is within SLO. The inference endpoint returns a 200. But the fraud model trained on pre-pandemic data is scoring against a distribution that no longer exists. The recommendation engine drifted three sprints ago and nobody noticed. The LLM-powered support assistant started hallucinating policy details after a prompt template was promoted without regression testing.

These are not hypothetical scenarios. They are live production incidents happening across every industry, and traditional DevOps observability was never designed to catch them. The gap between what your infrastructure metrics report and what your models are actually doing is where silent failures live, where revenue leaks, where compliance violations accumulate, and where trust erodes one undetected prediction at a time.

Inside this book, readers will learn how to:

- Instrument the five-layer AI observability stack covering infrastructure, data pipeline, model behavior, output quality, and business outcome telemetry for full production visibility
- Detect model drift before it causes damage using statistical methods like PSI and KS tests with threshold design and automated alerting pipelines
- Monitor large language models in production including hallucination detection, prompt regression testing, evaluator-as-judge pipelines, and token-level cost attribution
- Build observability for agentic AI systems with tool-call tracing, multi-step workflow instrumentation, and agent safety patterns
- Design SLOs for non-deterministic systems that go beyond RED and USE metrics to capture the failure modes that actually matter for machine learning
- Implement governance and compliance as code with immutable audit logging, tamper-evident event stores, and alignment to SR 11-7, EU AI Act, and HIPAA
- Operationalize FinOps for AI workloads by instrumenting unit-cost telemetry across GPU compute, inference endpoints, and LLM token consumption
- Diagnose and resolve silent failures using structured failure taxonomies, root-cause analysis, and incident response playbooks built for probabilistic systems
- Integrate OpenTelemetry into ML infrastructure to unify traces, metrics, and logs across training pipelines, feature stores, and serving endpoints

This is not a strategy deck. This is a working reference for engineers and architects who carry production responsibility for AI infrastructure. Every chapter delivers concrete instrumentation patterns, failure taxonomies, runbook templates, and architecture decisions grounded in operational experience. Whether you are a staff ML engineer debugging a silent accuracy regression, a platform engineer designing an observability stack, or an SRE writing SLOs for your first model endpoint, this book gives you patterns you can ship this sprint.

The AI systems in your organization today are making predictions that affect revenue, risk, customer trust, and regulatory standing. The models powering those predictions degrade silently. Feature pipelines break without alerts. LLMs hallucinate with full confidence. Agentic workflows take actions no human reviewed. The teams that instrument observability across all five layers will catch failures before customers do. Those relying on infrastructure metrics alone will discover problems after damage compounds.

Production AI deserves production-grade observability. This is your engineering playbook. Open it now.

Full Product Details

Author: Jordan Louis-Charles
Publisher: Cybersoft Publishing LLC
Imprint: Cybersoft Publishing LLC
Dimensions: Width: 15.20cm, Height: 1.90cm, Length: 22.90cm
Weight: 0.472kg
ISBN: 9798904980078
Pages: 354
Publication Date: 30 April 2026
Audience: General/trade, General
Format: Paperback
Publisher's Status: Active
Availability: Available To Order

We have confirmation that this item is in stock with the supplier. It will be ordered in for you and dispatched immediately.

Countries Available: All regions