Overview
Ship enterprise-ready AI that is fast, affordable, and controllable with small language models engineered through quantization and distillation. Many teams want the benefits of language models, but cost, latency, and compliance block real progress. This book focuses on making production systems work on real infrastructure, with methods that lower memory use, improve tokens per second, and keep behavior auditable. You will see where small models beat larger ones, how to size fleets for peak demand, and how to align performance targets with budgets. The material is grounded in healthcare, finance, retail, and manufacturing examples, so the guidance maps cleanly to day-to-day decisions. You will learn practical approaches that move beyond proofs of concept. The book explains how to compress and serve models without losing essential quality, how to benchmark instruction following and safety, and how to meet obligations under current governance standards. Each topic connects to production tasks such as rollout planning, model monitoring, and incident response. The goal is clear: help you deploy reliable systems that meet service levels and cost controls.
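The memory savings described above come largely from weight quantization. Here is a minimal sketch of symmetric int8 weight-only quantization with a single per-tensor scale; it is an illustration only, not the book's code, and production tools such as GPTQ and AWQ instead use per-channel or per-group scales chosen with calibration data:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights to int8 plus one dequantization scale (assumed per-tensor)."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than fp32 (2x smaller than fp16)
print(w.nbytes // q.nbytes)                                    # 4
# round-to-nearest keeps the error under one quantization step
print(bool(np.abs(w - dequantize(q, scale)).max() < scale))    # True
```

The 4x shrink versus fp32 is exactly the memory reduction that lets a small model fit on cheaper hardware; the accuracy cost depends on how well one scale fits the weight distribution, which is why real methods localize the scales.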
- Apply weight-only quantization with INT8 or INT4 using GPTQ and AWQ
- Use activation quantization, including SmoothQuant and FP8
- Reduce long-context costs with KV-cache quantization and eviction
- Serve at scale with vLLM PagedAttention and continuous batching
- Tune TensorRT-LLM schedulers for throughput and tail latency
- Deploy Hugging Face TGI on Gaudi and Inferentia2
- Use speculative decoding and in-flight batching in production
- Plan hardware across H100, H200, and B200, and evaluate Gaudi 3
- Model tokens per second, TTFT, and end-to-end throughput
- Run edge and on-device with llama.cpp GGUF, MLC WebGPU, and Apple MLX
- Convert pipelines to GGUF, ONNX, DirectML, OpenVINO IR, and NNCF
- Evaluate with MT-Bench and IFEval, plus safety, multilingual, math, and code
- Map risks with the OWASP LLM Top 10 and set enterprise controls
- Operate under EU AI Act timelines and the NIST AI RMF profile
- Build logging, monitoring, canaries, autoscaling, and rollback plans

Code-heavy guide: includes working examples, configs, and commands that you can adapt to real services, from serving stacks to evaluation pipelines. Get the playbook for small language models in production and start building systems that are fast, cost-aware, and ready for enterprise use. Grab your copy today.

Full Product Details
Author: Talia Graham
Publisher: Independently Published
Imprint: Independently Published
Dimensions: Width: 17.80cm, Height: 1.50cm, Length: 25.40cm
Weight: 0.485kg
ISBN: 9798268181524
Pages: 278
Publication Date: 02 October 2025
Audience: General/trade
Format: Paperback
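The long-context and hardware-planning topics in the list above come together in one recurring calculation: how much memory the KV cache will need. A back-of-envelope sketch follows; the model shapes are assumptions (roughly an 8B-parameter model with grouped-query attention), not figures from the book:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, bytes_per_elem,
                   seq_len, batch):
    # 2 tensors per layer (keys and values), one entry per token per sequence
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch

# Assumed shapes: 32 layers, 8 KV heads, head dim 128 (not from the book)
fp16 = kv_cache_bytes(32, 8, 128, 2, seq_len=8192, batch=32)
int8 = kv_cache_bytes(32, 8, 128, 1, seq_len=8192, batch=32)

print(fp16 // 2**30)   # 32  GiB at fp16
print(int8 // 2**30)   # 16  GiB with int8 KV quantization
```

Halving the per-element size doubles the batch (or context length) that fits in the same memory, which is why KV-cache quantization and eviction appear alongside hardware planning.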