Small Language Models in Production: Optimizing inference, reducing costs, and delivering enterprise-ready AI with quantization and distillation methods

Author:   Talia Graham
Publisher:   Independently Published
ISBN:   9798268181524

Pages:   278
Publication Date:   02 October 2025
Format:   Paperback
Availability:   Available To Order
We have confirmation that this item is in stock with the supplier. It will be ordered in for you and dispatched immediately.

Our Price:   $79.17



Overview

Ship enterprise-ready AI that is fast, affordable, and controllable with small language models engineered through quantization and distillation. Many teams want the benefits of language models, but costs, latency, and compliance block real progress. This book focuses on making production systems work on real infrastructure, with methods that lower memory use, improve tokens per second, and keep behavior auditable. You will see where small models beat larger ones, how to size fleets for peak demand, and how to align performance targets with budgets. The material is grounded in healthcare, finance, retail, and manufacturing examples, so the guidance maps cleanly to day-to-day decisions.

You will learn practical approaches that move beyond proofs of concept. The book explains how to compress and serve models without losing essential quality, how to benchmark instruction following and safety, and how to meet obligations under current governance standards. Each topic connects to production tasks such as rollout planning, model monitoring, and incident response. The goal is clear: help you deploy reliable systems that meet service levels and cost controls.
- Apply weight-only quantization with int8 or int4 using GPTQ and AWQ
- Use activation quantization, including SmoothQuant and FP8
- Reduce long-context costs with KV cache quantization and eviction
- Serve at scale with vLLM PagedAttention and continuous batching
- Tune TensorRT-LLM schedulers for throughput and tail latency
- Deploy Hugging Face TGI on Gaudi and Inferentia2
- Use speculative decoding and in-flight batching in production
- Plan hardware across H100, H200, and B200, and evaluate Gaudi 3
- Model tokens per second, TTFT, and end-to-end throughput
- Run edge and on-device with llama.cpp GGUF, MLC WebGPU, and Apple MLX
- Convert pipelines to GGUF, ONNX, DirectML, OpenVINO IR, and NNCF
- Evaluate with MT-Bench and IFEval, plus safety, multilingual, math, and code
- Map risks with the OWASP LLM Top 10 and set enterprise controls
- Operate under EU AI Act timelines and the NIST AI RMF profile
- Build logging, monitoring, canaries, autoscaling, and rollback plans

Code-heavy guide: includes working examples, configs, and commands that you can adapt to real services, from serving stacks to evaluation pipelines. Get the playbook for small language models in production and start building systems that are fast, cost-aware, and ready for enterprise use. Grab your copy today.
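To give a flavor of the first topic above, here is a minimal sketch of weight-only int8 quantization. This is not GPTQ or AWQ themselves (both calibrate against activation statistics); it is plain symmetric per-row rounding in NumPy, with all function names chosen for illustration:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-row weight-only int8 quantization.

    Returns int8 weights plus a per-row float scale such that
    w is approximately q * scale (broadcast over columns).
    """
    # Per-row max-abs sets the scale; 127 is the int8 positive max.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct an approximate float matrix from int8 weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # rounding error, bounded by scale / 2
```

Production quantizers refine this idea with per-group scales, calibration data, and error-compensating weight updates, but the storage-and-scale pattern is the same.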

Full Product Details

Author:   Talia Graham
Publisher:   Independently Published
Imprint:   Independently Published
Dimensions:   Width: 17.80cm, Height: 1.50cm, Length: 25.40cm
Weight:   0.485kg
ISBN:   9798268181524


Pages:   278
Publication Date:   02 October 2025
Audience:   General/Trade, General
Format:   Paperback
Publisher's Status:   Active
Availability:   Available To Order

Table of Contents

Reviews

Author Information



Countries Available

All regions