Amazon Textract: The Complete Guide to AWS Document Processing

Amazon Textract goes beyond OCR to extract structured data from invoices, forms, IDs, and loan packages using ML-powered APIs. This guide covers DetectDocumentText, AnalyzeDocument, Queries, AnalyzeExpense, AnalyzeID, Analyze Lending, Custom Queries, pricing, security, and a comparison with Azure AI Document Intelligence.

What Is Amazon Textract?

Every organization processes documents: invoices, receipts, contracts, tax forms, medical records, loan applications, identity documents. Traditionally, extracting structured data from them required either manual data entry or brittle OCR tools that needed reconfiguring whenever a form layout changed. Amazon Textract replaces both approaches with ML-powered document processing.

Amazon Textract is a fully managed machine learning service from Amazon Web Services that automatically extracts text, handwriting, layout elements, and structured data from scanned documents. Unlike traditional OCR that simply reads characters off a page, Amazon Textract understands document structure — it identifies tables, forms with key-value pairs, signatures, and the relationships between different parts of a document.

For example, feed Amazon Textract an invoice and it does not just read the text; it recognizes that “Invoice Number” is a label and “INV-2024-001” is its value. Similarly, it identifies that a table’s header row contains “Item,” “Quantity,” and “Price,” and that the rows below are associated data rather than disconnected text scattered across the page. This contextual understanding is what separates intelligent document processing from basic character recognition, and it is what makes Amazon Textract a foundation for automated document workflows on AWS.

Textract’s impact on operational efficiency can be substantial. Organizations that process documents manually (entering invoice data into ERP systems, transcribing medical forms into electronic records, or reviewing loan applications page by page) spend significant staff time on each document and introduce errors at every step. Textract returns results for a page in seconds and applies the same extraction logic to every document, a consistency that manual data entry struggles to match when processing thousands of documents daily.

Amazon Textract Capabilities at a Glance

  • 6 languages supported for text detection
  • 10 samples minimum for Custom Queries
  • 99%+ confidence on clear printed text

Amazon Textract supports multiple input formats, including PDF, PNG, JPEG, and TIFF. It handles both single-page and multi-page documents, processes printed and handwritten text, and returns results with confidence scores and bounding box coordinates for every extracted element. According to AWS, Textract is built on the same deep learning technology that Amazon’s computer vision scientists developed to analyze billions of images and videos daily.

Key Takeaway

Amazon Textract goes far beyond basic OCR. It understands document structure — tables, forms, key-value pairs, signatures, and layout — and returns structured, machine-readable data from virtually any document type. If your organization manually processes documents, Textract is the fastest path to automation on AWS.


How Amazon Textract Works

Amazon Textract operates as a serverless API service. You send a document (stored in S3 or as raw bytes), specify which type of analysis you need, and receive structured JSON results containing every detected element with its text, confidence score, and position on the page.

Under the hood, Textract’s ML models have been trained on millions of documents spanning dozens of industries and document types. Consequently, most documents you upload are recognized and processed without templates or configuration. AWS also continues to improve the models, so accuracy can get better over time without any action on your part.

Amazon Textract API Overview

Currently, Amazon Textract provides several specialized APIs, each designed for a different document processing task:

  • DetectDocumentText: The simplest API. It extracts all text from a document as words and lines. This is plain OCR, but powered by deep learning for higher accuracy on challenging inputs like handwriting, low-quality scans, and noisy backgrounds.
  • AnalyzeDocument: The core intelligence API. It extracts text plus structural elements: tables (rows, columns, cells), forms (key-value pairs), signatures, and layout elements (paragraphs, titles, headers, footers, lists). This is where Textract’s understanding of document structure sets it apart from basic OCR.
  • Queries: A feature within AnalyzeDocument that lets you ask natural language questions about a document (e.g., “What is the patient name?” or “What is the due date?”) and receive precise answers. Pre-trained on paystubs, bank statements, W-2s, loan applications, mortgage notes, and insurance cards.
  • AnalyzeExpense: Purpose-built for invoices and receipts. It identifies vendor names (even from logos without explicit labels), line items, quantities, prices, and totals, and returns normalized field names for consistent downstream processing across different invoice formats.
  • AnalyzeID: Purpose-built for identity documents. It extracts data from U.S. passports and driver’s licenses without templates, enabling automated identity verification, account creation, and KYC compliance workflows.
  • Analyze Lending: A managed workflow for mortgage loan packages. It classifies pages into document types (W-2, paystub, bank statement, tax return), routes each to the appropriate extraction API, and returns consolidated results across the entire loan package.
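
To make the Queries feature more concrete, here is a minimal Python sketch. It assumes the boto3 `analyze_document` request shape (`FeatureTypes`, `QueriesConfig`) and the `QUERY` / `QUERY_RESULT` block layout of the response. The helper names `build_queries_request` and `extract_query_answers` are illustrative and not part of any AWS SDK.

```python
from typing import Dict, Optional, Tuple


def build_queries_request(bucket: str, key: str, questions: Dict[str, str]) -> dict:
    """Build keyword arguments for analyze_document with the QUERIES feature.

    `questions` maps a short alias (used to label the answer) to the question text.
    """
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {
            "Queries": [
                {"Text": text, "Alias": alias} for alias, text in questions.items()
            ]
        },
    }


def extract_query_answers(response: dict) -> Dict[str, Optional[Tuple[str, float]]]:
    """Map each query alias to (answer text, confidence), or None if unanswered."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    answers = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "QUERY":
            continue
        # Fall back to the question text when no alias was supplied.
        alias = block["Query"].get("Alias") or block["Query"]["Text"]
        answers[alias] = None
        for relationship in block.get("Relationships", []):
            if relationship["Type"] != "ANSWER":
                continue
            for answer_id in relationship["Ids"]:
                result = blocks_by_id[answer_id]
                answers[alias] = (result["Text"], result["Confidence"])
    return answers
```

In use, you would pass the request to a real client, for example `boto3.client("textract").analyze_document(**build_queries_request(...))`, and feed the response to `extract_query_answers`.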

Synchronous vs Asynchronous Amazon Textract Processing

Amazon Textract offers two processing modes to match different application patterns. Synchronous APIs (DetectDocumentText, AnalyzeDocument) process single-page documents in real time and return results immediately, which suits interactive applications where users upload a document and expect instant feedback. Asynchronous APIs (StartDocumentTextDetection, StartDocumentAnalysis) handle multi-page documents: you submit a processing job, Textract can notify you via SNS when it finishes, and you fetch the results with the matching Get operation (GetDocumentTextDetection, GetDocumentAnalysis). This mode is designed for batch pipelines and large document workflows where an immediate response is not required.
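
The asynchronous flow can be sketched in a few lines of Python. For brevity this version polls for completion instead of waiting for the SNS notification, which is a simplification and an assumption about your setup. The function names are illustrative, and `client` is expected to be a boto3 Textract client.

```python
import time


def start_text_detection(client, bucket, key):
    """Submit a multi-page document in S3 and return the Textract job ID."""
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]


def collect_blocks(client, job_id, poll_seconds=5.0, sleep=time.sleep):
    """Poll until the job finishes, then gather blocks from every result page."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_document_text_detection(**kwargs)
        status = page["JobStatus"]
        if status == "IN_PROGRESS":
            sleep(poll_seconds)
            continue
        if status != "SUCCEEDED":
            # FAILED and PARTIAL_SUCCESS are both treated as errors in this sketch.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
        if not next_token:
            return blocks
```

Results for large documents are paginated, which is why the loop follows `NextToken` until it is absent. In production, replacing the polling with an SNS-triggered consumer avoids idle waiting.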

For production deployments, the most common architecture combines S3 event notifications with Lambda functions. When a document lands in an S3 bucket, Lambda triggers Textract analysis, processes the structured results, and stores extracted data in DynamoDB, RDS, or Amazon OpenSearch for downstream applications. This event-driven pattern scales automatically and requires zero infrastructure management.
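
As a rough illustration of the trigger half of this pattern, the sketch below reads an S3 event notification and starts an asynchronous analysis job for each uploaded object. The helper name `handle_s3_event` and the choice of `TABLES` and `FORMS` are assumptions for the example. Processing and storing the results (the second half of the pipeline) is left out.

```python
from urllib.parse import unquote_plus


def handle_s3_event(event, textract_client):
    """Start one Textract analysis job per object in an S3 event notification."""
    job_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        response = textract_client.start_document_analysis(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["TABLES", "FORMS"],
        )
        job_ids.append(response["JobId"])
    return job_ids


def lambda_handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the helper stays testable."""
    import boto3

    return {"job_ids": handle_s3_event(event, boto3.client("textract"))}
```

Passing the client in as a parameter keeps the event-parsing logic independent of AWS credentials, which makes it easy to unit test with a stub.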

Moreover, for high-volume document processing, AWS provides IDP (Intelligent Document Processing) CDK constructs — pre-built infrastructure templates that deploy a complete document processing pipeline with S3 ingestion, SQS queuing for throttle management, Lambda orchestration, Textract analysis, and result storage. These constructs implement production best practices including exponential backoff with jitter, comprehensive error handling, dead-letter queues for failed documents, and CloudWatch monitoring dashboards — saving weeks of development time on pipeline infrastructure that would otherwise need to be built from scratch.

Additionally, Textract integrates with Amazon Augmented AI (A2I) for human review workflows. When Textract’s confidence score falls below your defined threshold, A2I automatically routes the document to a human reviewer — either from your own team or through a managed workforce. After human review and correction, the validated data flows seamlessly back into your automated pipeline for downstream processing. This human-in-the-loop pattern is essential for high-stakes document processing where automated extraction errors carry significant business or compliance risk.
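
The threshold decision itself is simple to express. The following sketch splits Textract blocks into those accepted automatically and those that would be sent for human review. The 95% default and the function name are illustrative assumptions, and the actual hand-off to A2I is not shown.

```python
def split_by_confidence(blocks, threshold=95.0, block_types=("LINE",)):
    """Partition blocks into (accepted, needs_review) by confidence score.

    Only blocks whose BlockType is in `block_types` are considered; blocks
    without a Confidence value are treated as needing review.
    """
    accepted, needs_review = [], []
    for block in blocks:
        if block.get("BlockType") not in block_types:
            continue
        if block.get("Confidence", 0.0) >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review
```

A document with any entry in `needs_review` could then be routed to a reviewer, while fully accepted documents continue through the automated pipeline.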

Custom Queries for Amazon Textract

Additionally, Amazon Textract provides the ability to customize the pre-trained Queries feature using your own documents. Through the AWS Console, you can upload as few as ten sample documents, annotate the target data fields, and train a custom extraction model within hours. This is particularly valuable for industry-specific document types where the pre-trained models may not recognize specialized fields — such as extracting GST numbers from Indian invoices or policy numbers from insurance documents unique to your organization.

Importantly, Custom Queries maintains your data ownership and privacy throughout the training process. Your training documents and annotated data remain within your AWS account, and the resulting custom model is private to your organization. The trained model operates alongside Textract’s pre-trained capabilities, so you can use both standard and custom Queries in the same API call — combining the out-of-the-box accuracy of pre-trained models with the precision of your domain-specific customization.
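
Once an adapter is trained, calling it is a small addition to a normal Queries request. The sketch below assumes the `AdaptersConfig` parameter of `analyze_document` as documented for boto3; verify the exact shape against the current SDK reference. The adapter ID and version are placeholders you would copy from your own account, and the function name is illustrative.

```python
def build_custom_queries_request(bucket, key, questions, adapter_id, adapter_version):
    """Build analyze_document arguments that apply a trained Custom Queries adapter."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["QUERIES"],
        "QueriesConfig": {"Queries": [{"Text": question} for question in questions]},
        # Assumed shape: one entry per adapter, identified by ID and version.
        "AdaptersConfig": {
            "Adapters": [{"AdapterId": adapter_id, "Version": adapter_version}]
        },
    }
```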


Core Amazon Textract Features

Beyond the APIs described above, several capabilities make Amazon Textract particularly powerful for enterprise document processing. These features work together to handle virtually any document type — from simple single-page forms to complex multi-page financial reports with nested tables and mixed handwritten and printed content:

Table Extraction
Preserves table structure during extraction — rows, columns, cells, and headers maintain their relationships. Extracted table data can be loaded directly into databases, spreadsheets, or data pipelines with the original structure intact.
Form Key-Value Pair Extraction
Automatically identifies label-value pairs in forms (e.g., “First Name: Jane”). Distinguishes between keys and values without templates, enabling structured data capture from any form layout.
Handwriting Recognition
Detects and extracts handwritten text alongside printed text. Handles mixed documents where some fields are typed and others filled in by hand — common in healthcare intake forms, applications, and field reports.
Signature Detection
Detects the presence and location of signatures on documents. Essential for validating signed contracts, checks, consent forms, and loan agreements. Returns bounding box coordinates and confidence scores.
Layout Analysis
Extracts structural layout elements — paragraphs, titles, section headers, footers, page numbers, and lists. Preserves reading order and document hierarchy, making extracted content suitable for downstream NLP processing.
Natural Language Queries
Ask questions about documents in plain English (“What is the total amount due?”) and receive precise extracted answers. Pre-trained on common business document types with Custom Queries support for domain-specific fields.
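
To show what "preserved table structure" means in practice, here is a small sketch that rebuilds a table as a list of rows from an AnalyzeDocument response. It relies on the `TABLE` → `CELL` → `WORD` relationships and the 1-based `RowIndex` / `ColumnIndex` fields on cell blocks. Merged cells are ignored for simplicity, and the function names are illustrative.

```python
def cell_text(cell, blocks_by_id):
    """Join the WORD children of a CELL block into a single string."""
    words = []
    for relationship in cell.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def table_to_rows(table_block, blocks_by_id):
    """Return the table as a list of rows, each a list of cell strings."""
    cells = {}
    for relationship in table_block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for cell_id in relationship["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] == "CELL":
                position = (cell["RowIndex"], cell["ColumnIndex"])
                cells[position] = cell_text(cell, blocks_by_id)
    if not cells:
        return []
    row_count = max(row for row, _ in cells)
    column_count = max(column for _, column in cells)
    return [
        [cells.get((row, column), "") for column in range(1, column_count + 1)]
        for row in range(1, row_count + 1)
    ]
```

The resulting rows can be written to CSV or loaded into a DataFrame without further parsing.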



Amazon Textract Pricing Model

Amazon Textract uses pay-per-page pricing with no minimum commitments. Rather than listing specific dollar amounts that change over time, this section describes how the cost structure works:

Understanding Amazon Textract Cost Dimensions

  • Pages processed: You are charged per page, with separate rates for each API type. DetectDocumentText (plain OCR) is the cheapest. AnalyzeDocument (tables, forms, queries) costs more because of the structural analysis. AnalyzeExpense and AnalyzeID have their own per-page rates.
  • Feature combinations: Within AnalyzeDocument, you can enable multiple features (tables, forms, queries, signatures, layout) per call. Enabled features generally add to the per-page cost, so enable only the features you actually need for each document type.
  • Volume tiers: Per-page costs decrease as monthly volume increases. High-volume document processing workflows benefit significantly from this graduated pricing.
  • Custom Queries: Training custom extraction models incurs a one-time training cost. Inference using custom models has a separate per-page rate.
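
Graduated volume tiers are easiest to reason about with a small calculation. The sketch below computes a monthly bill from a list of tiers. The rates shown are made-up placeholders, not AWS prices, and the function assumes the last tier has no upper limit.

```python
def monthly_cost(pages, tiers):
    """Compute cost under graduated tiers.

    `tiers` is an ascending list of (upper_page_limit, price_per_page);
    use None as the limit of the final, open-ended tier.
    """
    cost = 0.0
    billed = 0  # pages already charged in lower tiers
    for limit, price_per_page in tiers:
        if billed >= pages:
            break
        upper = pages if limit is None else min(pages, limit)
        cost += (upper - billed) * price_per_page
        billed = upper
    return cost


# Placeholder rates for illustration only; see the official pricing page.
HYPOTHETICAL_TIERS = [(1000, 0.01), (None, 0.005)]
```

With these placeholder tiers, 1,500 pages are billed as 1,000 pages at the first rate plus 500 pages at the second, rather than all 1,500 at the lower rate.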
Cost Optimization Strategy

Match the API to the task. If you only need raw text, use DetectDocumentText and do not pay for AnalyzeDocument’s structural analysis. For invoices, use AnalyzeExpense rather than generic AnalyzeDocument, as it returns pre-normalized fields. Where you already classify documents by type, route each one to the cheapest API that can handle it. For current pricing by API and volume tier, see the official Textract pricing page.
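
One way to express this routing is a small planning function that maps a document type to a boto3 method name and its arguments. The document type labels and the function name are assumptions for this sketch. Note that `analyze_id` takes a `DocumentPages` list, while the other calls take a single `Document`.

```python
def plan_call(document_type, bucket, key):
    """Return (boto3 method name, keyword arguments) for a classified document."""
    s3_object = {"S3Object": {"Bucket": bucket, "Name": key}}
    kind = document_type.lower()
    if kind in ("invoice", "receipt"):
        return "analyze_expense", {"Document": s3_object}
    if kind in ("passport", "drivers_license"):
        return "analyze_id", {"DocumentPages": [s3_object]}
    if kind == "form":
        return "analyze_document", {"Document": s3_object, "FeatureTypes": ["FORMS"]}
    # Unknown types fall back to the cheapest option: plain text detection.
    return "detect_document_text", {"Document": s3_object}
```

A caller would then dispatch with `method, kwargs = plan_call(...)` followed by `getattr(client, method)(**kwargs)`.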


Amazon Textract Security and Compliance

Since Textract processes sensitive business documents — financial records, identity documents, medical forms, legal contracts — security is paramount.

Amazon Textract Data Protection

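All data processed by Amazon Textract is encrypted in transit (TLS) and at rest (AWS KMS). Synchronous calls return results directly in the response, while asynchronous job results are kept only for a limited retention period so that you can retrieve them. AWS’s AI service terms allow some services, Textract included, to retain content for service improvement unless you opt out through an AWS Organizations AI services opt-out policy, so review the current Textract data privacy FAQ before processing regulated documents. Textract also supports VPC endpoints via AWS PrivateLink, which keeps traffic between your VPC and Textract off the public internet. IAM policies provide fine-grained access control over which users and applications can call Textract APIs.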

Moreover, for organizations with strict data residency requirements, Textract processes documents in the AWS Region where you make the API call. Your documents never leave the selected Region during processing. Combined with S3 bucket policies, KMS encryption keys, and IAM access controls, this architecture ensures that sensitive document data remains within your defined security boundary at all times.

Additionally, Amazon Textract is HIPAA eligible, making it suitable for healthcare organizations processing medical records, insurance claims, and patient intake forms containing protected health information. It also supports SOC 1/2/3, PCI DSS, and ISO 27001 compliance standards. For financial services organizations processing loan documents, invoices, and tax forms, these certifications ensure regulatory compliance without additional infrastructure or audit burden.


What’s New in Amazon Textract

Amazon Textract continues to receive regular updates from AWS. Recently, the Layout analysis feature type was added, which extracts structural elements like paragraphs, titles, headers, footers, and lists — preserving the reading order and hierarchy of complex documents. This is particularly valuable for downstream NLP processing and content management systems where understanding document structure and reading order matters as much as the raw text content itself.

Additionally, Custom Queries now let organizations train extraction models on as few as 10 annotated samples, making domain-specific document processing accessible without ML expertise. Combined with the pre-trained Queries capability — which already covers paystubs, bank statements, W-2s, loan application forms, mortgage notes, claims documents, and insurance cards — organizations can handle both standard and proprietary document formats through a single, unified API.

Furthermore, Textract has improved its handwriting recognition accuracy and expanded support for mixed-content documents where printed and handwritten text coexist on the same page. For organizations in healthcare, insurance, and government — where handwritten annotations on printed forms are common — these improvements directly reduce the need for manual review and correction of extracted data.


Real-World Amazon Textract Use Cases

Given its versatility, Amazon Textract serves organizations across every industry that processes paper or digital documents. From financial services firms processing thousands of loan applications daily to healthcare organizations digitizing decades of patient records, Textract powers the transition from manual document handling to automated, scalable pipelines. Below are the use cases we implement most frequently for our enterprise clients:

Invoice and Receipt Processing
Automatically extract vendor names, line items, quantities, prices, totals, and dates from invoices and receipts using AnalyzeExpense. Normalize field names across different vendor formats for consistent downstream processing in accounts payable systems.
Mortgage and Loan Processing
Use Analyze Lending to automatically classify and extract data from entire loan packages — W-2s, paystubs, bank statements, tax returns — reducing mortgage processing time from days to minutes.
Healthcare Document Automation
Extract patient data from intake forms, insurance claims, pre-authorization forms, and medical records. HIPAA eligibility ensures compliance when processing protected health information.
Identity Verification
Extract data from passports and driver’s licenses using AnalyzeID for automated KYC compliance, account creation, and onboarding workflows — reducing manual identity verification from days to seconds.
Contract Analysis
Extract key terms, dates, parties, and clauses from contracts and legal documents. Combine with Amazon Comprehend for entity extraction and sentiment analysis on contract language.
Searchable Document Archives
Convert scanned paper archives into searchable, indexed digital libraries. Extract text from historical documents and store it in Amazon OpenSearch or CloudSearch for full-text search across your entire document repository.

Amazon Textract vs Azure AI Document Intelligence

If you are evaluating document processing services across cloud providers, here is how Amazon Textract compares with Microsoft’s Azure AI Document Intelligence (formerly Form Recognizer):

Capability | Amazon Textract | Azure AI Document Intelligence
Core OCR | Yes: DetectDocumentText with deep learning | Yes: Read API with advanced OCR
Table Extraction | Yes: preserves full table structure | Yes: table extraction with cell merging
Form Key-Value Pairs | Yes: automatic without templates | Yes: pre-built and custom models
Natural Language Queries | Yes: ask questions in plain English | Partial: field extraction with labels
Invoice Processing | Yes: AnalyzeExpense with normalization | Yes: pre-built invoice model
ID Document Processing | Yes: AnalyzeID (U.S. documents) | Yes: broader international ID support
Lending / Mortgage | Yes: Analyze Lending managed workflow | No: no equivalent managed workflow
Custom Model Training | Yes: Custom Queries (10+ samples) | Yes: custom models with labeling
Handwriting Recognition | Yes: mixed print and handwriting | Yes: mixed print and handwriting
Language Support | Partial: 6 languages (EN, ES, DE, FR, IT, PT) | Yes: 300+ languages for print OCR

Choosing the Right Amazon Textract Alternative

Clearly, both services are mature and capable for document processing. Ultimately, your cloud ecosystem determines the best fit. If you build on AWS, Textract’s native integration with S3, Lambda, SQS, and A2I makes it the natural choice for automated document pipelines. Conversely, if your infrastructure runs on Azure, Document Intelligence integrates natively with Azure Blob Storage, Logic Apps, and Azure AI Services.

Notably, Textract’s Analyze Lending workflow for mortgage processing has no equivalent in Azure — a significant differentiator for financial services organizations automating loan origination. Similarly, the natural language Queries feature provides a more intuitive extraction interface than Azure’s label-based field extraction for ad-hoc document analysis.

However, Azure holds clear advantages in two areas. First, language support: Azure supports 300+ languages for printed text OCR versus Textract’s 6 languages — a critical gap for global organizations processing multilingual documents. Second, international ID document coverage: Azure recognizes identity documents from many countries, while Textract’s AnalyzeID is currently limited to U.S. government-issued passports and driver’s licenses.

For organizations on AWS that need broader language support, a hybrid approach works well: use Textract for English-language document processing (where its structural understanding excels) and Amazon Bedrock with multimodal models for multilingual document analysis where Textract’s language limitations are a constraint.


Getting Started with Amazon Textract

Fortunately, Amazon Textract requires no setup — there are no models to deploy and no training required for the pre-built APIs. You call the API with your document and receive structured results immediately. The free tier provides enough capacity to validate your use case before committing to production-level spending.

Your First Amazon Textract API Call

Below is a minimal Python example that extracts text from a document stored in S3:

import boto3

# Initialize the Textract client
client = boto3.client('textract', region_name='us-east-1')

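# Extract text from a single-page document in S3 (synchronous call;
# multi-page PDFs need the asynchronous StartDocumentTextDetection API)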
response = client.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': 'my-documents',
            'Name': 'invoices/invoice-001.pdf'
        }
    }
)

# Print extracted lines of text
for block in response['Blocks']:
    if block['BlockType'] == 'LINE':
        print(f"{block['Text']} ({block['Confidence']:.1f}%)")

For structured extraction (tables, forms, queries), replace detect_document_text with analyze_document and specify the desired feature types. For invoice processing, use analyze_expense instead. For identity documents, use analyze_id. Each API returns structured JSON with confidence scores for the extracted elements, and the block-based responses also include bounding box coordinates. For more details and advanced patterns, see the Amazon Textract documentation.
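
As a follow-on sketch, this is roughly how key-value pairs can be read out of an `analyze_document` response when the `FORMS` feature is enabled. It assumes the `KEY_VALUE_SET` block layout, where key blocks link to value blocks through a `VALUE` relationship and to their words through `CHILD` relationships. The function names are illustrative.

```python
def text_of(block, blocks_by_id):
    """Join a block's WORD children; selection elements yield their status."""
    parts = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def extract_key_values(response):
    """Return {key text: value text} for every form field in the response."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "VALUE":
                value_text = " ".join(
                    text_of(blocks_by_id[value_id], blocks_by_id)
                    for value_id in relationship["Ids"]
                )
        pairs[text_of(block, blocks_by_id)] = value_text
    return pairs
```

The response would come from a call such as `client.analyze_document(Document=..., FeatureTypes=['FORMS'])`.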


Amazon Textract Best Practices and Pitfalls

Advantages
Goes beyond OCR — understands tables, forms, key-value pairs, and layout
Natural language Queries let you extract specific fields without templates
Purpose-built APIs for invoices (AnalyzeExpense), IDs, and lending workflows
Fully serverless with automatic scaling — no infrastructure to manage
Custom Queries trainable with as few as 10 sample documents
HIPAA eligible, SOC compliant, PCI DSS compliant
Limitations
Limited to 6 languages — not suitable for multilingual global deployments
Complex table structures can produce inconsistent extraction results
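Checkboxes and radio buttons (selection elements) are detected only through AnalyzeDocument’s Forms or Tables features, not through plain text detection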
Cloud-only — no offline processing option for air-gapped environments
AnalyzeID currently limited to U.S. government-issued documents

Recommendations for Amazon Textract Deployment

  • Classify documents before processing: Route different document types to the most appropriate API: invoices to AnalyzeExpense, IDs to AnalyzeID, general forms to AnalyzeDocument. This approach improves accuracy and reduces cost, since each API is optimized for its specific document type.
  • Set confidence thresholds for your use case: Textract returns confidence scores for every extracted element. For high-stakes applications (financial processing, compliance), flag extractions below 95% confidence for human review rather than processing them automatically. For lower-stakes applications (search indexing), 80% may be sufficient.
  • Implement retry logic with exponential backoff: Textract enforces per-account rate limits (transactions per second). Implement exponential backoff with jitter to handle throttling gracefully, especially during batch processing of large document volumes. AWS provides CDK constructs that implement these patterns out of the box.
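
Exponential backoff with jitter can be sketched as a small wrapper around any Textract call. This version uses "full jitter" (a random delay between zero and the current cap), which is one common variant and an assumption here. The error codes checked are the throttling codes Textract is documented to return, and the function names are illustrative.

```python
import random
import time

THROTTLE_CODES = {"ThrottlingException", "ProvisionedThroughputExceededException"}


def is_textract_throttle(error):
    """True if a botocore-style ClientError carries a throttling error code."""
    response = getattr(error, "response", None) or {}
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      is_retryable=is_textract_throttle,
                      sleep=time.sleep, rng=random.random):
    """Call `operation()` and retry throttled calls with full-jitter backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(error):
                raise
            # The cap doubles each attempt; the actual delay is random below it.
            delay_cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * delay_cap)
```

A typical call looks like `call_with_backoff(lambda: client.detect_document_text(Document=doc))`. Randomizing the delay keeps many workers from retrying in lockstep after a throttling burst.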

Architecture and Validation Best Practices

  • Use asynchronous APIs for multi-page documents: Synchronous APIs only support single-page documents. For PDFs with multiple pages, use the asynchronous Start/Get pattern with SNS notifications to process documents without blocking your application. Queue documents via SQS to smooth traffic and stay within rate limits.
  • Validate extracted data against business rules: ML-powered extraction is highly accurate but not infallible. Implement validation logic (checking that dates are valid, totals match line items, required fields are present, and numeric formats are consistent) to catch extraction errors before they propagate into downstream systems like ERP, CRM, or compliance databases.
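
A minimal sketch of such validation for an invoice might look like the following. The field names (`invoice_number`, `invoice_date`, `total`), the date format, and the rule that the total equals the sum of line items (ignoring tax and discounts) are all simplifying assumptions for illustration.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def validate_invoice(fields, line_item_amounts,
                     required=("invoice_number", "invoice_date", "total"),
                     date_format="%Y-%m-%d"):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for name in required:
        if not fields.get(name):
            errors.append(f"missing required field: {name}")
    if fields.get("invoice_date"):
        try:
            datetime.strptime(fields["invoice_date"], date_format)
        except ValueError:
            errors.append(f"invalid date: {fields['invoice_date']}")
    if fields.get("total"):
        try:
            # Strip thousands separators before parsing amounts exactly.
            total = Decimal(fields["total"].replace(",", ""))
            items = sum(
                (Decimal(amount.replace(",", "")) for amount in line_item_amounts),
                Decimal("0"),
            )
            if total != items:
                errors.append(f"total {total} does not match line items {items}")
        except InvalidOperation:
            errors.append("non-numeric amount in total or line items")
    return errors
```

Documents that return a non-empty error list can be held back or sent for human review instead of flowing into downstream systems.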
Key Takeaway

Amazon Textract transforms manual document processing into automated, scalable pipelines — extracting structured data from invoices, forms, IDs, and loan packages through purpose-built APIs. The key to successful deployment is matching the right API to each document type, setting appropriate confidence thresholds, and implementing business-rule validation for extracted data. An experienced AWS partner can help you design document processing architectures that maximize accuracy while minimizing cost.



Frequently Asked Questions About Amazon Textract

Common Questions Answered
What is Amazon Textract used for?
Amazon Textract is used for extracting structured data from documents: invoices, receipts, forms, contracts, tax documents, identity documents, medical records, and loan packages. It goes beyond basic OCR by understanding document structure: tables, key-value pairs, signatures, and layout elements. Common use cases include automated invoice processing in accounts payable, mortgage loan processing for financial institutions, healthcare document digitization for electronic health records, identity verification for KYC compliance, contract analysis for legal teams, and creating searchable document archives from paper records.
How is Amazon Textract different from OCR?
Traditional OCR reads characters from a page and returns raw text; it does not understand the structure of a document. In contrast, Amazon Textract uses deep learning to understand tables, forms, key-value pairs, signatures, and layout hierarchies. For example, Textract knows that “Due Date” is a label and “March 15, 2026” is its value, whereas OCR would simply return both as disconnected text strings. Textract can also answer natural language questions about documents and normalize invoice data across different vendor formats.
What languages does Amazon Textract support?
Currently, Amazon Textract supports text detection in English, Spanish, German, French, Italian, and Portuguese. For organizations requiring broader language support, Azure AI Document Intelligence supports 300+ languages for printed text OCR. Alternatively, Amazon Rekognition’s DetectText API supports additional languages for text in natural scenes, though it lacks Textract’s structural understanding of documents.

Technical and Integration Questions

Does Amazon Textract store my documents?
Not as a document repository. Textract analyzes the document and returns structured results with confidence scores and coordinates; your source files stay wherever you keep them, such as S3. Asynchronous job results are retained for a limited period so you can fetch them, and AWS may retain content from some AI services for service improvement unless you opt out via an AWS Organizations AI services opt-out policy. Data is encrypted in transit and at rest, and you can use VPC endpoints to keep document traffic off the public internet.
What is the difference between Amazon Textract and Amazon Rekognition for text?
Amazon Rekognition’s DetectText API is designed for text in natural scenes: street signs, product labels, and overlaid graphics in images and videos. In contrast, Amazon Textract is designed for document processing: invoices, forms, contracts, and other structured documents. Use Rekognition when you need to find text within photos or video frames, and use Textract when you need to extract structured data from business documents. They serve different use cases despite both involving text extraction.