Data Handling and Retention¶
Last updated: April 17, 2026
This document describes how the service handles customer data operationally. It is a practical companion to the Privacy Policy and is intended for readers who want the infrastructure-level detail behind the privacy claims.
Goal¶
Operate the API in a way that is useful for customers while minimizing unnecessary retention of submitted documents. The async service necessarily uses temporary staged storage to bridge upload, processing, and retrieval. All retention is bounded, enforced by infrastructure lifecycle rules, and documented publicly.
Data types¶
1. Customer-submitted PDFs¶
- uploaded directly to S3 using a presigned PUT URL after POST /uploads
2. Derived outputs¶
- tagged PDF returned to the customer
- structured workflow metadata (processing outcome, signals, review guidance)
- review-oriented metadata that helps indicate when human follow-up may still be appropriate
3. Job metadata¶
- job ID, status, timestamps, error details, workflow metadata
- stored in DynamoDB with TTL-based automatic expiration
4. Operational metadata¶
- request timestamp, request ID, latency, status code, error category, file size, page count
Handling model¶
| Data | Storage | Retention | Enforcement |
|---|---|---|---|
| Source PDF | S3 (inputs/{job_id}.pdf) | 3 days | S3 lifecycle rule (infrastructure-level) |
| Tagged PDF | S3 (outputs/{job_id}.pdf) | 3 days | S3 lifecycle rule (infrastructure-level) |
| Job metadata | DynamoDB | 3 days | DynamoDB TTL (expires_at attribute) |
| Queue messages | SQS | 1 day | SQS message retention policy |
| Dead-letter messages | SQS DLQ | 14 days | SQS message retention policy |
| Logs | CloudWatch | 30 days | CloudWatch log-group retention |
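As an illustration of how the DynamoDB row in the table could translate into a stored value, the sketch below computes an expires_at timestamp from a job's creation time. The `RETENTION` mapping and the `expires_at` helper are hypothetical names introduced for this example, not the service's actual code. The one external fact relied on is that DynamoDB TTL reads a Number attribute holding Unix epoch seconds.

```python
from datetime import datetime, timedelta, timezone

# Retention windows copied from the Handling model table.
# The dictionary keys are illustrative names, not identifiers from the service.
RETENTION = {
    "source_pdf": timedelta(days=3),
    "tagged_pdf": timedelta(days=3),
    "job_metadata": timedelta(days=3),
    "queue_message": timedelta(days=1),
    "dead_letter_message": timedelta(days=14),
    "logs": timedelta(days=30),
}


def expires_at(created_at: datetime, data_type: str = "job_metadata") -> int:
    """Return the retention boundary for a record as Unix epoch seconds."""
    if created_at.tzinfo is None:
        # Naive datetimes would be interpreted in local time; refuse them.
        raise ValueError("created_at must be timezone-aware")
    return int((created_at + RETENTION[data_type]).timestamp())


created = datetime(2026, 4, 17, 12, 0, tzinfo=timezone.utc)
print(expires_at(created))  # epoch seconds, three days after `created`
```

Storing the boundary on the item itself means the expiry decision travels with the record, so no separate cleanup job has to know the retention policy.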
Why infrastructure-level enforcement matters¶
Retention is enforced by S3 lifecycle rules and DynamoDB TTL — not by application code. This means:
- artifacts are deleted even if the application has bugs, crashes, or is not running
- retention windows cannot be accidentally extended by application changes
- the enforcement mechanism is auditable via Terraform configuration
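For readers who want to see what such a rule looks like, the sketch below builds a lifecycle configuration in the shape accepted by the S3 lifecycle API (for example, the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`). The service defines its rules in Terraform, so this Python form is only an illustration; the function name and rule IDs are hypothetical, and the prefixes and 3-day window are taken from the Handling model table.

```python
import json


def lifecycle_configuration(days: int = 3, prefixes=("inputs/", "outputs/")) -> dict:
    """Build one expiration rule per key prefix (rule IDs are made up here)."""
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.rstrip('/')}-after-{days}d",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
            for prefix in prefixes
        ]
    }


print(json.dumps(lifecycle_configuration(), indent=2))
```

Because the rule lives on the bucket rather than in the application, it keeps applying regardless of which application version is deployed.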
Storage security¶
| Control | Implementation |
|---|---|
| Encryption at rest | AES-256 server-side encryption (S3) |
| Public access | Blocked (S3 public access block on all settings) |
| Access control | IAM-scoped roles; only Lambda and worker task roles have read/write access |
| Transport | TLS for all API communication; internal AWS service calls go through the AWS SDK, which uses TLS |
| Presigned upload URLs | Time-limited (15 minutes); scoped to a single input object key |
| Presigned download URLs | Time-limited (1 hour); scoped to a single output object key |
Queue message content¶
SQS messages contain only job references (job_id, input_key), not document content. Document data flows through S3, not through the queue.
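A minimal sketch of that constraint, assuming the message body is JSON (the encoding is not stated in this document). The helper names are hypothetical; only the field names job_id and input_key come from the text above. The parser rejects any message carrying extra fields, which is one way to keep document content from slipping into the queue.

```python
import json

# The only fields a queue message is expected to carry.
ALLOWED_FIELDS = {"job_id", "input_key"}


def build_job_message(job_id: str, input_key: str) -> str:
    """Serialize a job reference; no document bytes are included."""
    return json.dumps({"job_id": job_id, "input_key": input_key})


def parse_job_message(body: str) -> dict:
    """Parse a message body and reject anything beyond a job reference."""
    message = json.loads(body)
    if not isinstance(message, dict):
        raise ValueError("message body must be a JSON object")
    extra = set(message) - ALLOWED_FIELDS
    missing = ALLOWED_FIELDS - set(message)
    if extra or missing:
        raise ValueError(f"unexpected={sorted(extra)} missing={sorted(missing)}")
    return message


body = build_job_message("job-123", "inputs/job-123.pdf")
print(parse_job_message(body))
```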
Client-visible lifecycle semantics¶
The staged-upload model means customers should expect several distinct time windows:
- the upload URL is short-lived and may need refresh if the client waits too long before uploading
- the job remains upload_pending until the client finalizes a successfully staged object
- the tagged result remains retrievable only within the bounded artifact-retention window
- result download URLs are themselves presigned and shorter-lived than the overall artifact-retention window
This distinction matters operationally: a live job record does not imply a still-valid upload URL, and a completed job does not imply a permanently valid download URL.
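On the client side, one way to act on this is to check a presigned URL's own expiry before using it. The sketch below assumes the URLs are standard AWS Signature Version 4 query-signed URLs, which carry `X-Amz-Date` and `X-Amz-Expires` parameters; this document does not state the signing scheme, so treat that as an assumption. The function names, the example bucket host, and the 30-second safety margin are illustrative.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_url_expiry(url: str) -> datetime:
    """Return when a SigV4 presigned URL stops being valid (UTC)."""
    query = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(
        query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(query["X-Amz-Expires"][0]))
    return signed_at + lifetime


def is_expired(url: str, now: datetime = None,
               safety_margin: timedelta = timedelta(seconds=30)) -> bool:
    """Treat the URL as expired slightly early to allow for transfer time."""
    now = now or datetime.now(timezone.utc)
    return now >= presigned_url_expiry(url) - safety_margin


# Hypothetical upload URL with a 15-minute (900 s) lifetime.
example_url = (
    "https://example-bucket.s3.amazonaws.com/inputs/job-123.pdf"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-Date=20260417T120000Z&X-Amz-Expires=900&X-Amz-Signature=abc"
)
print(presigned_url_expiry(example_url).isoformat())
```

A client that finds the URL expired would request a fresh one rather than retrying the upload or download against the stale URL.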
Logging policy¶
API Gateway access logs¶
API Gateway emits one structured JSON line per HTTP request. The log format is pinned by Terraform and captures only:
- request ID
- source IP (typically a RapidAPI proxy IP when accessed via the marketplace)
- timestamp
- HTTP method and route key
- response status code
- response length
- integration latency
API Gateway access logs do not include request headers, API key values, request bodies, response bodies, or any document content.
Application logs (Lambda and worker)¶
Application logs include:
- request ID, job ID, status transitions
- success / failure outcome and error category
- file size, page count, processing latency
Application logs do not include:
- document content or extracted text
- PDF binary data
- presigned URL values
- API key values
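One common way to uphold a policy like this is an allowlist: only named fields ever reach the log line, so anything else is dropped by default. The sketch below is illustrative and does not describe the service's actual logging code; the field names and the `safe_log_line` helper are assumptions chosen to mirror the lists above.

```python
import json

# Hypothetical field names mirroring what application logs may contain.
ALLOWED_LOG_FIELDS = {
    "request_id", "job_id", "status", "outcome", "error_category",
    "file_size_bytes", "page_count", "latency_ms",
}


def safe_log_line(event: dict) -> str:
    """Serialize only allowlisted fields; record names (not values) of the rest."""
    record = {k: v for k, v in event.items() if k in ALLOWED_LOG_FIELDS}
    dropped = sorted(set(event) - ALLOWED_LOG_FIELDS)
    if dropped:
        record["dropped_fields"] = dropped
    return json.dumps(record, sort_keys=True)


line = safe_log_line({
    "request_id": "req-1",
    "job_id": "job-123",
    "status": "completed",
    "page_count": 12,
    # This value must never be logged; the allowlist drops it.
    "download_url": "https://example.invalid/out.pdf?X-Amz-Signature=secret",
})
print(line)
```

An allowlist fails safe: a newly added field stays out of the logs until someone deliberately permits it, whereas a denylist would leak it by default.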
Retention¶
All CloudWatch log groups for the service (API Gateway access logs, Lambda logs, worker logs) use a 30-day retention setting, enforced at the infrastructure level.
Internal access¶
Access to submitted files, outputs, and logs is limited to authorized operators with a legitimate operational need. IAM policies enforce least-privilege access.
Deletion¶
All artifacts are automatically deleted by infrastructure lifecycle rules. No manual deletion is required for normal operation.
Retention windows are listed in the Handling model table above. Note that S3 lifecycle rules run asynchronously (typically once per day), so actual object deletion may occur up to roughly 24 hours after the retention boundary is crossed. DynamoDB TTL similarly deletes items on a background schedule after the TTL timestamp has elapsed, so an expired job record may still be readable for some time after its expires_at value has passed.
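To make the S3 timing concrete: AWS documents that lifecycle expiration is calculated by adding the rule's number of days to the object's creation time and rounding the result up to the next midnight UTC, with the actual removal happening asynchronously after that. The sketch below computes that eligibility time. The function name is hypothetical, and the handling of a boundary that lands exactly on midnight is an assumption, since the documented wording does not spell out that edge case.

```python
from datetime import datetime, time, timedelta, timezone


def s3_expiration_eligible_at(created_at: datetime, days: int = 3) -> datetime:
    """Earliest time an object becomes eligible for lifecycle expiration."""
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    boundary = created_at.astimezone(timezone.utc) + timedelta(days=days)
    midnight = datetime.combine(boundary.date(), time.min, tzinfo=timezone.utc)
    if boundary == midnight:
        # Assumption: an exact-midnight boundary is not pushed a further day.
        return midnight
    # Otherwise round up to the next midnight UTC.
    return midnight + timedelta(days=1)


created = datetime(2026, 4, 17, 10, 30, tzinfo=timezone.utc)
print(s3_expiration_eligible_at(created).isoformat())
```

For an object created at 10:30 UTC on April 17 with a 3-day rule, the raw boundary is 10:30 UTC on April 20, and the rounded eligibility time is midnight UTC at the start of April 21.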
Training and secondary use¶
Customer documents are not used for model training or unrelated analytics.
Service posture¶
The service is an automation aid for native/text-layer PDF remediation workflows. Some outputs — especially figure-description cases without credible source text or caption context — may still require source-side or human remediation.