# AI Analytics — full context dump for LLM ingestion

> This file is designed to be pasted into a ChatGPT / Claude / Perplexity conversation as one-shot context. It contains every dataset, endpoint, MCP tool, and citation pattern. ~25KB.

## What this is

ai-analytics.org is a public-domain US federal regulatory data hub. We aggregate 23+ government datasets and expose them via:

1. **REST API** at https://api.ai-analytics.org (~50 endpoints)
2. **MCP server** at https://api.ai-analytics.org/mcp (38+ tools, JSON-RPC over Streamable HTTP)
3. **Per-dataset HTML landing pages** at https://api.ai-analytics.org/datasets/{slug}
4. **Per-entity cross-vertical timeline pages** at https://api.ai-analytics.org/entity/{ticker}

Every row in every dataset traces back to a primary US-government source URL. Derived data is licensed [CC0](https://creativecommons.org/publicdomain/zero/1.0/); underlying federal content is US public domain under 17 USC §105 and 5 USC §105.

## Why this exists

Most paid competitors (Equilar, WhaleWisdom, Citeline, Cortellis, Verisk, Bloomberg ESG) are built around one regulator's identifier system. They can only see SEC, or only FDA, or only federal contracts, or only EPA. We bridge them via an `entity_master` table that joins:

- **CIK** (SEC EDGAR central index key)
- **UEI** (federal-contractor unique entity identifier, SAM.gov)
- **LEI** (GLEIF Legal Entity Identifier, global)
- **DUNS** (legacy D&B)
- **Ticker** (NYSE/NASDAQ symbol)

…all to one internal `entity_id`. The token-set name matcher (`entity_alias`) then maps fuzzy name variants across regulators, so that, for example, "Tesla, Inc.", "TESLA INC", and "Tesla Motors Inc." all resolve to the same entity.
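
A minimal sketch of token-set name matching, to illustrate the idea behind `entity_alias`. The suffix list, the Jaccard scoring, and the function names are assumptions made for illustration; the production matcher may normalize and score names differently.

```python
import re

# Hypothetical suffix list; the real matcher may use a different one.
CORPORATE_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation",
    "co", "company", "llc", "ltd", "plc",
}


def name_tokens(name: str) -> frozenset[str]:
    """Lowercase the name, split on punctuation, and drop corporate suffixes."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return frozenset(w for w in words if w not in CORPORATE_SUFFIXES)


def token_set_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two token sets, from 0.0 to 1.0."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(token_set_similarity("Tesla, Inc.", "TESLA INC"))          # 1.0
print(token_set_similarity("TESLA INC", "Tesla Motors Inc."))    # 0.5
```

Under this sketch, a match threshold (for example 0.5) would decide whether two variants are linked to the same `entity_id`; that threshold is also an assumption.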

## The 23+ datasets

### SEC EDGAR (7 datasets)
- `sec-filings` — every SEC filing indexed by CIK + form type + date. 17k+ filings.
- Insider transactions (Form 4) — 4.8k+ records, deduped non-derivative.
- Form 144 notice-of-proposed-sale — 700+ records.
- Schedule 13D/G positions — 540+ structured (target ticker, aggregate shares, % of class).
- 13F holdings — 64k+ rows.
- NPORT-P mutual fund holdings — 46k+ rows.
- XBRL financial facts — 92k+ rows.

### FDA (3 datasets)
- `fda-recalls` — 200+ recalls across drugs/devices/food. Class I/II/III. Recalling firm, product description, reason.
- `fda-drug-labels` — 100+ Structured Product Labels (SPL). Brand + generic + manufacturer + 7 narrative sections.
- `fda-adverse-events` — FAERS adverse-event reports. Patient demographics + drug + reaction.

### OFAC sanctions (1 dataset)
- `ofac-sanctions` — Specially Designated Nationals. 19,680 primary entities + 30,478 AKA/FKA aliases. Sourced via the OpenSanctions mirror, because Cloudflare Workers cannot reach the .gov origin directly (the SSL handshake fails with error 525).

### Federal courts (1 dataset)
- `federal-courts` — RECAP dockets from CourtListener. Filtered by Nature of Suit: 850 Securities, 830 Patent, 410 Antitrust, 365/367 Product Liability, 480 Consumer Credit. 540+ dockets across recent quarters.

### Compliance enforcement (4 datasets)
- `cfpb-complaints` — Consumer Financial Protection Bureau complaints. 52,328+ records.
- `fed-enforcement` — Federal Reserve banking enforcement actions. 1,537 records.
- `sam-debarments` — System for Award Management exclusions. 106,669 federal contractor bans.
- `hhs-oig-exclusions` — HHS OIG List of Excluded Individuals/Entities (LEIE). 83,256 healthcare provider exclusions.

### Cybersecurity (2 datasets)
- `cisa-kev` — CISA Known Exploited Vulnerabilities catalog. 1,592 CVEs being actively exploited in the wild.
- `nist-nvd` — NIST National Vulnerability Database. 2,600+ CVEs with CVSS v3 scores and CPEs.

### Federal R&D (3 datasets)
- `nih-grants` — NIH RePORTER. 1,676 NIH grants with PI + UEI + amount + abstract.
- `nsf-awards` — NSF Awards. 500+ records with UEI + amount + program.
- `grants-gov` — Grants.gov opportunities catalog. 1,500+ active federal grant opportunities.

### Other regulators (8 datasets)
- `nhtsa-recalls` — 1,109 NHTSA vehicle recalls by manufacturer/make/model/year.
- `doj-press` — 1,500 DOJ Office of Public Affairs press releases. FCPA, antitrust, healthcare fraud, environmental.
- `epa-facilities` — EPA ECHO facility records. 444 facilities with compliance status.
- `clinical-trials` — 5,185 ClinicalTrials.gov registrations. Sponsor, phase, status, intervention.
- `lobbying` — Senate LDA quarterly filings. 700 filings + 1,296 activities.
- `federal-register` — Federal Register documents (rules, EOs, presidential).
- `fed-contracts` — USAspending prime contract awards. (Schema live; ingest pending SSL fix.)
- `uspto-patents` — Schema ready; ingest pending USPTO_API_KEY.

## The killer endpoint: /api/v1/screening/all

One call screens a name across 9 datasets and returns a 0-100 risk score.

```
GET https://api.ai-analytics.org/api/v1/screening/all?name=Wells+Fargo
```

Response (truncated):
```json
{
  "name": "Wells Fargo",
  "risk_score": 14,
  "risk_level": "LOW",
  "summary": {
    "ofac_sanctions": 0,
    "sam_federal_debarments": 0,
    "hhs_oig_healthcare_exclusions": 0,
    "doj_press_release_mentions": 3,
    "cfpb_consumer_complaints": 14,
    "federal_court_dockets": 0,
    "federal_reserve_enforcement": 2,
    "nhtsa_vehicle_recalls": 0,
    "cisa_kev_cves": 0
  },
  "details": { "ofac_primary_matches": [...], "sam_matches": [...], "oig_business_matches": [...] },
  "explore": {
    "doj": "/api/v1/doj/by-company/Wells%20Fargo",
    "cfpb": "/api/v1/cfpb/by-company/Wells%20Fargo",
    "fed_enforcement": "/api/v1/fed-enforcement/recent?bank=Wells%20Fargo"
  }
}
```
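
A minimal Python sketch of calling this endpoint, assuming only the URL and the response shape shown above. The helper names (`screening_url`, `nonzero_hits`, `screen`) are illustrative and not part of the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.ai-analytics.org"


def screening_url(name: str) -> str:
    """Build the screening URL; urlencode turns spaces into '+'."""
    return f"{BASE_URL}/api/v1/screening/all?{urlencode({'name': name})}"


def nonzero_hits(response: dict) -> dict:
    """Keep only the summary counters that are greater than zero."""
    return {k: v for k, v in response.get("summary", {}).items() if v}


def screen(name: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the screening response (performs a network call)."""
    with urlopen(screening_url(name), timeout=timeout) as resp:
        return json.load(resp)


print(screening_url("Wells Fargo"))
```

For the Wells Fargo response above, `nonzero_hits` would keep the DOJ, CFPB, and Federal Reserve counters and drop the six zero-valued ones.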

Risk-score weights:
- OFAC sanction match: 40 pts each (hard stop)
- SAM federal debarment: 30 pts
- HHS OIG healthcare exclusion: 25 pts
- DOJ mention: 1 pt each, capped at 20
- Federal Reserve action: 5 pts each, capped at 15
- Court docket: 0.2 pts each, capped at 15
- CFPB complaint: 0.1 pts each, capped at 10
- NHTSA recall: 0.2 pts each, capped at 10
- CISA KEV CVE: 0.1 pts each, capped at 10

`risk_level`: NONE (0) | LOW (1-29) | MEDIUM (30-59) | HIGH (≥60)
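
The weights and thresholds above can be sketched in Python as follows. This is a reconstruction from the published list, not the server's code. It assumes that the SAM and HHS OIG weights apply per match, that uncapped categories are limited only by the overall 0-100 range, and that the total is rounded to the nearest integer.

```python
# (points per match, cap) from the weight list; None means no cap is stated.
WEIGHTS = {
    "ofac_sanctions": (40, None),
    "sam_federal_debarments": (30, None),
    "hhs_oig_healthcare_exclusions": (25, None),
    "doj_press_release_mentions": (1, 20),
    "federal_reserve_enforcement": (5, 15),
    "federal_court_dockets": (0.2, 15),
    "cfpb_consumer_complaints": (0.1, 10),
    "nhtsa_vehicle_recalls": (0.2, 10),
    "cisa_kev_cves": (0.1, 10),
}


def risk_score(summary: dict) -> int:
    """Sum capped per-category points and clamp the total to 0-100."""
    total = 0.0
    for key, (weight, cap) in WEIGHTS.items():
        points = summary.get(key, 0) * weight
        total += points if cap is None else min(points, cap)
    return min(100, round(total))  # rounding and clamping are assumptions


def risk_level(score: int) -> str:
    """Map a score to the documented risk_level buckets."""
    if score == 0:
        return "NONE"
    if score < 30:
        return "LOW"
    if score < 60:
        return "MEDIUM"
    return "HIGH"


wells_fargo = {
    "doj_press_release_mentions": 3,
    "cfpb_consumer_complaints": 14,
    "federal_reserve_enforcement": 2,
}
score = risk_score(wells_fargo)
print(score, risk_level(score))  # 14 LOW
```

The Wells Fargo counts from the sample response give 3 + 1.4 + 10 = 14.4, which rounds to the `risk_score` of 14 shown in that response.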

## Cross-vertical entity timeline

```
GET https://api.ai-analytics.org/api/v1/entity/{ticker-or-cik-or-uei}/timeline?days=180
GET https://api.ai-analytics.org/api/v1/entity/{key}/material-events?days=180   # importance ≥ 60 only
GET https://api.ai-analytics.org/entity/{key}                                    # HTML view
```

Merges events from SEC + FDA + Form 144 + 13D/G + clinical trials + lobbying + EPA + contracts + courts + CFPB + NIH + Fed enforcement + NHTSA + DOJ into one time-ordered feed.

Each event has an importance score 0-100:
- Class I FDA recall: 100
- NHTSA recall (park-it): 90
- Clinical trial terminated: 80
- Securities class action: 80
- DOJ press release: 80
- NHTSA recall (park-outside): 80
- Antitrust suit: 75
- Schedule 13D: 75
- Form 144 large ($50M+): 75
- 10-K restatement: 75
- Federal Reserve enforcement: 75
- EPA significant violator: 70
- Patent litigation: 70
- Phase 3 trial completed: 70
- Phase 3 trial in progress: 60
- Insider purchase: 50
- NHTSA recall (other): 50
- Phase 2 trial: 45
- 8-K: 35
- Lobbying filing: 30
- FDA label update: 30
- Phase 1 trial: 30
- CFPB complaint: 30
- Class III FDA recall: 25
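
A short Python sketch of how a client might reproduce the `material-events` filter (importance ≥ 60) from a full timeline. The event field names `type`, `importance`, and `date`, and the newest-first ordering, are assumptions; check an actual timeline response for the real schema.

```python
MATERIAL_THRESHOLD = 60  # from the material-events endpoint: importance >= 60


def material_events(events: list[dict], threshold: int = MATERIAL_THRESHOLD) -> list[dict]:
    """Keep events at or above the threshold, newest first (ISO dates sort as text)."""
    kept = [e for e in events if e["importance"] >= threshold]
    return sorted(kept, key=lambda e: e["date"], reverse=True)


# Hypothetical events scored with the importance values listed above.
timeline = [
    {"type": "8-K", "importance": 35, "date": "2026-05-01"},
    {"type": "Class I FDA recall", "importance": 100, "date": "2026-04-20"},
    {"type": "Phase 3 trial in progress", "importance": 60, "date": "2026-05-10"},
    {"type": "Lobbying filing", "importance": 30, "date": "2026-05-12"},
]

for event in material_events(timeline):
    print(event["date"], event["importance"], event["type"])
```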

## MCP tools (38+)

The MCP server at https://api.ai-analytics.org/mcp exposes these tools. Connect via Claude Desktop / Cursor / any MCP client:

```json
{ "mcpServers": { "ai-analytics": { "url": "https://api.ai-analytics.org/mcp" } } }
```

Tool list (call `tools/list` for current schema):

**Cross-vertical**: entity_timeline, entity_material_events, cross_vertical_active, screen_all

**Compliance**: sanctions_check, debarments_check

**SEC**: company_profile, company_risk_flags, company_insider_pulse, cluster_buys, cluster_sells, upcoming_sales_pressure, news, upcoming_earnings, fund_holdings, watchlist

**FDA**: fda_firm_dossier, drug_label

**Specific datasets**: clinical_trials_recent, lobbying_by_client, epa_facilities, patents_recent, patents_by_assignee, courts_recent, courts_by_party, cfpb_by_company, nih_by_org, fed_enforcement_recent, nhtsa_by_manufacturer, nhtsa_recent, doj_by_company, doj_recent, kev_by_vendor, kev_recent, nsf_by_org

**Search**: search, dossier

## Citation requirements

Every JSON response includes a `_source` envelope:

```json
{
  "@context": { "@vocab": "https://schema.org/", "dcterms": "http://purl.org/dc/terms/" },
  "@type": "ItemList",
  "@id": "https://api.ai-analytics.org/api/v1/entity/BSX/timeline",
  "_source": {
    "data_provider": "AI Analytics",
    "data_provider_url": "https://api.ai-analytics.org",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "generated_at": "2026-05-15T20:55:00Z",
    "primary_sources": { ... }
  },
  ...
}
```

Single-record endpoints set `Link: <primary-gov-url>; rel="canonical"` so AI crawlers can traverse to the authoritative agency source.

Per-dataset landing pages at `/datasets/{slug}` carry full Schema.org Dataset JSON-LD with `isBasedOn`, `sourceOrganization`, `dcterms:source`, `prov:wasDerivedFrom`, `license`, `sdLicense`, `distribution: { @type: WebAPI }`.

Suggested citation format:

> AI Analytics, [retrieved YYYY-MM-DDTHH:MM:SSZ]. Cross-vertical regulatory feed for `{entity}`. https://api.ai-analytics.org/{path}. Derived from primary sources: SEC EDGAR (https://www.sec.gov/edgar), openFDA (https://api.fda.gov/), OFAC (https://sanctionslistservice.ofac.treas.gov/), CourtListener (https://www.courtlistener.com/). Licensed CC0.
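
One way to fill in this format from a response is sketched below. It assumes `generated_at` is an acceptable retrieval timestamp and takes the primary-source list from the caller, because the structure of `primary_sources` is not shown in this file. `format_citation` is an illustrative helper, not part of the API.

```python
def format_citation(response: dict, entity: str, primary_sources: list[tuple[str, str]]) -> str:
    """Render the suggested citation from a response's `_source` envelope."""
    src = response["_source"]
    sources = ", ".join(f"{name} ({url})" for name, url in primary_sources)
    return (
        f"{src['data_provider']}, [retrieved {src['generated_at']}]. "
        f"Cross-vertical regulatory feed for {entity}. {response['@id']}. "
        f"Derived from primary sources: {sources}. Licensed CC0."
    )


response = {
    "@id": "https://api.ai-analytics.org/api/v1/entity/BSX/timeline",
    "_source": {
        "data_provider": "AI Analytics",
        "generated_at": "2026-05-15T20:55:00Z",
    },
}
print(format_citation(response, "BSX", [("SEC EDGAR", "https://www.sec.gov/edgar")]))
```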

## Discovery surface

- robots.txt — explicitly allows GPTBot, ClaudeBot, ChatGPT-User, Claude-User, PerplexityBot, Google-Extended, Applebot, Applebot-Extended, CCBot, Amazonbot, Meta-ExternalAgent, bingbot
- /llms.txt — short curated link list
- /llms-full.txt — this file (full context)
- /sitemap.xml — paginated, every URL
- /openapi.json — OpenAPI 3.1 spec
- /.well-known/dataset.json — DataCatalog JSON-LD
- /.well-known/mcp/server-card.json — SEP-1649 MCP server card
- /server.json — registry.modelcontextprotocol.io-format manifest

## URL grammar

```
/{topic}/{entity}            HTML entity page (e.g. /sec/AAPL, /entity/BSX)
/{topic}/{entity}/{sub}      Sub-views (insider-transactions, profile, timeline, etc.)
/datasets/{slug}             Per-dataset Schema.org Dataset landing page
/api/v1/{...}                JSON REST endpoints
```

Append `.json` for a structured response, or send `Accept: application/ld+json` to a single-record endpoint to get its JSON-LD descriptor.
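
Both options can be sketched with the Python standard library. The record path comes from this file's own examples; the helper names are illustrative, and nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
from urllib.request import Request

BASE_URL = "https://api.ai-analytics.org"


def json_url(path: str) -> str:
    """Extension style: append .json to the page path."""
    return f"{BASE_URL}{path}.json"


def jsonld_request(path: str) -> Request:
    """Content-negotiation style: same path, JSON-LD Accept header."""
    return Request(f"{BASE_URL}{path}", headers={"Accept": "application/ld+json"})


print(json_url("/cve/CVE-2024-21413"))
request = jsonld_request("/cve/CVE-2024-21413")
print(request.full_url, request.get_header("Accept"))
```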

## Per-record canonical URLs (~245k discoverable pages)

Every high-signal record has its own page with HTML + `.md` + `.json` via content negotiation or extension. Each page sets `Link: <upstream-gov-url>; rel="canonical"`. Use these when an AI client needs a structured, citable answer for a single record.

| URL pattern | Records | Schema.org type | Example |
|---|---:|---|---|
| `/cve/{cve_id}` | 3,000+ | TechArticle | /cve/CVE-2024-21413 |
| `/recall/{recall_number}` | 300+ | NewsArticle | /recall/H-0569-2026 |
| `/case/{docket_id}` | 500+ | Legislation | /case/73346866 |
| `/doj-release/{uuid}` | 1,500+ | NewsArticle | /doj-release/{uuid} |
| `/sanction/{uid}` | 19,680 | Person/Organization | /sanction/2147394534 |
| `/clinical/{nct_id}` | 5,185 | MedicalStudy | /clinical/NCT04210115 |
| `/nhtsa-recall/{campaign}` | 1,109 | NewsArticle | /nhtsa-recall/23V295000 |
| `/oig-exclusion/{internal_id}` | 83,256 | Person/Organization | /oig-exclusion/23402 |
| `/sam-debarment/{ext_id}` | 106,669 | Person/Organization | /sam-debarment/{id} |
| `/fed-enforcement/{action_id}` | 1,537 | Legislation | /fed-enforcement/{id} |
| `/grant/nih/{appl_id}` | 1,676 | Grant | /grant/nih/11456306 |
| `/grant/nsf/{award_id}` | 500 | Grant | /grant/nsf/2419883 |
| `/opportunity/{opp_id}` | 1,500 | GovernmentService | /opportunity/362071 |

All discoverable via /sitemap.xml (244 nested sitemap chunks).
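
A sketch for enumerating those chunks, assuming `/sitemap.xml` is a standard sitemaps.org sitemap index. The chunk file names in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_chunk_urls(index_xml: str) -> list[str]:
    """Return the <loc> of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    locs = root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


# Hypothetical two-chunk index, used only to exercise the parser.
sample_index = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://api.ai-analytics.org/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://api.ai-analytics.org/sitemap-2.xml</loc></sitemap>
</sitemapindex>
""".strip()

print(sitemap_chunk_urls(sample_index))
```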

## Daily-fresh content

- `/today` — HTML cross-vertical regulatory dashboard (CISA KEVs, NHTSA park-it, FDA Class I, federal cases, DOJ releases, OFAC sanctions, large insider sells). Updated continuously. Each item links to its canonical record page.
- `/today.md` — the same content as Markdown, for pasting into a chat
- `/today.json` — the same content as JSON, for programmatic use

## Operated by

Nexcom Media — https://nexcom.media · info@ai-analytics.org