sec-cli: a fast CLI for SEC filings, built for LLM workflows

sec-cli is a CLI that pulls SEC EDGAR filings and turns them into structured output you can actually pipe into a language model. No API key. No paid service. It compiles to a single Go binary, with a Python wrapper if you’d rather stay in Python.

go install github.com/kritidutta01/sec-cli/cmd/sec-cli@latest

export SEC_CLI_USER_AGENT="Your Name your@email.com"  # EDGAR requires this

sec-cli get AAPL --section "Risk Factors" --output md
sec-cli diff NVDA --from 2023 --to 2024 --output md

That’s it for the quick start.

The problem it solves

EDGAR is the most important public financial dataset in the world. It’s also a nightmare to work with programmatically. Modern 10-K filings from large companies run 200+ pages of iXBRL-embedded HTML where financial figures are scattered across the document, table structure is encoded in presentation linkbase XML shipped alongside the filing, and section boundaries have to be inferred from inline CSS because there are no semantic tags to rely on.

The existing options either hand you a raw HTML dump you have to parse yourself, or put you behind a paid API where you’re trusting someone else’s extraction logic and can’t audit it.

sec-cli closes that gap. The output is something you’d be comfortable pasting directly into a Claude or GPT context window.

How it compares

The existing options fall into three buckets.

Paid API services charge $49 to several hundred dollars per month. You get structured output, but the extraction logic is a black box. When a number is wrong, you cannot trace it back to the source tag. For LLM workflows where auditability matters, that is a fundamental constraint.

edgartools (Python, open source) is the most credible open source alternative. It handles company lookup, filing metadata, and basic text extraction well. For financial tables, it walks the HTML table structure rather than reading the iXBRL fact stream. That works until it hits nested headers, merged cells, footnote markers inside cells, or inconsistent column spans, all of which produce silently wrong output. There is no confidence signal, so you cannot tell from the output when the extraction degraded.

DIY scripting around the SEC’s raw EDGAR API is what most practitioners end up doing. It works for one filing at a time, written by hand. It does not generalize, produces no consistent schema, and has no accuracy signal.

sec-cli does three things differently.

It reads the iXBRL fact stream directly, the same data layer FactSet and Bloomberg read, rather than guessing structure from HTML layout. Every financial figure in a modern filing is explicitly tagged with its GAAP concept, reporting period, and scale. Reading those tags sets a higher accuracy ceiling by construction.

Every table carries an explicit confidence signal. When extraction degrades, you know before you pipe the output anywhere. Downstream consumers, whether a human auditing the numbers or an LLM reasoning over them, can act on that signal rather than discovering the problem later.

The diff layer has no open source equivalent that I know of. No other tool I have found compares two years of the same 10-K at structural, lexical, and (once v1.0.1 lands) semantic granularity, with financial table rows aligned by GAAP concept rather than visual position.

Single Go binary. No API key, no paid account, no dependency chain beyond the Go toolchain.

What it does

sec-cli get fetches and parses a filing, then renders it as JSON, Markdown, or plain text. The full filing or a single section.

sec-cli get AAPL                                   # latest 10-K, JSON
sec-cli get AAPL --year 2023 --output md           # 2023 filing as Markdown
sec-cli get AAPL --section "Risk Factors"          # one section, latest filing
sec-cli get MSFT --section 1A --output text        # by item number
sec-cli get AAPL --type 10-Q                       # quarterly instead of annual

sec-cli diff compares two years of the same filing and surfaces what actually changed, not every comma rephrasing.

sec-cli diff AAPL --from 2022 --to 2024            # structural diff, JSON
sec-cli diff AAPL --from 2022 --to 2024 --output md
sec-cli diff NVDA --from 2023 --to 2024 --section "1A" --layer lexical

Three diff layers: structural (subsection grain, added/removed/modified), lexical (word level, annotated [+..+] / [-..-]), and semantic (embedding distance ranking, planned for v1.0.1). Financial table diffs align rows by GAAP concept (us-gaap:Revenues matches whether the label says “Net revenues” or “Net sales”) so the comparison is meaningful across year-label changes.
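To make the lexical layer and the concept-based row alignment concrete, here is a minimal Python sketch. It is not the Go implementation: it uses difflib from the standard library for the word-level diff, and the row dictionaries (with "concept", "label", and "value" keys) are a hypothetical shape chosen for illustration.

```python
import difflib


def annotate_words(old: str, new: str) -> str:
    """Word-level diff: deletions become [-..-], insertions become [+..+]."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
            continue
        if op in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            out.append("[+" + " ".join(b[j1:j2]) + "+]")
    return " ".join(out)


def align_rows(old_rows, new_rows):
    """Pair rows by GAAP concept, ignoring label text and row position.

    Rows are hypothetical dicts with "concept", "label", and "value" keys.
    Returns (concept, old_row_or_None, new_row_or_None) tuples.
    """
    old_by_concept = {row["concept"]: row for row in old_rows}
    new_by_concept = {row["concept"]: row for row in new_rows}
    concepts = list(old_by_concept)
    concepts += [c for c in new_by_concept if c not in old_by_concept]
    return [(c, old_by_concept.get(c), new_by_concept.get(c)) for c in concepts]
```

Because the join key is the concept name, a row relabeled from “Net sales” to “Net revenues” still lines up with itself, and a row that exists in only one year shows up with None on the other side instead of shifting everything below it.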

Two decisions that matter

iXBRL fact stream over HTML table walking. For modern filings (2019 onward for large companies), every financial figure in the document is tagged with its GAAP concept name, reporting period, and scale. sec-cli reads those tags directly rather than trying to guess structure from table layout. The same data FactSet and Bloomberg read. The accuracy numbers reflect this: on the current synthetic test corpus, statement cell accuracy is 100% across all fixtures. That will come down once the v1.0.1 real-filing corpus (AAPL, MSFT, JPM) is added, but the ceiling is fundamentally higher than any layout-driven approach.
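For readers who have not looked inside an iXBRL document, the idea can be sketched in a few lines of Python. This is an illustration, not sec-cli’s parser: the fragment below is hand-written rather than taken from a real filing, and it assumes plain comma-grouped numbers, whereas real filings also use the format attribute to declare other number styles.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Inline XBRL 1.1 namespace used by ix:nonFraction elements.
IX = "http://www.xbrl.org/2013/inlineXBRL"

# Hand-written, illustrative fragment (not from a real filing).
SAMPLE = """<div xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <ix:nonFraction name="us-gaap:Revenues" contextRef="FY2024" unitRef="usd"
      scale="6" decimals="-6">391,035</ix:nonFraction>
  <ix:nonFraction name="us-gaap:NetIncomeLoss" contextRef="FY2024" unitRef="usd"
      scale="6" decimals="-6" sign="-">1,250</ix:nonFraction>
</div>"""


def read_facts(xhtml: str):
    """Collect numeric facts: concept name, context, and scaled value."""
    root = ET.fromstring(xhtml)
    facts = []
    for el in root.iter(f"{{{IX}}}nonFraction"):
        # Assumption: displayed text is a comma-grouped decimal number.
        text = "".join(el.itertext()).strip().replace(",", "")
        # scale="6" means the displayed number is in millions.
        value = Decimal(text) * Decimal(10) ** int(el.get("scale", "0"))
        if el.get("sign") == "-":
            value = -value
        facts.append({
            "concept": el.get("name"),
            "context": el.get("contextRef"),
            "value": value,
        })
    return facts
```

The point is that concept, period (via the context reference), and scale all come from attributes on the tag itself, so nothing has to be guessed from where the number sits in a table.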

Every table carries a confidence signal. When the fact stream fills a table completely, confidence is high. When coverage drops below 95%, it degrades to medium or low. The parser never silently produces a clean-looking table from partial data. Pre-iXBRL filings (anything before 2019) are refused cleanly with a pointer to v1.1, not attempted and corrupted.

"confidence": {
  "level": "high",
  "row_match_rate": 0.97,
  "cell_resolved_rate": 0.96
}
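A downstream consumer can gate on that object before anything reaches a model. The sketch below assumes the JSON output has a top-level "tables" list whose entries each carry the "confidence" object shown above; the "tables" key and the "title" field are my assumptions for illustration, not a documented schema.

```python
import json

# Hypothetical output shape: only the "confidence" object mirrors the
# example above; "tables" and "title" are assumed for illustration.
SAMPLE = """{"tables": [
  {"title": "Income Statement",
   "confidence": {"level": "high", "row_match_rate": 0.97, "cell_resolved_rate": 0.96}},
  {"title": "Segment Data",
   "confidence": {"level": "low", "row_match_rate": 0.61, "cell_resolved_rate": 0.58}}
]}"""


def split_by_confidence(doc, accepted=("high",)):
    """Separate tables whose confidence level is accepted from the rest."""
    trusted, flagged = [], []
    for table in doc.get("tables", []):
        level = table.get("confidence", {}).get("level")
        # A table with no confidence object is treated as flagged.
        (trusted if level in accepted else flagged).append(table)
    return trusted, flagged


trusted, flagged = split_by_confidence(json.loads(SAMPLE))
```

Flagged tables can then be dropped, routed to a human, or passed along with an explicit warning in the prompt.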

Why Go

The two firm requirements for a CLI in this space are single binary distribution and fast XML parsing.

SEC EDGAR filings are large. A modern 10-K primary document is 15 to 30 MB of HTML. The presentation linkbase XML that accompanies it is another few megabytes. Parsing those with Go’s standard library XML parser, indexing tens of thousands of iXBRL tags, and projecting tables takes under a second on commodity hardware. The equivalent Python implementation using lxml or BeautifulSoup is roughly 4 to 8 times slower on the same workload, which matters when you are processing a batch of filings.

The distribution story is cleaner. go install github.com/kritidutta01/sec-cli/cmd/sec-cli@latest produces a statically linked binary with no runtime dependencies. Users don’t manage a Python environment or deal with version conflicts to use the CLI. Homebrew tap and GoReleaser cross-compiled binaries for darwin and linux, including arm64, drop out of the same pipeline.

Go’s standard library covers nearly everything the tool needs: net/http for the EDGAR client, encoding/xml for the linkbase parser, encoding/json for output, and database/sql for the SQLite cache. The one third-party piece is the go-sqlite3 driver that database/sql talks to. No dependency sprawl.

The Python wrapper bridges via subprocess and JSON. The Go binary is the extraction engine; the Python layer is a typed interface over its canonical output. That split means the extraction logic is tested once in Go with a hermetic suite, and the Python wrapper tests only the deserialization.

Python wrapper

If you’d rather stay in Python:

pip install seccli

import seccli

doc = seccli.get("AAPL")
doc.metadata.company          # "Apple Inc."
doc.tables[0].rows[0].values  # [391035000000, 383285000000, ...]

changes = seccli.diff("AAPL", frm=2022, to=2024)
for s in changes.sections:
    print(s.item, s.status)   # "1A", "modified"

The wrapper drives the binary via subprocess and deserializes the JSON into typed dataclasses. No additional dependencies.
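The bridge pattern itself is small. The sketch below is not the seccli source; it only shows the shape of the idea. The --year and --section flags come from the examples above, while the JSON layout (a "metadata" object with a "company" field) is inferred from the attribute names in the wrapper example and should be read as an assumption. The runner parameter is a hypothetical hook so the parsing can be exercised without the binary installed.

```python
import json
import subprocess
from dataclasses import dataclass


@dataclass
class Metadata:
    company: str


def build_get_command(ticker, year=None, section=None):
    """Build the argv for `sec-cli get`; JSON is the CLI's default output."""
    cmd = ["sec-cli", "get", ticker]
    if year is not None:
        cmd += ["--year", str(year)]
    if section is not None:
        cmd += ["--section", section]
    return cmd


def run_cli(cmd):
    """Run the binary and return stdout; raises CalledProcessError on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def get_metadata(ticker, runner=run_cli, **kwargs):
    """Call the CLI and deserialize the assumed metadata object."""
    payload = json.loads(runner(build_get_command(ticker, **kwargs)))
    return Metadata(company=payload["metadata"]["company"])
```

Passing a fake runner that returns canned JSON is enough to test the deserialization side in isolation, which is the same split described earlier: extraction is tested in Go, and the Python layer only has to prove it reads the output correctly.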

What’s in and what isn’t

v1.0 supports iXBRL-era filings only: 10-K, 10-Q, and 8-K from large filers since 2019, with mid and small cap filers covered from their later rollout dates. If you need filings from before 2019, that’s a v1.1 scope item and the error message will tell you so.

The test suite is hermetic: a fake HTTP transport, recorded fixtures, no network calls. go test ./... works offline. The accuracy harness scores the pipeline against a corpus of synthetic hand-verified fixtures; real-filing corpus expansion is v1.0.1.

Why I built this

sec-cli is the infrastructure layer for two larger projects I’m shipping this summer: FinBench (coming soon), an open benchmark for financial LLM reasoning, and Tearsheet (coming soon), a local agentic analyst with deterministic replay. Both depend on being able to pull and parse 10-K filings reliably. Rather than embed that logic inside each project, I built it as a standalone CLI other people can actually use.

The code is at github.com/kritidutta01/sec-cli. DESIGN.md explains the extraction decisions in detail.
