I spent the past two weeks implementing the parsing core of sec-cli, the part that takes a raw SEC filing and turns it into clean, structured output you can pipe into a language model without embarrassment. This post is a technical record of what that actually required.
EDGAR filings arrive in three mutually incompatible formats depending on when the company filed them. Figuring out which format you have, and then extracting tables from it correctly, is the bulk of the implementation work.
The three eras
The SEC’s EDGAR system is a 30-year-old filing archive. Every public company files there. The format of what they file has changed significantly over time.
Filings before 2010 are plain text, with SGML document wrappers, form-feed page breaks, and runs of hyphens or equals signs as horizontal rules. Structure is implicit: you infer a table from the visual alignment of whitespace-padded columns. No semantic markup anywhere.
Between roughly 2010 and 2021, filings moved to plain HTML. Table structure is explicit, but there’s no semantic layer. Row labels say “Revenues” or “Net income” but nothing machine-readable links those labels to GAAP concept definitions. Column headers are typically bare years. Inline styles encode visual formatting that sometimes, but not reliably, correlates with semantic meaning: bold often means “total”, but not always.
Starting with large filers in 2019 and all filers by 2021, the SEC mandated inline
XBRL embedded in HTML (iXBRL). Every material financial figure in the filing is
wrapped in an <ix:nonFraction> or <ix:nonNumeric> element that carries the
GAAP concept name (e.g. us-gaap:Revenues), a contextRef pointing to the
reporting period, a unitRef, a decimals precision indicator, and a scale
multiplier. The document still renders in a browser, but it also carries a
structured data layer that’s unambiguous about what every number means.
v1.0 of sec-cli targets the iXBRL era only. The modern corpus, any filing from a
large or mid-cap company since 2021, is entirely iXBRL. Supporting earlier formats
would require a completely different extraction path with a fundamentally lower
accuracy ceiling, and mixing both paths in v1.0 would smear the accuracy story.
Filings that predate iXBRL are refused cleanly (parsed: false, reason: "pre-iXBRL filing, v1.1 target") rather than attempted badly.
The format router
Before you can parse anything you need to know what you’re looking at. The format
router in internal/router classifies raw filing bytes using four checks, ordered so
that the cheap byte-level tests run before the one that needs parsed table cells.
The first thing to resolve is whether you’re even looking at the filing or a filing index page. EDGAR often returns an index: an HTML table listing all the documents in a submission, with the 10-K as one row. The router detects this, fetches the primary document, and re-enters classification. Recursion is bounded at depth 1, because a real primary document should never itself be an index.
If the bytes don’t contain an <html> tag in the first 64 KB, they’re either an
ASCII filing from before 2010 or something unrecognized. ASCII filings announce
themselves via SGML wrappers (<SEC-DOCUMENT>, <DOCUMENT>, <TYPE>) or by
form-feed characters combined with horizontal rule sequences, the “line art” those
old filings use as page separators. Everything else gets classified Unknown.
For documents that are HTML, the next question is whether the iXBRL namespace is
declared. iXBRL filings put xmlns:ix="http://www.xbrl.org/2013/inlineXBRL" on
the <html> root element. If that string isn’t in the first 64 KB, the document
is plain HTML. This check is a single bytes.Contains call on the document head,
fast enough to dismiss the entire 2010 to 2021 corpus without parsing.
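A minimal sketch of the byte-level checks described so far, assuming the input is already the primary document (index-page detection is left out). The Format names, the Classify function, and the five-character rule length are hypothetical choices for illustration, not necessarily what internal/router uses.

```go
package main

import "bytes"

// Format and its values are hypothetical labels for this sketch.
type Format int

const (
	Unknown Format = iota
	ASCII
	PlainHTML
	IXBRL
)

// headLimit mirrors the 64 KB window described in the post.
const headLimit = 64 * 1024

var ixNamespace = []byte(`xmlns:ix="http://www.xbrl.org/2013/inlineXBRL"`)

// looksASCII checks for SGML wrappers, or a form feed combined with a
// horizontal rule. Treating five repeated characters as a rule is an assumption.
func looksASCII(head []byte) bool {
	for _, tag := range []string{"<SEC-DOCUMENT>", "<DOCUMENT>", "<TYPE>"} {
		if bytes.Contains(head, []byte(tag)) {
			return true
		}
	}
	hasFormFeed := bytes.IndexByte(head, '\f') >= 0
	hasRule := bytes.Contains(head, []byte("-----")) || bytes.Contains(head, []byte("====="))
	return hasFormFeed && hasRule
}

// Classify inspects only the first 64 KB of the raw bytes.
func Classify(raw []byte) Format {
	head := raw
	if len(head) > headLimit {
		head = head[:headLimit]
	}
	// Match "<html" case-insensitively so attributes on the root tag are fine.
	if !bytes.Contains(bytes.ToLower(head), []byte("<html")) {
		if looksASCII(head) {
			return ASCII
		}
		return Unknown
	}
	if bytes.Contains(head, ixNamespace) {
		return IXBRL
	}
	return PlainHTML
}
```

Nothing here parses HTML, which is the point: the namespace check is one substring search over the document head.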
For anything that declares the iXBRL namespace, the router measures tag density:
the fraction of financial-figure table cells that carry an ix: wrapper. Fully
tagged iXBRL filings land above 0.80; partially tagged transitional filings
(which appeared during the SEC rollout) land between 0.10 and 0.80.
The density threshold sounds simple but required one non-obvious fix. Measuring over all numeric cells produces a misleadingly low number. AAPL’s FY2024 10-K has page numbers, footnote markers (1, 2, 3…), headcounts, and other small integers scattered throughout that are never iXBRL-tagged. Counting those in the denominator drops AAPL’s density from the true ~0.89 to around 0.70, close enough to the 0.80 threshold to be fragile on other filings.
The fix was to exclude cells whose text is a calendar year (1900 to 2100, used as column headers) or a “small integer” below 100 (page numbers, footnote markers, headcounts). After filtering to only financial figures, AAPL lands comfortably above the threshold.
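The filter can be sketched roughly as follows. TableCell, isFinancialFigure, and TagDensity are hypothetical names, and the cleanup of currency symbols, commas, and parentheses is an assumption about what cell text looks like before the check runs.

```go
package main

import (
	"strconv"
	"strings"
)

// TableCell is a hypothetical stand-in for a parsed table cell.
type TableCell struct {
	Text   string
	Tagged bool // true when the figure sits inside an ix: element
}

// isFinancialFigure reports whether a cell should count toward tag density.
// Calendar years (1900 to 2100) and integers below 100 are excluded.
func isFinancialFigure(text string) bool {
	s := strings.Trim(strings.TrimSpace(text), "()$ ")
	s = strings.ReplaceAll(s, ",", "")
	if !strings.ContainsAny(s, "0123456789") {
		return false
	}
	if n, err := strconv.Atoi(s); err == nil {
		if n < 0 {
			n = -n
		}
		isYear := n >= 1900 && n <= 2100
		return n >= 100 && !isYear
	}
	// Non-integer numerics (for example per-share amounts) still count.
	_, err := strconv.ParseFloat(s, 64)
	return err == nil
}

// TagDensity is the share of financial-figure cells that carry an ix: wrapper.
func TagDensity(cells []TableCell) float64 {
	var figures, tagged int
	for _, c := range cells {
		if !isFinancialFigure(c.Text) {
			continue
		}
		figures++
		if c.Tagged {
			tagged++
		}
	}
	if figures == 0 {
		return 0
	}
	return float64(tagged) / float64(figures)
}
```

Excluded cells drop out of both the numerator and the denominator, so an untagged page number no longer counts against the filing.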
iXBRL: facts, contexts, and scale
Once you know you have an iXBRL filing, the extraction path has two parts: the fact stream and the presentation linkbase.
The fact stream is everything in the <ix:nonFraction> and <ix:nonNumeric>
elements scattered throughout the document. Each fact carries a concept name (the
GAAP taxonomy name, like us-gaap:Revenues), a contextRef pointing to a
<xbrli:context> that defines the reporting period and entity, a scale
multiplier (scale="6" means multiply the raw value by 10^6, since annual reports
typically report in millions), and a decimals precision indicator. The decimals
attribute only describes how precise the reported value is; it is not a second
magnitude, so never apply it on top of scale.
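As a worked example of the scale rule, here is a small sketch that turns the displayed text of a fact into its full value. ScaledValue is a hypothetical helper; it uses math/big so the result stays exact, and it deliberately ignores decimals. Handling negative scales by dividing is an assumption about how such values should be treated.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// ScaledValue parses the displayed text of a numeric fact and multiplies it
// by 10^scale. The result is an exact rational rendered as "a" or "a/b".
func ScaledValue(displayed string, scale int) (string, error) {
	clean := strings.ReplaceAll(strings.TrimSpace(displayed), ",", "")
	value, ok := new(big.Rat).SetString(clean)
	if !ok {
		return "", fmt.Errorf("not a number: %q", displayed)
	}
	exp := scale
	if exp < 0 {
		exp = -exp
	}
	pow := new(big.Int).Exp(big.NewInt(10), big.NewInt(int64(exp)), nil)
	factor := new(big.Rat).SetInt(pow)
	if scale < 0 {
		value.Quo(value, factor) // negative scale divides
	} else {
		value.Mul(value, factor)
	}
	return value.RatString(), nil
}
```

With these definitions, a cell showing 1,500 with scale 6 comes out as 1500000000, whatever decimals says.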
Two things caused actual bugs worth documenting.
Sign via parentheses. In financial statements, negative values are written (123)
rather than -123. The iXBRL spec says the element should carry a sign="-"
attribute in this case, and some filers do. But many filers omit the attribute and
just wrap the rendered number in parentheses in the surrounding text. The fact
extractor has to track whether an ix:nonFraction element sits inside a run of
text that opens with ( and closes with ). Getting this wrong produces silently
inverted values, a loss reported as a gain.
Segment contexts. A single concept like us-gaap:Revenues often appears in
multiple contexts: once for the consolidated entity (what you want) and once per
reportable segment (Americas, Europe, etc.). The fact index keys on
(concept, contextRef), so without filtering you’d see multiple rows for the same
line item. Primary consolidated contexts have no segment dimension; segment
contexts carry a <xbrli:segment> child in the context definition. The projection
layer filters to primary contexts only.
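The primary-context test can be sketched with encoding/xml, assuming the context definitions are available as well-formed XML. PrimaryContextIDs and xbrlContext are hypothetical names; elements are matched by local name only, so the namespace prefix does not matter.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"io"
)

// xbrlContext captures only what the filter needs from a context element.
type xbrlContext struct {
	ID      string    `xml:"id,attr"`
	Segment *struct{} `xml:"entity>segment"` // nil when no segment child exists
}

// PrimaryContextIDs returns the IDs of contexts that have no segment child,
// in document order.
func PrimaryContextIDs(doc []byte) ([]string, error) {
	dec := xml.NewDecoder(bytes.NewReader(doc))
	var ids []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return ids, nil
		}
		if err != nil {
			return nil, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "context" {
			continue
		}
		var ctx xbrlContext
		if err := dec.DecodeElement(&ctx, &start); err != nil {
			return nil, err
		}
		if ctx.Segment == nil {
			ids = append(ids, ctx.ID)
		}
	}
}
```

Facts whose contextRef is not in the returned list would then be dropped before projection.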
The presentation linkbase
The fact stream gives you values. It doesn’t tell you which statements they belong to, what order the rows go in, or what the display label is for each concept.
That information lives in the presentation linkbase, a separate XML file
(*_pre.xml) that EDGAR requires filers to include alongside the primary
document. It defines which role URI each financial statement gets (e.g.
http://apple.com/role/CONSOLIDATEDSTATEMENTSOFOPERATIONS), the parent-child arc
structure that sets the presentation order of concepts within each role, and
preferred label roles that tell you whether a concept should render its
periodStartLabel (opening cash balance in a cash flow statement) or totalLabel
(this line is a sum).
Without the linkbase, you’d have a bag of facts with no statement membership, no row ordering, and no way to distinguish “Total operating expenses” from a regular line item.
Table projection
Given a role’s ordered concept list and the indexed fact stream, the projection fills a rows × columns grid.
Rows come from the linkbase in presentation order, with structural (parent-only)
concepts filtered out. Row type comes from the preferred label: totalLabel
maps to total, a non-structural parent maps to subtotal, everything else is
data.
Columns are the primary contexts that carry facts for the role’s concepts. Not every context qualifies: some appear in only one or two rows due to incidental cross-statement tagging (cash flow reconciliations often tag balance sheet concepts at extra period-ends). To become a column, a context must cover at least 50% as many rows as the best-covered context does. This floor drops the incidental ones cleanly.
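A minimal sketch of the coverage floor, assuming the number of rows with a fact has already been counted per context. SelectColumns is a hypothetical name, and the alphabetical sort is only there to make the output deterministic.

```go
package main

import "sort"

// SelectColumns keeps the contexts whose row coverage is at least half that
// of the best-covered context. coverage maps contextRef to the number of
// rows that have a fact in that context.
func SelectColumns(coverage map[string]int) []string {
	best := 0
	for _, n := range coverage {
		if n > best {
			best = n
		}
	}
	var keep []string
	for ref, n := range coverage {
		// n*2 >= best is the integer form of n >= 0.5*best.
		if n > 0 && n*2 >= best {
			keep = append(keep, ref)
		}
	}
	sort.Strings(keep)
	return keep
}
```

With coverage counts of 20, 20, 10, and 2, the first three contexts survive and the two-row context is dropped.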
Opening and closing balances in cash flow statements are a special case. They’re
shown inside a duration column (the fiscal year) but are actually instant-period
facts, specifically the balance at a given date. The projection handles this via
the periodStartLabel and periodEndLabel preferred label roles: a row with
periodEndLabel reads the instant fact at the column’s period end; periodStartLabel
reads the instant at one day before the period start (the prior period’s close).
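The date arithmetic can be sketched like this. InstantDate is a hypothetical helper that takes ISO dates as strings, and the dates in the usage note are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

const dateLayout = "2006-01-02"

// InstantDate returns the instant date a balance row should read for a
// duration column, given the row's preferred label and the column's period.
func InstantDate(preferredLabel, periodStart, periodEnd string) (string, error) {
	switch preferredLabel {
	case "periodEndLabel":
		end, err := time.Parse(dateLayout, periodEnd)
		if err != nil {
			return "", err
		}
		return end.Format(dateLayout), nil
	case "periodStartLabel":
		start, err := time.Parse(dateLayout, periodStart)
		if err != nil {
			return "", err
		}
		// The opening balance is the prior period's close: one day earlier.
		return start.AddDate(0, 0, -1).Format(dateLayout), nil
	default:
		return "", fmt.Errorf("label %q does not read an instant fact", preferredLabel)
	}
}
```

For a column running 2024-01-01 to 2024-12-31, the opening-balance row reads the instant at 2023-12-31 and the closing-balance row reads 2024-12-31.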
One thing that matters for LLM consumers: a cell with no matching fact is null
in the output, never 0. Zero is a real reported value. Null means the row
doesn’t apply to that period. Conflating them produces wrong totals when a model
tries to reason over the numbers.
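In Go, one way to keep that distinction through JSON encoding is a pointer field, since encoding/json writes a nil pointer as null. OutputCell and EncodeRow are hypothetical names for this sketch, not the actual output model.

```go
package main

import "encoding/json"

// OutputCell is a hypothetical output cell. The pointer keeps "no fact"
// (nil) distinct from a reported zero.
type OutputCell struct {
	Value *float64 `json:"value"`
}

// EncodeRow marshals one row of cells; nil entries become JSON null.
func EncodeRow(values []*float64) (string, error) {
	cells := make([]OutputCell, len(values))
	for i, v := range values {
		cells[i] = OutputCell{Value: v}
	}
	out, err := json.Marshal(cells)
	return string(out), err
}
```

A plain float64 field would encode a missing fact as 0, which is exactly the conflation described above.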
The confidence contract
Every table the parser produces carries an explicit confidence signal:
"confidence": {
  "level": "high",
  "row_match_rate": 0.97,
  "cell_resolved_rate": 0.96,
  "untagged_cell_count": 2
}
level is high (≥ 95% rows fully filled and ≥ 95% cells resolved), medium,
or low. The design principle is borrowed from forecasting: a parser that
correctly extracts 70% of tables and says so is more useful than one that
attempts 100% and silently corrupts 30%. Downstream consumers, whether a human
auditing the output or an LLM summarizing it, need to know when to trust the
extraction.
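A rough sketch of how the level could be derived from the two rates. The 0.95 floor for high comes from the description above; the boundary between medium and low is not stated here, so it is passed in as a hypothetical mediumFloor parameter. Treating row_match_rate as the “rows fully filled” rate is also an assumption.

```go
package main

// ConfidenceLevel maps the two rates to a level. Both rates must clear a
// floor for the table to earn that level, so the weaker rate decides.
func ConfidenceLevel(rowMatchRate, cellResolvedRate, mediumFloor float64) string {
	const highFloor = 0.95
	lowest := rowMatchRate
	if cellResolvedRate < lowest {
		lowest = cellResolvedRate
	}
	switch {
	case lowest >= highFloor:
		return "high"
	case lowest >= mediumFloor:
		return "medium"
	default:
		return "low"
	}
}
```

With the example values of 0.97 and 0.96, both rates clear 0.95 and the level is high for any medium floor.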
On AAPL’s FY2024 10-K, all three primary statements (income, balance sheet, cash
flow) project at confidence high. That’s the baseline I’m holding for v1.0
accuracy reporting.
What’s next
Phase 7 is the layout fallback for narrative tables, specifically the tables in
MD&A, Risk Factors, and notes sections that appear inside iXBRL filings but aren’t
fact-tagged. These need a heuristic path: header detection from bold formatting
and year-pattern columns, footnote stripping, number normalization. Layout-extracted
tables are capped at confidence medium by design, since the iXBRL path’s semantic
grounding isn’t available for them.
After that: free-text extraction, the normalized output model, SQLite cache, and
the get and diff CLI commands. The sec-cli launch post will go up when
sec-cli get AAPL produces output you’d actually pipe into a language model.
The code is at github.com/kritidutta01/sec-cli.
The design decisions are in DESIGN.md in the root of the repo, worth reading
if you’re building anything in the EDGAR/financial-data space.