5,507 public companies. 12 XBRL disclosure topics. Extracted text from SEC filings, plus Parquet for programmatic access. One ZIP, delivered via secure download link.
One .txt per company per topic. Text extracted from XBRL filings by the SEC — clean, structured, ready for NLP.
All disclosures in one columnar file. Filter by topic, company, or filing year. pandas / Polars / DuckDB ready.
Every company in the dataset: ticker, CIK, registrant name, SIC code, industry classification.
Python and DuckDB examples to load, query, and analyze the data. Get started in minutes.
import duckdb
db = duckdb.connect()
df = db.sql("SELECT * FROM 'disclosures.parquet' WHERE topic = 'Income Taxes'").df()
print(f"{len(df)} companies with income tax disclosures")
Fine-tune language models on extracted financial disclosure text. Organized by topic with code samples to get started fast.
Build governance signals, tax strategy features, or debt structure classifiers from disclosure narratives.
Benchmark disclosure language across an industry. Compare how peers report on the same topic.
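As a rough illustration of peer benchmarking, the sketch below compares each company's disclosure length on one topic with the median for its SIC code. It assumes the rows have already been loaded into plain dicts (for example via pandas `DataFrame.to_dict("records")`). The keys `ticker`, `sic_code`, and `text` are hypothetical and may not match the real Parquet schema; only `topic` appears in the query example on this page. The sample rows are placeholders, not real data.

```python
from collections import defaultdict
from statistics import median


def length_vs_peers(rows, topic):
    """Compare each company's disclosure length with its industry median.

    `rows` is an iterable of dicts with the assumed (hypothetical) keys
    'ticker', 'sic_code', 'topic', and 'text'.
    """
    # Word count per company, grouped by SIC code, for the chosen topic only.
    # If a ticker appears more than once, the last row seen wins.
    by_industry = defaultdict(dict)
    for row in rows:
        if row["topic"] == topic:
            by_industry[row["sic_code"]][row["ticker"]] = len(row["text"].split())

    report = {}
    for sic_code, counts in by_industry.items():
        peer_median = median(counts.values())
        for ticker, n_words in counts.items():
            # A ratio above 1 means a longer-than-typical disclosure for the industry.
            report[ticker] = {
                "sic_code": sic_code,
                "words": n_words,
                "peer_median": peer_median,
                "ratio": n_words / peer_median if peer_median else None,
            }
    return report


# Placeholder rows for illustration only; not taken from the dataset.
sample_rows = [
    {"ticker": "AAA", "sic_code": "1000", "topic": "Income Taxes", "text": "one two three four"},
    {"ticker": "BBB", "sic_code": "1000", "topic": "Income Taxes", "text": "one two"},
    {"ticker": "CCC", "sic_code": "1000", "topic": "Income Taxes", "text": "one two three four five six"},
    {"ticker": "DDD", "sic_code": "2000", "topic": "Debt", "text": "ignored for this topic"},
]

report = length_vs_peers(sample_rows, "Income Taxes")
for ticker, stats in sorted(report.items()):
    print(ticker, stats["words"], stats["peer_median"], stats["ratio"])
```

Word count is a deliberately crude proxy. The same grouping works with any per-document score, such as keyword counts or embedding distances.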
We're building new SEC filing datasets — insider trades, financial statements, institutional holdings, and more. Tell us what you're working on and we'll help you get the data.
Request a Dataset →
Enter your email, pay once, download instantly.