Training Sets

12 datasets

Name | Type | Size | Uploaded
Philosophy Corpus (Enriched)

LisaMegaWatts/philosophy-corpus: 981 source texts (Aristotle, Plato, Cicero, Seneca, Marcus Aurelius, Epictetus, Kant, Spinoza, Nietzsche, etc.) processed through a custom text pipeline with deduplication and quality scoring. Enriched JSONL format. Used to train JuliaSLM (5M parameters; Chinchilla-optimal at 100M tokens).

Type: jsonl | Size: 4555.5 MB | Uploaded: 2026-02-25
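
The "Chinchilla-optimal at 100M tokens" figure for a 5M-parameter model can be checked with a short calculation. The sketch below assumes the commonly quoted rule of thumb of about 20 training tokens per parameter. That ratio is an approximation and is not stated in the listing, and the function name is illustrative only.

```python
# Rule of thumb from the Chinchilla scaling results: roughly 20 training
# tokens per model parameter. This ratio is an assumption, not a value
# taken from the dataset listing.
TOKENS_PER_PARAM = 20


def chinchilla_optimal_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Return the approximate compute-optimal token budget for a model size."""
    return n_params * tokens_per_param


julia_slm_params = 5_000_000  # "5M param" from the dataset description
budget = chinchilla_optimal_tokens(julia_slm_params)
print(f"{budget:,} tokens")  # 100,000,000 tokens
```

Under that assumption the budget matches the 100M tokens quoted for JuliaSLM.
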
Historic Chat — Synthetic Cross-Era Dialogues

Synthetically generated multi-turn conversations between 153 historical figures across 38 philosophical and political topics, produced by the @alpha/historic-chat-gen package. Each conversation pairs two figures from different eras (e.g., Cicero & Fisher Ames, Hannibal & Goldwater, Madison & Themistocles) and assigns a randomized tone from 7 types (formal_debate, casual_discussion, heated_argument, philosophical_inquiry, mentorship, reluctant_agreement, comedic_misunderstanding). Turn counts range from 6 to 16 per conversation and are drawn from a seeded RNG. Generation used OpenAI gpt-4.1-mini (temperature 0.9, JSON response mode) with structured prompts containing each figure's era, bio, speech traits, worldview, and vocabulary. Outputs were validated for strict speaker alternation and correct turn counts, then atomically committed in batches to a local SQLite DB with full cost and token tracking. Exported in flat chat-token format using <|user|> / <|assistant|> / <|end_of_text|> delimiters, with no system prompt or speaker names embedded, ready for causal language model training. 12,992 conversations, ~34 MB.

Type: txt | Size: 32.6 MB | Uploaded: 2026-02-24
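
The export format and validation rules described for Historic Chat (strict speaker alternation, 6 to 16 turns) can be sketched as follows. This is a minimal illustration, not the @alpha/historic-chat-gen implementation: it assumes each conversation is a run of <|user|> and <|assistant|> segments terminated by <|end_of_text|>, and the function names and sample text are made up for the example.

```python
import re

END_TOKEN = "<|end_of_text|>"
# The capture group keeps the role tokens in the split result.
_ROLE_SPLIT = re.compile(r"(<\|user\|>|<\|assistant\|>)")


def split_corpus(text: str) -> list[str]:
    """Split the exported file into raw conversations (assumed layout)."""
    return [chunk for chunk in text.split(END_TOKEN) if chunk.strip()]


def parse_conversation(raw: str) -> list[tuple[str, str]]:
    """Turn one delimited conversation into (role, text) pairs."""
    parts = _ROLE_SPLIT.split(raw)
    # parts[0] is any text before the first role token; it is ignored here.
    roles = [token.strip("<|>") for token in parts[1::2]]
    texts = [text.strip() for text in parts[2::2]]
    return list(zip(roles, texts))


def is_valid(turns, min_turns: int = 6, max_turns: int = 16) -> bool:
    """Check the turn-count range and strict speaker alternation."""
    if not min_turns <= len(turns) <= max_turns:
        return False
    return all(prev[0] != cur[0] for prev, cur in zip(turns, turns[1:]))


# Invented sample text, for illustration only.
sample = (
    "<|user|>What is virtue?<|assistant|>A habit of choosing well."
    "<|user|>Can it be taught?<|assistant|>Only through practice."
    "<|user|>By whom?<|assistant|>By those who already have it."
    "<|end_of_text|>"
)
conversations = [parse_conversation(c) for c in split_corpus(sample)]
print(len(conversations), is_valid(conversations[0]))  # 1 True
```
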
concordance-v2

Concordance v2 training data from Gutenberg (9.4k books), SimpleWiki (316k articles), and Wiktionary (2M+ entries). Paragraph-based windowing with ±1 neighbor context. Gutenberg boilerplate stripped. Short paragraphs (<40 chars) filtered. Deterministic shuffle (seed=42) across all sources. 2.2M lines, 788 MB.

Type: txt | Size: 787.9 MB | Uploaded: 2026-02-21
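
The concordance-v2 preparation steps (paragraph windows with ±1 neighbor, a 40-character minimum, and a seeded shuffle) can be sketched as below. The listing does not say how paragraphs are delimited or whether filtering happens before windowing, so the blank-line split, the filter-first order, and all function names are assumptions.

```python
import random

MIN_CHARS = 40  # "Short paragraphs (<40 chars) filtered"
SEED = 42       # "Deterministic shuffle (seed=42)"


def paragraph_windows(text: str, radius: int = 1, min_chars: int = MIN_CHARS) -> list[str]:
    """Build one window per paragraph: the paragraph plus its neighbors."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n")]
    # Assumption: short paragraphs are dropped before windowing.
    paragraphs = [p for p in paragraphs if len(p) >= min_chars]
    windows = []
    for i in range(len(paragraphs)):
        lo = max(0, i - radius)
        hi = min(len(paragraphs), i + radius + 1)
        windows.append(" ".join(paragraphs[lo:hi]))
    return windows


def build_lines(sources: list[str], seed: int = SEED) -> list[str]:
    """Window every source, then shuffle all lines with a fixed seed."""
    lines = [window for text in sources for window in paragraph_windows(text)]
    random.Random(seed).shuffle(lines)  # same seed, same order on every run
    return lines
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the shuffle reproducible even if other code draws random numbers first.
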
concordance

Concordance training data from Gutenberg + SimpleWiki + Wiktionary. Sentence-based windowing (±2 sentences). Single context per word. Sources grouped (not shuffled).

Type: txt | Size: 322.7 MB | Uploaded: 2026-02-21
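
One possible reading of "sentence-based windowing (±2 sentences), single context per word" is sketched below: for a given word, take the first sentence that contains it plus up to two sentences on either side. The naive punctuation-based sentence splitter, the first-match rule, and the function names are assumptions made for illustration.

```python
import re
from typing import Optional

# Naive splitter: break after ., ! or ? followed by whitespace (an assumption).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences using the naive rule above."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def first_context(text: str, word: str, radius: int = 2) -> Optional[str]:
    """Return the first sentence containing `word` with +/- `radius` sentences."""
    sentences = split_sentences(text)
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            lo = max(0, i - radius)
            hi = min(len(sentences), i + radius + 1)
            return " ".join(sentences[lo:hi])
    return None  # the word does not occur in the text
```

Returning only the first match is what keeps the output to a single context per word in this sketch.
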
concordance-gutenberg-first

Concordance-style training data from Project Gutenberg. Paragraph-based context windows extracted from classic literature. Each entry contains a target paragraph with ±1 neighboring paragraphs for context. Single context per word. Not shuffled.

Type: txt | Size: 354.1 MB | Uploaded: 2026-02-21
Novels Dataset

Combined novels generated from Discord channel conversations. 38 channel-inspired novels covering topics like AI agents, MCP tools, debugging, generative tools, and more.

Type: txt | Size: 1.6 MB | Uploaded: 2026-02-20
Discord Books v1 - User/Assistant Pairs

4388 training pairs in User:/Assistant: format. Discord conversations + synthetic steering data (greetings, Q&A, instruction following).

Type: txt | Size: 1021.9 KB | Uploaded: 2026-02-20
Discord Books v2 - Natural Conversations

944 multi-turn conversations with real usernames (841 Discord + 103 synthetic). ShareGPT-style JSONL with speaker names preserved. Multi-speaker, merged consecutive messages, natural flow.

Type: jsonl | Size: 1.3 MB | Uploaded: 2026-02-20
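
Merging consecutive messages from the same speaker, as described for Discord Books v2, might look like the sketch below. It assumes the widely used ShareGPT layout of a "conversations" list whose items have "from" and "value" keys, with the speaker name stored in "from". The actual schema of this dataset is not shown in the listing, and the function names are hypothetical.

```python
import json


def merge_consecutive(messages: list[dict]) -> list[dict]:
    """Collapse back-to-back messages from one speaker into a single turn."""
    merged: list[dict] = []
    for msg in messages:
        if merged and merged[-1]["from"] == msg["from"]:
            # Same speaker as the previous turn: append on a new line.
            merged[-1]["value"] += "\n" + msg["value"]
        else:
            merged.append({"from": msg["from"], "value": msg["value"]})
    return merged


def to_jsonl_line(messages: list[dict]) -> str:
    """Serialize one conversation as a single JSONL record."""
    record = {"conversations": merge_consecutive(messages)}
    return json.dumps(record, ensure_ascii=False)
```

Each call to `to_jsonl_line` yields one line of the output file, so a multi-speaker conversation stays a single record with speaker names preserved.
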
Code Instructions Dataset

Programming instruction-output pairs for code generation fine-tuning across Python, SQL, and JavaScript.

Type: json | Size: 1.5 KB | Uploaded: 2026-02-20
Q&A Pairs

Question-answer pairs across multiple subjects with difficulty ratings.

Type: csv | Size: 1.1 KB | Uploaded: 2026-02-20
Discord Books Training Set

4388 training pairs from a Discord server plus synthetic steering data.

Type: txt | Size: 1021.9 KB | Uploaded: 2026-02-20
Sentiment Analysis Training Data

12 labeled examples for sentiment classification — positive, negative, and neutral.

Type: jsonl | Size: 1.1 KB | Uploaded: 2026-02-20