Training Sets

12 datasets

Name	Type	Size	Lines	Uploaded
Philosophy Corpus (Enriched) LisaMegaWatts/philosophy-corpus — 981 source texts (Aristotle, Plato, Cicero, Seneca, Marcus Aurelius, Epictetus, Kant, Spinoza, Nietzsche, etc.) processed through custom text pipeline with deduplication and quality scoring. Enriched JSONL format. Used to train JuliaSLM (5M param, Chinchilla-optimal at 100M tokens).	jsonl	4555.5 MB	—	2026-02-25
Historic Chat — Synthetic Cross-Era Dialogues Synthetically generated multi-turn conversations between 153 historical figures across 38 philosophical/political topics, produced by the @alpha/historic-chat-gen package. Each conversation pairs two figures from different eras (e.g., Cicero & Fisher Ames, Hannibal & Goldwater, Madison & Themistocles) and assigns a randomized tone from 7 types (formal_debate, casual_discussion, heated_argument, philosophical_inquiry, mentorship, reluctant_agreement, comedic_misunderstanding). Turn counts range from 6-16 per conversation via seeded RNG. Generation used OpenAI gpt-4.1-mini (temperature 0.9, JSON response mode) with structured prompts containing each figure's era, bio, speech traits, worldview, and vocabulary. Outputs were validated for strict speaker alternation and correct turn counts, then atomically committed in batches to a local SQLite DB with full cost/token tracking. Exported in flat chat-token format using <\|user\|> / <\|assistant\|> / <\|end_of_text\|> delimiters — no system prompt or speaker names embedded — ready for causal language model training. 12,992 conversations, ~34MB.	txt	32.6 MB	—	2026-02-24
concordance-v2 Concordance v2 training data from Gutenberg (9.4k books), SimpleWiki (316k articles), and Wiktionary (2M+ entries). Paragraph-based windowing with ±1 neighbor context. Gutenberg boilerplate stripped. Short paragraphs (<40 chars) filtered. Deterministic shuffle (seed=42) across all sources. 2.2M lines, 788 MB.	txt	787.9 MB	—	2026-02-21
concordance Concordance training data from Gutenberg + SimpleWiki + Wiktionary. Sentence-based windowing (±2 sentences). Single context per word. Sources grouped (not shuffled).	txt	322.7 MB	—	2026-02-21
concordance-gutenberg-first Concordance-style training data from Project Gutenberg. Paragraph-based context windows extracted from classic literature. Each entry contains a target paragraph with ±1 neighboring paragraphs for context. Single context per word. Not shuffled.	txt	354.1 MB	—	2026-02-21
Novels Dataset Combined novels generated from Discord channel conversations. 38 channel-inspired novels covering topics like AI agents, MCP tools, debugging, generative tools, and more.	txt	1.6 MB	20,990	2026-02-20
Discord Books v1 - User/Assistant Pairs 4388 training pairs in User:/Assistant: format. Discord conversations + synthetic steering data (greetings, Q&A, instruction following).	txt	1021.9 KB	11,444	2026-02-20
Discord Books v2 - Natural Conversations 944 multi-turn conversations with real usernames (841 Discord + 103 synthetic). ShareGPT-style JSONL with speaker names preserved. Multi-speaker, merged consecutive messages, natural flow.	jsonl	1.3 MB	944	2026-02-20
Code Instructions Dataset Programming instruction-output pairs for code generation fine-tuning across Python, SQL, and JavaScript.	json	1.5 KB	32	2026-02-20
Q&A Pairs Question-answer pairs across multiple subjects with difficulty ratings.	csv	1.1 KB	9	2026-02-20
Discord Books Training Set 4388 training pairs from blah Discord server + synthetic steering data	txt	1021.9 KB	11,444	2026-02-20
Sentiment Analysis Training Data 12 labeled examples for sentiment classification — positive, negative, and neutral.	jsonl	1.1 KB	12	2026-02-20