Training Sets
12 datasets
| Name | Type | Size | Uploaded |
|---|---|---|---|
| Philosophy Corpus (Enriched) LisaMegaWatts/philosophy-corpus: 981 source texts (Aristotle, Plato, Cicero, Seneca, Marcus Aurelius, Epictetus, Kant, Spinoza, Nietzsche, and others) processed through a custom text pipeline with deduplication and quality scoring. Enriched JSONL format. Used to train JuliaSLM (5M parameters; Chinchilla-optimal at 100M tokens). | jsonl | 4555.5 MB | 2026-02-25 |
| Historic Chat — Synthetic Cross-Era Dialogues Synthetically generated multi-turn conversations between 153 historical figures across 38 philosophical/political topics, produced by the @alpha/historic-chat-gen package. Each conversation pairs two figures from different eras (e.g., Cicero & Fisher Ames, Hannibal & Goldwater, Madison & Themistocles) and assigns a randomized tone from 7 types (formal_debate, casual_discussion, heated_argument, philosophical_inquiry, mentorship, reluctant_agreement, comedic_misunderstanding). Turn counts range from 6-16 per conversation via seeded RNG. Generation used OpenAI gpt-4.1-mini (temperature 0.9, JSON response mode) with structured prompts containing each figure's era, bio, speech traits, worldview, and vocabulary. Outputs were validated for strict speaker alternation and correct turn counts, then atomically committed in batches to a local SQLite DB with full cost/token tracking. Exported in flat chat-token format using <|user|> / <|assistant|> / <|end_of_text|> delimiters — no system prompt or speaker names embedded — ready for causal language model training. 12,992 conversations, ~34MB. | txt | 32.6 MB | 2026-02-24 |
| concordance-v2 Concordance v2 training data from Gutenberg (9.4k books), SimpleWiki (316k articles), and Wiktionary (2M+ entries). Paragraph-based windowing with ±1 neighbor context. Gutenberg boilerplate stripped. Short paragraphs (<40 chars) filtered. Deterministic shuffle (seed=42) across all sources. 2.2M lines, 788 MB. | txt | 787.9 MB | 2026-02-21 |
| concordance Concordance training data from Gutenberg + SimpleWiki + Wiktionary. Sentence-based windowing (±2 sentences). Single context per word. Sources grouped (not shuffled). | txt | 322.7 MB | 2026-02-21 |
| concordance-gutenberg-first Concordance-style training data from Project Gutenberg. Paragraph-based context windows extracted from classic literature. Each entry contains a target paragraph with ±1 neighboring paragraphs for context. Single context per word. Not shuffled. | txt | 354.1 MB | 2026-02-21 |
| Novels Dataset Combined novels generated from Discord channel conversations. 38 channel-inspired novels covering topics like AI agents, MCP tools, debugging, generative tools, and more. | txt | 1.6 MB | 2026-02-20 |
| Discord Books v1 - User/Assistant Pairs 4,388 training pairs in User:/Assistant: format, drawn from Discord conversations plus synthetic steering data (greetings, Q&A, instruction following). | txt | 1021.9 KB | 2026-02-20 |
| Discord Books v2 - Natural Conversations 944 multi-turn conversations with real usernames (841 Discord + 103 synthetic). ShareGPT-style JSONL with speaker names preserved. Multi-speaker, merged consecutive messages, natural flow. | jsonl | 1.3 MB | 2026-02-20 |
| Code Instructions Dataset Programming instruction-output pairs for code generation fine-tuning across Python, SQL, and JavaScript. | json | 1.5 KB | 2026-02-20 |
| Q&A Pairs Question-answer pairs across multiple subjects with difficulty ratings. | csv | 1.1 KB | 2026-02-20 |
| Discord Books Training Set 4,388 training pairs from the blah Discord server plus synthetic steering data. | txt | 1021.9 KB | 2026-02-20 |
| Sentiment Analysis Training Data 12 labeled examples for sentiment classification — positive, negative, and neutral. | jsonl | 1.1 KB | 2026-02-20 |
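
The concordance-v2 entry describes paragraph-based windowing with ±1 neighbor context, a 40-character minimum, and a deterministic shuffle with seed=42. A minimal sketch of how those steps might fit together is shown here. It assumes paragraphs are separated by blank lines and that the length filter applies to the target paragraph rather than to its neighbors; the function names are hypothetical and are not taken from the actual pipeline.

```python
import random

MIN_CHARS = 40  # paragraphs shorter than this are dropped (value from the listing)
SEED = 42       # shuffle seed (value from the listing)


def split_paragraphs(text):
    """Split on blank lines and collapse internal whitespace (assumed convention)."""
    return [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]


def paragraph_windows(paragraphs, radius=1, min_chars=MIN_CHARS):
    """Join each sufficiently long paragraph with its +/- radius neighbors."""
    windows = []
    for i, para in enumerate(paragraphs):
        if len(para) < min_chars:
            continue  # short targets are skipped; they may still serve as context
        lo = max(0, i - radius)
        hi = min(len(paragraphs), i + radius + 1)
        windows.append(" ".join(paragraphs[lo:hi]))
    return windows


def build_lines(documents, seed=SEED):
    """Window every document, then shuffle all lines together with a fixed seed."""
    lines = []
    for doc in documents:
        lines.extend(paragraph_windows(split_paragraphs(doc)))
    random.Random(seed).shuffle(lines)  # same seed and input give the same order
    return lines
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the shuffle reproducible even if other code draws random numbers in the same process.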
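
The Historic Chat entry mentions two steps that are easy to illustrate: validating strict speaker alternation and turn counts, and exporting to a flat format with `<|user|>` / `<|assistant|>` / `<|end_of_text|>` delimiters. The sketch below is an illustration only. It assumes each conversation is a list of dicts with `speaker` and `text` keys, that the first speaker maps to the user role, and that each turn sits on its own line; the listing does not specify the real export layout, and the function names are hypothetical.

```python
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end_of_text|>"


def validate(turns, expected_turns):
    """True if the turn count matches and exactly two speakers strictly alternate."""
    if len(turns) != expected_turns or len(turns) < 2:
        return False
    first, second = turns[0]["speaker"], turns[1]["speaker"]
    if first == second:
        return False
    return all(
        t["speaker"] == (first if i % 2 == 0 else second)
        for i, t in enumerate(turns)
    )


def to_flat_chat(turns):
    """Render turns with role delimiters only; speaker names are not embedded."""
    parts = []
    for i, turn in enumerate(turns):
        tag = USER if i % 2 == 0 else ASSISTANT  # assumed role mapping
        parts.append(f"{tag}\n{turn['text'].strip()}")
    return "\n".join(parts) + f"\n{END}\n"
```

A conversation that fails `validate` would simply be discarded or regenerated before export, so malformed turn sequences never reach the training file.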