{"slug":"open-source","id":"prompt-open-source","type":"prompt-pattern","title":"Prompting Patterns for Open Source Models","description":"What works with self-hosted and API-accessed open models — Llama 4, DeepSeek R1, Qwen 3/3.5, Hermes 4, and others. Covers system prompt formats, quantization-aware prompting, reasoning toggles, and temperature tuning.","last_updated":"2026-04-10","last_verified":null,"verification_status":"unverified","markdown_url":"/content/prompt-patterns/open-source.md","html_url":"/prompt-patterns/open-source","api_url":"/api/v1/prompt-patterns/open-source.json","content_hash":"30d8b547b330c525c6f26d83cd1a0c77089e58e3abf604eaeaca94186564e987","sha256":"30d8b547b330c525c6f26d83cd1a0c77089e58e3abf604eaeaca94186564e987","tags":["open-source","llama","deepseek","qwen","hermes","self-hosted","prompt-patterns"],"relationships":{"links":[{"text":"Prompt Engineering Guide","href":"/guides/prompting","html_path":"/guides/prompting","target_id":"prompting","target_type":"guide","target_title":"Prompt Engineering Guide — How to Write Better AI Prompts"}],"related":[{"id":"prompt-anthropic","title":"Prompting Patterns for Anthropic Claude","type":"prompt-pattern","html_url":"/prompt-patterns/anthropic","markdown_url":"/content/prompt-patterns/anthropic.md","shared_tags":["prompt-patterns"],"score":3},{"id":"prompt-google","title":"Prompting Patterns for Google Gemini","type":"prompt-pattern","html_url":"/prompt-patterns/google","markdown_url":"/content/prompt-patterns/google.md","shared_tags":["prompt-patterns"],"score":3},{"id":"prompt-openai","title":"Prompting Patterns for OpenAI GPT-5.4","type":"prompt-pattern","html_url":"/prompt-patterns/openai","markdown_url":"/content/prompt-patterns/openai.md","shared_tags":["prompt-patterns"],"score":3},{"id":"prompt-xai","title":"Prompting Patterns for xAI 
Grok","type":"prompt-pattern","html_url":"/prompt-patterns/xai","markdown_url":"/content/prompt-patterns/xai.md","shared_tags":["prompt-patterns"],"score":3},{"id":"provider-qwen","title":"Alibaba Qwen Provider Profile","type":"provider","html_url":"/providers/qwen","markdown_url":"/content/providers/qwen.md","shared_tags":["qwen","open-source"],"score":2},{"id":"provider-deepseek","title":"DeepSeek Provider Profile","type":"provider","html_url":"/providers/deepseek","markdown_url":"/content/providers/deepseek.md","shared_tags":["deepseek","open-source"],"score":2}],"explicit":{}},"metadata":{"title":"Prompting Patterns for Open Source Models","type":"prompt-pattern","id":"prompt-open-source","description":"What works with self-hosted and API-accessed open models — Llama 4, DeepSeek R1, Qwen 3/3.5, Hermes 4, and others. Covers system prompt formats, quantization-aware prompting, reasoning toggles, and temperature tuning.","last_updated":"2026-04-10","tags":["open-source","llama","deepseek","qwen","hermes","self-hosted","prompt-patterns"]},"content_text":"# Prompting Patterns for Open Source Models\n\nOpen source models are a different game. You control the infrastructure, the configuration, and often the prompt template format itself. The patterns that work with proprietary models (GPT, Claude, Gemini) do not always transfer directly -- open models have different system prompt formats, different sensitivity to temperature and quantization, and model-specific features like reasoning toggles.\n\nThis guide covers the practical prompting differences for the most capable open models: Llama 4 (Scout and Maverick), DeepSeek R1 and V3.2, Qwen 3 and 3.5, and Hermes 4. 
These patterns apply whether you are self-hosting or accessing these models through API providers.\n\nFor general prompting techniques that work across all models, see the [Prompt Engineering Guide](/guides/prompting).\n\n---\n\n## System Prompt Formats: Model-Specific Templates\n\nUnlike proprietary models where the API handles prompt formatting, open models require you to use the correct chat template. Using the wrong format degrades output quality significantly -- the model may ignore your system prompt, produce garbled output, or fail to follow instructions.\n\n### Llama 4 (Meta)\n\nLlama 4 uses its own template built from header tags (not the older Llama 2 `[INST]` tags):\n\n```\n<|begin_of_text|><|header_start|>system<|header_end|>\n\nYou are a helpful assistant specializing in Python code review.\nAlways explain your reasoning before suggesting changes. Focus on\ncorrectness, readability, and performance in that order.\n<|eot|><|header_start|>user<|header_end|>\n\nReview this function for issues:\n\ndef get_users(db, status):\n    query = f\"SELECT * FROM users WHERE status = '{status}'\"\n    return db.execute(query).fetchall()\n<|eot|><|header_start|>assistant<|header_end|>\n```\n\nIf you are using a framework like vLLM, llama.cpp, or Ollama, the template is usually applied automatically. But if you are hitting a raw endpoint or building your own serving stack, you must apply the template yourself. Getting it wrong is one of the most common causes of poor performance with open models.\n\n### ChatML Format (Qwen, Hermes, many others)\n\nMany models use the ChatML format, including Qwen 3/3.5 and Hermes 4:\n\n```\n<|im_start|>system\nYou are a data analyst. When given a dataset, always start by\ndescribing the shape of the data (rows, columns, types) before\nanalyzing it. 
Present findings as numbered insights, most important\nfirst.\n<|im_end|>\n<|im_start|>user\nHere's our monthly sales data for 2025:\n[paste CSV data]\nWhat trends do you see?\n<|im_end|>\n<|im_start|>assistant\n```\n\n### Alpaca Format (older/fine-tuned models)\n\nSome fine-tuned models still use the Alpaca instruction format:\n\n```\n### Instruction:\nYou are a technical writer. Rewrite the following error message to be\nuser-friendly. Keep it under 15 words.\n\n### Input:\nError 0x8007045D: The request could not be performed because of an\nI/O device error.\n\n### Response:\n```\n\n**Rule of thumb:** Check the model card on Hugging Face for the correct chat template before deploying. Most inference failures with open models trace back to prompt formatting issues.\n\n---\n\n## Quantization Awareness: More Explicit Prompts for Smaller Models\n\nWhen running quantized models (Q4, Q5, Q8 via GGUF), you are trading precision for speed and memory. Q4 models lose some instruction-following ability compared to their full-precision counterparts. The fix is more explicit prompts.\n\nWhat changes with quantized models:\n- They are more likely to drift from complex instructions\n- They handle fewer simultaneous constraints\n- Ambiguous prompts produce worse results than with full-precision models\n- Output formatting is less reliable\n\nPattern for quantized models -- be brutally explicit:\n\n```\nTASK: Classify the following customer email.\n\nCATEGORIES (pick exactly one):\n- BILLING: about charges, invoices, refunds, payments\n- TECHNICAL: about bugs, errors, how-to, integrations\n- ACCOUNT: about login, permissions, settings, cancellation\n- OTHER: does not fit above categories\n\nOUTPUT FORMAT: Return ONLY the category name, nothing else.\nNo explanation. No preamble. Just the single word.\n\nEMAIL:\n\"Hi, I've been trying to connect your API to our Salesforce instance\nbut keep getting a 403 error. I've double-checked our API key and it\nseems correct. 
Can someone help?\"\n\nCATEGORY:\n```\n\nWith a full-precision model, you could write this more casually and it would figure out what you want. With a Q4 quantization, the explicit format, the enumerated categories, and the \"return ONLY the category name\" instruction prevent the model from adding unwanted explanation or picking an off-list category.\n\n---\n\n## DeepSeek R1: Explicit Reasoning Mode\n\nDeepSeek R1 is one of the strongest reasoning models available as open source. It has explicit \"think\" and \"non-think\" modes. For hard problems, you want to trigger the thinking mode.\n\nThe simplest way:\n\n```\nThink step by step.\n\nA company has 3 data centers. Each has an independent failure rate of\n2% per month. They need at least 2 data centers running to serve\ncustomers.\n\nWhat is the probability of a service outage (fewer than 2 data centers\nrunning) in any given month? What about over a full year?\n\nShow your complete calculation.\n```\n\n\"Think step by step\" is not just a suggestion with DeepSeek R1 -- it activates the model's explicit reasoning chain. You will see the model produce a `<think>` block with its working before giving the final answer.\n\nFor tasks where you do NOT want the reasoning overhead (simple lookups, formatting, translation), skip the reasoning trigger:\n\n```\nTranslate the following English text to formal Japanese. Do not explain\nyour translation choices. Output only the Japanese text.\n\n\"We are pleased to announce that the quarterly review meeting has been\nrescheduled to March 15th at 2:00 PM.\"\n```\n\nThe difference in latency between think mode and direct mode is substantial. 
Use think mode only when accuracy on a hard problem justifies the wait.\n\n---\n\n## Qwen 3/3.5: Hybrid Reasoning\n\nQwen 3 and 3.5 support a hybrid approach similar to DeepSeek's, with explicit control over whether the model reasons before answering.\n\n**Think mode** (for complex tasks):\n\n```\n/think\n\nYou are reviewing a distributed system design. The system uses\neventual consistency with a 5-second propagation delay.\n\nA user updates their profile name on Server A, then immediately reads\ntheir profile from Server B. Describe all possible outcomes, the\nprobability of each, and how you would solve the inconsistency problem\nwith minimal latency impact.\n```\n\nThe `/think` prefix tells Qwen to engage its reasoning mode. The model will produce its chain-of-thought before answering.\n\n**Non-think mode** (for fast responses):\n\n```\n/no_think\n\nConvert this JSON to a Python dataclass:\n\n{\"name\": \"string\", \"age\": \"int\", \"email\": \"string\", \"roles\": [\"string\"],\n \"active\": \"bool\"}\n```\n\nQwen 3.5 is particularly strong on multilingual tasks, supporting 201 languages. When working cross-language, specify the output language and script explicitly:\n\n```\n/no_think\n\nSummarize the following Japanese article in English. 3 bullet points\nmaximum. Focus on the business implications, not the technical details.\n\n[paste Japanese text]\n```\n\n---\n\n## Hermes Models: Think Tags for Reasoning Control\n\nHermes 4 (whose 405B flagship is built on Llama 3.1 405B, not Llama 4) supports explicit `<think>` tags that toggle reasoning mode. This gives you fine-grained control over when the model reasons and when it responds directly.\n\n```\n<|im_start|>system\nYou are a code security auditor. 
When analyzing code, use <think> tags\nto reason through potential vulnerabilities before presenting your\nfindings.\n<|im_end|>\n<|im_start|>user\nReview this authentication function for security issues:\n\ndef authenticate(username, password):\n    user = db.query(f\"SELECT * FROM users WHERE username='{username}'\")\n    if user and user.password == password:\n        return create_session(user.id)\n    return None\n<|im_end|>\n<|im_start|>assistant\n<think>\nLet me analyze this step by step...\n- SQL injection via f-string formatting in the query\n- Plaintext password comparison (no hashing)\n- No rate limiting or brute force protection\n- Session creation doesn't check if user is active/banned\n</think>\n\nI found 4 security vulnerabilities in this function...\n```\n\nYou can also instruct Hermes to skip thinking for simple tasks by telling it not to use `<think>` tags:\n\n```\nDo not use <think> tags. Respond directly.\n\nWhat is the current LTS version of Node.js?\n```\n\n---\n\n## Context Limits and Temperature\n\n### Context Windows\n\nMost open models have 128K token context windows -- large by historical standards but notably smaller than the 1M windows offered by GPT-5.4, Claude, and Gemini. Some specifics:\n\n- **Llama 4 Scout:** 10M tokens (the exception -- massive context)\n- **Llama 4 Maverick:** 1M tokens\n- **DeepSeek R1/V3.2:** 128K tokens\n- **Qwen 3.5:** 128K tokens\n- **Hermes 4:** 128K tokens\n\nFor models with 128K context, plan your prompts accordingly. You cannot paste a full codebase like you can with Claude or Gemini. Instead, provide only the relevant files or sections, and be explicit about what you are including and why:\n\n```\nI'm debugging a race condition in our order processing system. Here are\nthe 3 relevant files (out of ~200 in the codebase). 
The issue is that\ntwo concurrent orders for the same item can both succeed even when only\none item is in stock.\n\nFile 1 - order_service.py (the main order processing logic):\n[paste file]\n\nFile 2 - inventory.py (stock management):\n[paste file]\n\nFile 3 - database.py (transaction handling):\n[paste file]\n\nFocus on the interaction between these files. The bug is in how\ninventory is checked and decremented.\n```\n\n### Temperature Settings\n\nOpen models often need lower temperature than proprietary models for consistent output. Proprietary models have extensive post-training that stabilizes output; open models can be more erratic at higher temperatures.\n\nRecommended temperature ranges for open models:\n- **Factual/analytical tasks:** 0.1 - 0.3\n- **Code generation:** 0.2 - 0.4\n- **General conversation:** 0.5 - 0.7\n- **Creative writing:** 0.7 - 0.9\n\nIf you are getting inconsistent output from an open model, lowering the temperature is the first thing to try. A Q4 quantized model at temperature 0.9 will produce noticeably noisier output than the same model at 0.3.\n\n---\n\n## Quick Reference\n\n- **System prompt format:** Use the correct template for your model. Llama 4 uses header tags, Qwen/Hermes use ChatML, some fine-tunes use Alpaca. Check the model card.\n- **Quantization:** Q4/Q5 models need more explicit prompts. Enumerate options, specify output format exactly, and use \"return ONLY\" constraints.\n- **DeepSeek R1:** Add \"Think step by step\" to activate reasoning mode. Skip it for simple tasks to reduce latency.\n- **Qwen 3/3.5:** Use `/think` for hard tasks and `/no_think` for fast responses. Strong multilingual support across 201 languages.\n- **Hermes 4:** Use `<think>` tags for explicit reasoning control. Tell the model not to use them when you want direct responses.\n- **Context limits:** Most open models have 128K context (exceptions: Llama 4 Scout 10M, Maverick 1M). 
Include only relevant files and explain what you included.\n- **Temperature:** Use lower temperatures than proprietary models. Start at 0.3 for factual tasks, 0.5 for conversation, 0.7 for creative work.\n- **Verification:** Open models, especially quantized ones, need more output verification than proprietary models. Spot-check factual claims and code correctness.","content_length":12025,"generated_at":"2026-04-24"}