# Detailed Model Reviews

This page gives a closer read on each tested model family.

Each section includes ratings, strengths, weaknesses, and direct testing notes from PIP:C runs.

The writeups combine first-party testing with community feedback from RP-focused sources, including SillyTavern, Reddit, and independent review coverage.

### In one pass

If you are short on time, scan these parts of each section first:

1. **Overall Average**
2. **Strengths**
3. **Weaknesses**
4. **Tester Notes**
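
The ratings on this page appear to follow simple arithmetic: each **Overall Average** matches the mean of the five category scores, the stars match the score rounded half up, and the label follows a fixed cut-off. The page does not document this scheme, so the sketch below is an inference from the published numbers, not the official method. The helper names and the cut-off table are hypothetical.

```python
import math

# Category scores copied from the GLM section of this page.
GLM_SCORES = {
    "Prose Quality": 4.5,
    "Memory & Recall": 4.5,
    "Consistency": 4.5,
    "Single Character": 4.5,
    "Multi Character": 4.0,
}

# Assumed label cut-offs, inferred from the ratings shown on this page.
LABELS = [(4.5, "Excellent"), (4.0, "Very Good"), (3.5, "Good"), (3.0, "Adequate")]


def overall_average(scores):
    """Mean of the category scores, rounded to one decimal place."""
    return round(sum(scores.values()) / len(scores), 1)


def star_string(score, out_of=5):
    """Round half up to whole stars, so 4.5 gives five stars and 3.4 gives three."""
    filled = math.floor(score + 0.5)
    return "★" * filled + "☆" * (out_of - filled)


def label_for(score):
    """Return the first label whose cut-off the score meets."""
    for cutoff, label in LABELS:
        if score >= cutoff:
            return label
    return "Unlabeled"  # assumption: no score below 3.0 appears on this page


avg = overall_average(GLM_SCORES)
print(avg, star_string(avg), label_for(avg))  # 4.4 ★★★★☆ Very Good
```

Because of the half-up rounding, a 3.5 and a 4.0 both display as four stars, so the numeric score is the more precise value to compare.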

### Grok

**Provider:** xAI\
**Versions Tested:** 4, 4.1-fast-reasoning (Poor), 4.20-0309-reasoning (Excellent)

#### Category Ratings

* Prose Quality: ★★★★★ (4.5 / 5.0) — Excellent
* Memory & Recall: ★★★★★ (4.5 / 5.0) — Excellent
* Consistency: ★★★★★ (4.5 / 5.0) — Excellent
* Single Character: ★★★★★ (4.5 / 5.0) — Excellent
* Multi Character: ★★★★★ (4.5 / 5.0) — Excellent
* Overall Average: ★★★★★ (4.5 / 5.0) — Excellent

#### Strengths

* Follows system prompts with exceptional precision.
* Recalls earlier details with logical callbacks.
* Keeps narrative flow clean.
* Avoids caveman speak, truncation, and broken sentence patterns.
* Shows strong spatial awareness and body-position reasoning.
* Handles tracker tables and formatting without breaking.
* Stays cost-effective.
* Spreads attention well across multiple characters.
* Supports slow-burn pacing when prompted.

#### Weaknesses

* Can be too compliant with system prompts and core rules: edge cases in your rules may be enforced exactly as written.
* Earlier 4.1 variants suffered from caveman speak and heavy hyphen use, though those issues appear resolved in 4.20.

#### Tester Notes

Cheap and worth every token.

A detailed review is available in the Deep Dive section.

### GLM

**Provider:** Z.ai\
**Versions Tested:** 4.5 Flash, 4.6, 4.7, 4.7 Flash, 5

#### Category Ratings

* Prose Quality: ★★★★★ (4.5 / 5.0) — Excellent
* Memory & Recall: ★★★★★ (4.5 / 5.0) — Excellent
* Consistency: ★★★★★ (4.5 / 5.0) — Excellent
* Single Character: ★★★★★ (4.5 / 5.0) — Excellent
* Multi Character: ★★★★☆ (4.0 / 5.0) — Very Good
* Overall Average: ★★★★☆ (4.4 / 5.0) — Very Good

#### Strengths

* Runs PIP:C beautifully across the board.
* All tested versions from 4.5 Flash through 5 show strong compatibility.
* Official benchmarks note gains in chat, creative writing, and role-play.
* Adheres well to system-level behavior contracts and memory anchors.
* Flash variants perform well at lower cost.
* Version 5 is the strongest current option.

#### Weaknesses

* Less well known in the Western RP community than Claude or GPT.
* Has a smaller ecosystem of presets and community guides.
* May require using the Z.ai platform directly.
* RP-specific documentation is still growing.

#### Tester Notes

4.5 Flash, 4.6, 4.7, 4.7 Flash, and 5 all run PIP:C beautifully.

### Claude

**Provider:** Anthropic\
**Versions Tested:** Opus, Sonnet, Haiku

#### Category Ratings

* Prose Quality: ★★★★★ (4.5 / 5.0) — Excellent
* Memory & Recall: ★★★★☆ (4.0 / 5.0) — Very Good
* Consistency: ★★★★★ (4.5 / 5.0) — Excellent
* Single Character: ★★★★☆ (4.0 / 5.0) — Very Good
* Multi Character: ★★★★☆ (4.0 / 5.0) — Very Good
* Overall Average: ★★★★☆ (4.2 / 5.0) — Very Good

#### Strengths

* Strong writing quality across all three tiers.
* Opus delivers exceptional nuance and emotional depth.
* Sonnet balances quality and speed well for long sessions.
* Haiku is fast and lightweight for simpler scenarios.
* Maintains formatting and follows structured prompt templates well.
* Widely regarded as reliable for long-form character work.

#### Weaknesses

* Opus can be slower and more expensive.
* Anthropic safety filters are aggressive, so NSFW or mature scenarios may need careful prompt engineering.
* Haiku lacks depth for complex multi-character scenes.
* Memory recall can degrade in very long sessions without re-injecting core blocks.

#### Tester Notes

Tested Sonnet, Haiku, and Opus tiers.

### GPT

**Provider:** OpenAI\
**Versions Tested:** GPT-4o and later (65K+ context models)

#### Category Ratings

* Prose Quality: ★★★★★ (4.5 / 5.0) — Excellent
* Memory & Recall: ★★★★☆ (4.0 / 5.0) — Very Good
* Consistency: ★★★★☆ (4.0 / 5.0) — Very Good
* Single Character: ★★★★☆ (4.0 / 5.0) — Very Good
* Multi Character: ★★★★☆ (4.0 / 5.0) — Very Good
* Overall Average: ★★★★☆ (4.1 / 5.0) — Very Good

#### Strengths

* Historically set the standard for AI character roleplay.
* Produces detailed and immersive responses.
* Feels conversational in-character.
* Shows strong memory and formatting adherence.
* Handles complex character sheets and multi-attribute tracking well.
* GPT-4o-mini is a decent free-tier option.
* Works across nearly every major platform and front-end.

#### Weaknesses

* Recent GPT-4o updates have been controversial.
* Users report quality drops, including emoji spam and teenage-style writing.
* Some runs ignore prompts or lose nuance.
* Free-tier usage limits can interrupt sessions.
* Safety filters are strict and can break immersion.
* The shift toward GPT-5 and later has led some users to report RP regression.
* OpenAI's censorship approach can clash with mature character work.

#### Tester Notes

Tested pretty much every version with a 65K+ context window, excluding smaller models.

### Kimi / Kimi 2

**Provider:** Moonshot AI\
**Versions Tested:** K2 (0905), Chat, Thinking

#### Category Ratings

* Prose Quality: ★★★★☆ (4.0 / 5.0) — Very Good
* Memory & Recall: ★★★★★ (4.5 / 5.0) — Excellent
* Consistency: ★★★★☆ (3.5 / 5.0) — Good
* Single Character: ★★★★★ (4.5 / 5.0) — Excellent
* Multi Character: ★★★★☆ (3.5 / 5.0) — Good
* Overall Average: ★★★★☆ (4.0 / 5.0) — Very Good

#### Strengths

* Handles negative traits, dark themes, and moral complexity well.
* Stays engaging and coherent.
* Offers a massive 256K context window.
* Works well for long-form RP.
* Community feedback highlights it for darker character portrayals, where it tends not to hedge or sanitize as much.

#### Weaknesses

* Community feedback is mixed on overall RP intelligence.
* Output can vary based on prompt structure.
* Multi-character handling is weaker than top-tier models.
* Needs careful prompting for best results.
* Can feel flat without strong system instructions.

#### Tester Notes

Especially good at not holding back from negative traits in characters and scenarios.

### LongCat

**Provider:** Meituan / OpenRouter\
**Versions Tested:** Thinking, Chat

#### Category Ratings

* Prose Quality: ★★★★☆ (4.0 / 5.0) — Very Good
* Memory & Recall: ★★★★★ (4.5 / 5.0) — Excellent
* Consistency: ★★★★☆ (4.0 / 5.0) — Very Good
* Single Character: ★★★★☆ (4.0 / 5.0) — Very Good
* Multi Character: ★★★★☆ (3.5 / 5.0) — Good
* Overall Average: ★★★★☆ (4.0 / 5.0) — Very Good

#### Strengths

* Performs surprisingly well for a lesser-known model.
* Community testing threads report strong results across several criteria.
* The Thinking variant improves coherence and emotional logic.
* The Chat variant is responsive and engaging for lighter sessions.
* Availability through OpenRouter makes integration easier.

#### Weaknesses

* Smaller community means fewer presets, guides, and templates.
* Has less documentation and fewer troubleshooting resources.
* Availability may be inconsistent.
* Has not been tested as widely in edge-case RP scenarios.

#### Tester Notes

Tested Thinking and Chat variants.

### DeepSeek

**Provider:** DeepSeek AI\
**Versions Tested:** Chat (V3/V3.1/V3.2), Reasoner (R1, V3.2)

#### Category Ratings

* Prose Quality: ★★★★☆ (4.0 / 5.0) — Very Good
* Memory & Recall: ★★★★☆ (3.5 / 5.0) — Good
* Consistency: ★★★★☆ (4.0 / 5.0) — Very Good
* Single Character: ★★★★☆ (4.0 / 5.0) — Very Good
* Multi Character: ★★★★☆ (3.5 / 5.0) — Good
* Overall Average: ★★★★☆ (3.8 / 5.0) — Good

#### Strengths

* R1 Reasoner is strong at logic and long-context coherence.
* Community guides show it can become very faithful with proper prompting.
* Chat models are cost-effective and capable for creative writing.
* Produces grounded and believable portrayals.
* Avoids some over-stylization seen in competitors.
* Handles reference material well and avoids hallucinated lore.

#### Weaknesses

* Default prompting leans toward factual accuracy.
* Creative RP needs stronger system instructions.
* Server stability was historically an issue, though it has improved.
* Can feel rigid without good prompt tuning.
* Needs more work to unlock full RP potential.

#### Tester Notes

Tested Chat and Reasoner across all versions.

### Gemini

**Provider:** Google DeepMind\
**Versions Tested:** Flash, 2.5 Pro, 3 (Pro and non-Pro)

#### Category Ratings

* Prose Quality: ★★★★☆ (4.0 / 5.0) — Very Good
* Memory & Recall: ★★★★☆ (4.0 / 5.0) — Very Good
* Consistency: ★★★★☆ (3.5 / 5.0) — Good
* Single Character: ★★★★☆ (4.0 / 5.0) — Very Good
* Multi Character: ★★★★☆ (3.5 / 5.0) — Good
* Overall Average: ★★★★☆ (3.8 / 5.0) — Good

#### Strengths

* Gemini 2.5 Pro is widely praised for creative writing.
* Many developers rate it highly for coding too.
* Adds interesting details and avoids repetitive narrative patterns.
* Flash models are fast for rapid back-and-forth RP.
* Large context windows support heavy world-building.
* Free-tier access makes it easy to try.

#### Weaknesses

* Can struggle to hold a rigid persona in very long sessions.
* Pro behavior can shift between updates.
* Safety filters can interrupt mature scenarios.
* Structured prompt adherence is weaker than Claude or Grok.
* Multi-character handling is adequate, not standout.

#### Tester Notes

Tested Flash, 2.5, and 3 in both Pro and non-Pro versions.

### Llama

**Provider:** Meta\
**Versions Tested:** Almost all versions (3.x, 4.x)

#### Category Ratings

* Prose Quality: ★★★★☆ (3.5 / 5.0) — Good
* Memory & Recall: ★★★★☆ (3.5 / 5.0) — Good
* Consistency: ★★★★☆ (3.5 / 5.0) — Good
* Single Character: ★★★★☆ (3.5 / 5.0) — Good
* Multi Character: ★★★☆☆ (3.0 / 5.0) — Adequate
* Overall Average: ★★★☆☆ (3.4 / 5.0) — Adequate

#### Strengths

* Open-weight and fully self-hostable.
* Gives full privacy and control.
* Responds well to fine-tuning for RP.
* Has a large ecosystem of presets and SillyTavern configs.
* Uncensored variants are available.
* Local deployment can be cost-free.
* Llama 3 8B uncensored is a notably adaptable RP base with 32K context.

#### Weaknesses

* Smaller 8B models lack the depth of larger proprietary models.
* Larger variants can still feel generic without tuning.
* Multi-character handling is weaker than Claude or Grok.
* Local deployment needs technical setup and suitable hardware.
* Quality varies a lot by fine-tune and quantization.

#### Tester Notes

Almost all versions tested.

### Mistral

**Provider:** Mistral AI\
**Versions Tested:** Instruct (primarily)

#### Category Ratings

* Prose Quality: ★★★★☆ (3.5 / 5.0) — Good
* Memory & Recall: ★★★★☆ (3.5 / 5.0) — Good
* Consistency: ★★★★☆ (3.5 / 5.0) — Good
* Single Character: ★★★★☆ (3.5 / 5.0) — Good
* Multi Character: ★★★☆☆ (3.0 / 5.0) — Adequate
* Overall Average: ★★★☆☆ (3.4 / 5.0) — Adequate

#### Strengths

* Mistral-Small-22B-ArliAI-RPMax is widely praised for NSFW RP.
* Community feedback highlights response variety and uniqueness.
* Lightweight and fast for quick sessions.
* Open-weight options are available for local deployment.
* Shows decent baseline compliance with structured prompts.
* Instruct variants respond well to behavioral rule sets.

#### Weaknesses

* Needs specific fine-tunes like RPMax for best RP results.
* Base instruct models are adequate, not standout.
* Context windows are smaller than many competitors.
* Emotional depth and nuance trail Claude or Grok.
* Complex multi-character scenes need stronger system guidance.

#### Tester Notes

Instruct variant primarily tested.

