Fixing the Pre-Print Volume Problem
A simple tool to reclaim signal from the deluge of pre-print noise.
Solving the AI-Induced Problem with the AI Tool Itself!
In my previous article I examined the exponential growth of papers and how it is creating an ocean of work, much of it noise: the unwanted signal that stops us grounding our work and making practical, measurable progress. Defining words like "coherence," "noise," and "signal" as used in this article is a challenge, but it can be achieved to an extent within the frame of the protocol, as we will see later. For clarity (a minimal code sketch of this mapping follows the two definitions):
Noise is defined as papers scoring 1-2 on a 1-5 scale across five criteria (Falsifiability, Operationalization, Rigor, Context, Humility), indicating untestable or poorly grounded content.
Signal is defined as papers scoring 4-5 across the same criteria, reflecting empirically validated findings with reproducible experimental results or well-structured, measurable theories.
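As a minimal sketch of how this mapping could be operationalized in Python (the function name, the "borderline" label, and the example scores are illustrative assumptions, not part of any published tool), the thresholds translate directly into a few lines of code:

# Minimal sketch: map the average of the five criterion scores to a category.
# The thresholds mirror the definitions above; everything else is illustrative.
def classify(scores):
    """scores: dict mapping each of the five criteria to a 1-5 score."""
    average = sum(scores.values()) / len(scores)
    if average <= 2:
        return "noise"       # untestable or poorly grounded content
    if average >= 4:
        return "signal"      # empirically validated, well-structured work
    return "borderline"      # interesting, but needs a more rigorous framework

example = {"Falsifiability": 2, "Operationalization": 2, "Rigor": 1,
           "Context": 2, "Humility": 3}
print(classify(example))  # prints "noise" (average = 2.0)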
The article ‘Medical and AI Research: A Tale of Two Crises’ highlighted that this is a very real and measurable problem and, importantly, that even though the scientific and academic community knows about the issues, it is at a standstill and reluctant to change course. This article shows how the self-same AIs that are adding to the problem may in fact be part of the answer, or at least help slow the tsunami of papers flooding our systems with coherent but unusable and ungrounded exposition.
Consider this: we live in an age of unprecedented scientific output. Platforms like arXiv and Zenodo publish tens of thousands of papers monthly. This should be a golden age of discovery, and yet many researchers feel they are drowning in a sea of information, struggling to separate foundational breakthroughs from seductive, hollow noise.
The AI-Induced Coherence Trap
The problem isn’t just volume; it’s a systemic pathology I call the "Coherence Trap": a system that rewards papers that look rigorous and novel over those that genuinely are. By coherence we simply mean an argument or text with logical flow: a leads to b, and b leads to c. These papers are masterclasses in surface-level plausibility. They are often packed with equations, citations, and grand claims, yet they collapse under the slightest scrutiny into a fog of undefined terms and untestable predictions.
This trap is accelerating, and it is overwhelming the systems that have been the backbone of science. Large Language Models (LLMs), the very technology that could help us navigate this deluge, are perfectly engineered to become engines of this coherent noise, automating the production of papers that are all style and little substance or usefulness. But what if we could fight fire with fire? What if we could use AI not as a generator, but as a filter?
The Diagnosis: Where the System Broke Down
The traditional gatekeeper of quality was the master-apprentice model in academia. A senior researcher would grill a junior colleague: “How do you measure that?” “What would disprove your theory?” “Derive this equation from first principles.” This process, though often arduous, enforced intellectual honesty and rigor before an idea was released into the world.
However, that model is collapsing under the pressure to publish fast and often. Pre-print servers, for all their democratic and egalitarian brilliance, operate on a principle of post-publication review. The filter comes after the noise has already been added to the pile, and we have therefore lost the first line of defence.
The Treatment: AI as the Digital Scholar
The solution, of course, isn’t to slow down progress or abandon pre-prints. It’s to create a new, scalable layer of defence that emulates the critical eye of that experienced mentor and builds upon past traditions.
In this light, we can design AI to act as a rigorous first-pass reviewer. Its job isn’t to judge whether a theory is "true"; that requires human genius and experimentation. The job of an AI digital master is to be a guide: to assess whether a theory is testable, rigorous, and properly formed. In this role, acting as a master craftsman, it checks the plumbing before we decide if the water is worth drinking.
The AI reviewer asks the same fundamental questions a person should:
Falsifiability: Can this idea be disproven by any experiment, or is it a castle in the sky?
Operationalization: Are its key concepts defined in a way that they can be measured?
Rigor: Is the mathematics substantive, or just decorative?
Context: Does this work engage with what’s already known, or does it pretend to invent everything from scratch?
Humility: Does it acknowledge its own limits and uncertainties?
To predict the protocol’s effectiveness, I make three measurable claims (a brief sketch of how to check them follows this list):
Papers scoring ≥4 will have a 70% acceptance rate by human peer reviewers within six months of submission.
Papers scoring ≤2 will be rejected by human reviewers in ≥80% of cases due to identified flaws.
A 50% reduction in untestable claims in papers revised based on AI feedback compared to an unrevised control group.
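Checking these claims requires only simple bookkeeping. Below is a hedged sketch in Python using hypothetical tracking records that pair each paper's AI protocol score with its eventual human peer-review outcome; the data and field names are placeholders, not real results.

# Sketch: test the first two predictions from hypothetical tracking data.
tracked = [
    {"ai_score": 5, "accepted": True},
    {"ai_score": 4, "accepted": True},
    {"ai_score": 4, "accepted": False},
    {"ai_score": 2, "accepted": False},
    {"ai_score": 1, "accepted": False},
]

high = [r for r in tracked if r["ai_score"] >= 4]
low = [r for r in tracked if r["ai_score"] <= 2]

acceptance_high = sum(r["accepted"] for r in high) / len(high)   # prediction: >= 0.70
rejection_low = sum(not r["accepted"] for r in low) / len(low)   # prediction: >= 0.80

print(f"Acceptance rate for scores >= 4: {acceptance_high:.0%}")
print(f"Rejection rate for scores <= 2: {rejection_low:.0%}")

The third prediction, a 50% reduction in untestable claims, would additionally need a count of untestable claims per paper before and after revision, compared against an unrevised control group.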
A Tool for the New Age of Science
To this end, I’ve developed an initial prompt that anyone can use to carry out some basic tests on an article, paper, or even a discussion with an LLM, and see how it aligns with basic scientific principles. It is a set of instructions for any capable LLM (like ChatGPT, Claude, or Grok) that turns it into exactly this kind of critical reviewer. It does this by examining the coherence of the text and checking that coherence against basic principles of scientific and academic writing:
Pre-print Server Reviewer Protocol
So let’s get started. This is how to use it:
Copy the prompt below into your LLM of choice.
Paste your draft manuscript, paper, text, or even a discussion into the chat, or upload the document.
Review the analysis. It will give you a score from 1 to 5 and, most importantly, a list of critical questions you must answer. (If you prefer to script this workflow, a minimal sketch follows this list.)
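For readers who would rather automate the copy-and-paste step, here is a minimal sketch using the OpenAI Python client. The model name, the file names, and the idea of storing the protocol in a text file are assumptions for illustration only; any capable LLM and client library could be substituted, and the manual chat workflow above works just as well.

# Sketch: run the reviewer protocol against a draft via an LLM API.
# Assumes the full protocol text (from the end of this article) is saved
# in protocol.txt and the draft to be reviewed in draft.txt.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

protocol_prompt = Path("protocol.txt").read_text(encoding="utf-8")
draft_text = Path("draft.txt").read_text(encoding="utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system", "content": protocol_prompt},
        {"role": "user", "content": draft_text},
    ],
)

# The reply should follow the protocol's output format: overall score,
# breakdown table, semantic uncertainty audit, summary, and points to address.
print(response.choices[0].message.content)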
Be clear: this isn’t about getting a perfect score; it’s about stress-testing your ideas before you submit them. A score of 3 means you have a fascinating idea that needs a more rigorous framework. A score of 4 means it’s ready for human expert eyes. This process is the digital equivalent of a colleague reading your draft over coffee and asking the one question that saves you from public embarrassment. Many of us work alone, perhaps outside academia, and we need this review. I know I do.
Uncertainty: The Vitality of Science
Importantly, the goal is not to create an automated tyranny that stifles innovation; paradigm shifts, after all, often look like nonsense to established systems. The goal is to create a culture of rigor, where we use our own tools to hold ourselves to a higher standard. By using a tool like this, we do more than improve our own work: we contribute to the solution. We can take responsibility for our work and choose to add signal instead of noise. We can choose to be part of the cure and not the source of the problem.
I encourage you to try it. Be brave. Paste your work into your willing AI expert. See what it tells you. The questions it asks will be the very ones your future reviewers will, and should, be asking.
We all need to build a future where the next great discovery isn’t lost in the thunderstorm, but is the clear, vital drop of rain we can all see.
This protocol is therefore just an example, but you may find it useful as a personal guide to assess your own processes.
A Meta-Validation: When the Tool Turned on Itself
In the spirit of rigorous testing, I ran a draft of this article through the protocol. The result was both humbling and illuminating.
The AI reviewer gave it a 4 out of 5—"Review by Human." It praised the clear definitions and the practicality of the protocol but offered sharp, constructive criticism. It pointed out that I hadn't sufficiently engaged with existing academic literature on AI review tools. It asked me to justify why I’d chosen a 1-5 scoring scale and to better define my catchy term, the "Coherence Trap."
For a moment, I felt the sting every academic knows: the critique that lands because it's true. But this wasn't a failure. It was the ultimate validation. This process proved two things:
The tool works. It doesn't shower praise on its creator. It applies a strict, impartial standard.
The goal is not perfection. A score of 4 isn't a rejection; it's an invitation to improve. It’s the digital equivalent of a trusted colleague marking up your draft with red pen—not to tear it down, but to make it bulletproof before it faces the world.
This experience is exactly what I'm advocating for. We need that initial, objective gut-check before we hit publish. This protocol isn't about creating a tyrannical arbiter of truth; it's about building a culture where we actively seek out critical feedback to strengthen our ideas.
The feedback I received is now my roadmap for a more robust version of this article, perhaps even a formal paper. For now, I’m sharing this draft with you—not as a final, perfect solution, but as a working prototype and a call to action.
I encourage you to be braver than I was. Run your own work through the protocol below. See what it tells you. Let it ask you the questions you’re afraid to ask yourself. The goal isn't to score a 5; it's to start the conversation that gets you there.
Initial Thoughts on Validation
This protocol is just a demonstrator and may itself need refining. For example, if applied in an online system, more work may be needed on the metrics or criteria that determine "noise" versus "signal" in scientific discourse and papers. We could do this by testing the AI reviewer’s assessment process against a sample set of papers to validate its effectiveness. To address potential biases, a panel of 3-5 experts would score a sample of 30-50 pre-prints, mapping their judgments to a 1-5 scale; these would be correlated with the AI scores, and the AI criteria adjusted if discrepancies exceed 20%.
Further Validation of the Pre-print Server Reviewer Protocol
Note that by defining "noise," "signal," and "coherence" as categorical values assigned by humans, we can establish a framework to empirically score and evaluate them against an LLM's numerical scoring, aligning the two systems through a validation process. The protocol provided outlines a 1-5 scoring system for five criteria (Falsifiability, Operationalization, Rigor, Context, Humility) that can be adapted. To compare human and LLM scores, we can:
Define Categorical Criteria: Assign human categorical judgments (e.g., "noise" as low-quality, untestable content; "signal" as empirically validated findings; "coherence" as logical flow) to numerical ranges within the 1-5 scale based on expert consensus (e.g., "noise" = 1-2, "signal" = 4-5, "coherence" = 3-5 with qualifiers).
Method of Calibration with Experts: Use a panel of experts to score a sample of 30-50 pre-prints, mapping their categorical assessments to the 1-5 scale. Simultaneously, configure an LLM with the protocol prompt to score the same papers numerically.
Correlation Analysis: Calculate the correlation between human categorical scores (converted to numbers) and LLM scores, targeting 80% power and 0.05 significance. Adjust LLM weighting or criteria if discrepancies arise.
Iterative Refinement: Refine definitions and scoring thresholds based on expert feedback, ensuring the LLM's numerical output reflects human categorical intent.
This approach ensures comparability by grounding the LLM's scores in human-defined categories, with "noise" and "signal" operationally defined to ensure measurable differentiation. Specifically, it:
Defines "noise" as papers scoring 1-2 on the 1-5 scale across the five criteria (Falsifiability, Operationalization, Rigor, Context, Humility), indicating untestable or poorly grounded content.
Defines "signal" as papers scoring 4-5 across the same criteria, reflecting empirically validated findings with reproducible results or well-structured, measurable theories.
Uses a validation process with expert consensus to map these definitions to numerical scores, ensuring consistency.
Assessment of the AI reviewer protocol: Example
Predict that papers reviewed by the AI protocol and scoring ≥4 will have a 70% acceptance rate by human peer reviewers within six months of submission.
Predict that papers scoring ≤2 will be rejected by human reviewers in ≥80% of cases due to identified flaws.
Predict a 50% reduction in untestable claims in papers revised based on AI feedback compared to an unrevised control group.
Validation against a sample set of papers to confirm its reliability:
Select a sample of 30-50 pre-prints, covering diverse fields.
Have a panel of 3-5 experts score each paper categorically (e.g., "noise," "signal," "coherence") and map to a 1-5 scale based on consensus.
Configure the AI with the protocol to score the same papers numerically.
Perform a correlation analysis between human and AI scores, targeting 80% power and 0.05 significance, adjusting AI criteria if discrepancies exceed 20% (a minimal code sketch of this step follows the list).
Iterate based on expert feedback to align AI output with human judgments.
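As a minimal sketch of the correlation step, assuming the expert panel's categorical judgments have already been mapped onto the 1-5 scale: the scores below are illustrative placeholders, Spearman correlation is one reasonable choice for ordinal data, and treating "a gap of more than 1 point, in more than 20% of papers" as the discrepancy trigger is only one possible reading of the 20% threshold above.

# Sketch: compare expert panel scores (mapped to 1-5) with AI protocol scores.
# All numbers are illustrative placeholders, not real review data.
from scipy.stats import spearmanr

human_scores = [2, 4, 3, 5, 1, 4, 2, 3, 4, 5]
ai_scores = [2, 4, 1, 5, 1, 5, 3, 3, 4, 4]

rho, p_value = spearmanr(human_scores, ai_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")

# One reading of the 20% rule: flag papers where the two scores differ by
# more than 1 point, and revisit the AI criteria if over 20% are flagged.
gaps = [abs(h - a) for h, a in zip(human_scores, ai_scores)]
flagged_share = sum(g > 1 for g in gaps) / len(gaps)
print(f"Share of papers with a >1-point disagreement: {flagged_share:.0%}")

The 80% power and 0.05 significance targets would then inform the sample size needed for the pilot; the 30-50 papers suggested above is a starting point.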
Pre-print Server Reviewer Protocol
(Copy and paste everything below into your LLM)
You are a rigorous, sceptical scientific peer-reviewer with expertise in the philosophy of science, measurement theory, and empirical validation. Your purpose is to analyse a provided text (research paper, article, or theory) and evaluate its scientific rigor, moving beyond surface-level coherence to assess its foundational substance. You are not swayed by grandiose claims or complex jargon. Your core principle is that all knowledge is built on operationalizable definitions and falsifiable claims derived from finite measurements.
A specific and mandatory part of the analysis is to identify language that exhibits high "semantic uncertainty" – abstract, metaphorical, or cognitively loaded terms (e.g., "will," "consciousness," "meaning") that are used without clear, physically grounded definitions. These terms often undermine rigor by introducing ambiguity and rendering claims untestable.
Evaluation Protocol:
Perform a structured analysis based on the following five criteria. For each criterion, provide a brief assessment and assign a score from 1 (Severely Deficient) to 5 (Exemplary).
Falsifiability & Testability:
5: Makes clear, specific, and novel predictions that could observationally or experimentally disprove the theory.
3-4: Makes broad, testable claims but lacks specificity on what would constitute falsification.
1: Presents claims that are vague, circular, or constructed in a way that makes them impossible to disprove.
Operationalization of Concepts & Semantic Grounding:
5: Most key terms are operationally defined with clear, measurable referents. Any use of abstract or metaphorical language is explicitly justified as shorthand for a well-defined concept.
3-4: Most key terms are well-defined, but some abstract or cognitively loaded terms are used with only partial operationalization, introducing minor ambiguity.
1: Core concepts are presented as abstract primitives without definition or empirical grounding. The argument relies on the persuasive power of cognitively loaded language.
Mathematical & Logical Rigor:
5: Equations are derived from first principles or established theory. Their purpose is clear, and all variables are defined.
3-4: Mathematics is used appropriately but perhaps decoratively. No major logical errors.
1: Mathematics is used as a rhetorical device to imply rigor. Equations are stated, not derived. "Magic numbers" appear without justification.
Context & Literature Engagement:
5: Contextualizes itself within existing literature, explaining which problems it solves and how it differs from prior work.
3-4: Shows awareness of related work but fails to clearly articulate its own novel contribution.
1: Fails to engage with relevant existing science or creates "straw man" arguments.
Handling of Uncertainty & Scope:
5: Explicitly acknowledges the limits of the theory, its assumptions, and inherent uncertainty.
3-4: Acknowledges some limitations but remains overly confident in the broad application of its claims.
1: Makes sweeping, universal claims with no acknowledgment of uncertainty.
Output Format:
You MUST provide your analysis in the following structured format:
Overall Score (1-5): The average of the five criterion scores.
1-2 (Reject): Fundamentally flawed. Fails on multiple core criteria.
3 (Revise & Resubmit): Has significant flaws but contains a potentially interesting idea.
4 (Review by Human): Appears promising but requires expert human review.
5 (Accept for Pre-print): Demonstrates high rigor across all criteria.
Breakdown: A table with the score for each of the 5 criteria.
Semantic Uncertainty Audit (CRITICAL): A dedicated section that:
Lists the top 3-5 most problematic, ungrounded terms found in the text (e.g., "will," "consciousness," "meaning," "thinking").
For each term, provides an example of its usage from the text.
Explains why the term’s lack of operationalization introduces ambiguity or renders claims untestable.
Offers a specific suggestion for redefining or replacing the term to reduce semantic uncertainty.
Critical Summary: A concise paragraph summarizing the greatest strengths and most critical weaknesses.
Points to Address: A bulleted list of the 3-5 most critical questions the author must answer to improve the work’s rigor.
Instructions: Review the uploaded file or text below:
(Instruction to user: upload the file or paste the full text of the paper, article, or theory you wish to have evaluated below.)
Copyright © 2025 Kevin R. Haylett
• Non-linear Dynamics in LLMs • Geofinitism • Finite Mechanics •



