Automated Solutions and Human Judgement – Both Crucial for AI Training

Not just any kind of data is suitable for AI training. Data preparation attracts less media attention than model releases, but it matters a great deal to the companies making those releases. A telling example is Scale AI, a company whose core business is annotating and structuring training data: last year Meta paid roughly $14.3 billion for a 49% stake in it. Preparing data for training is fast becoming the name of the game in AI, so it is no surprise that the global data labeling market is currently valued at $2.3 billion – and projected to triple by 2031.

What “AI-ready” actually requires

In addition to being clean, structured, representative, and correctly annotated, data that’s used for training agentic AI must also be current, retrievable in real time, and formatted in ways that allow the agent to reason from it without any guesswork.

The first part of meeting these requirements is automation. Sophisticated web crawlers and extraction tools let developers harvest fresh data at scale across hundreds of languages, locations, and domains simultaneously – something manual effort alone could never accomplish.

For this data to become usable, however, it has to be deduplicated, normalized, and filtered to reduce noise. These steps are largely mechanical, so automation does a good job here, as well. The harder problem, and the one where errors carry the most lasting consequences, is annotation: telling the model not just what the data contains, but what it means.
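
The mechanical steps named above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the function names, the lowercase-and-collapse-whitespace normalization, and the five-word minimum used as a noise filter are all assumptions made for the example, and it only catches exact duplicates, not near-duplicates.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization, collapse whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


def clean_corpus(records: list[str], min_words: int = 5) -> list[str]:
    """Return normalized records with short fragments and exact duplicates removed."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in records:
        text = normalize(raw)
        # Noise filter (assumed rule): drop fragments with too few words.
        if len(text.split()) < min_words:
            continue
        # Exact-duplicate check on a hash of the normalized text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```

Because duplicates are detected after normalization, two records that differ only in casing or spacing count as the same record.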

The importance of data labeling

In AI training, a label is essentially a piece of information that guides a model on how to understand something. For instance, an image of a tumor might be labeled as “malignant,” or a sentence could be tagged to indicate it expresses negative sentiment. Even a simple question-and-answer pair can have its response rated as “helpful,” “accurate,” or “harmful.” Since models learn by identifying patterns in these examples, the quality of annotation sets the upper limit of what any particular system can achieve.

For certain tasks – reading ambiguous medical images, catching sarcasm, assessing whether a response is appropriately nuanced – human judgment remains the more reliable option. Machines can approximate many of these tasks, but approximation is often not good enough – particularly when the cost of getting it wrong is high.

In practice, annotation is a structured process with guidelines, review stages, and consistency checks built in. The most revealing application of all this is Reinforcement Learning from Human Feedback, or RLHF. Human raters compare model responses and flag which ones are more useful, accurate, or appropriate – and those judgments, run across thousands of comparisons, gradually pull the AI toward behavior people actually want from it. The best language models available today – including GPT-5, Claude, and Gemini – went through this repeatedly.
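
To make the idea of learning from comparisons concrete, here is a sketch of one common way such judgments are turned into a training signal: a pairwise preference loss over reward scores. The article does not say which formulation any particular lab uses, so treat this as an assumption; the function names and the scores in the usage note are hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the response the rater preferred already
    scores higher, and large when the scores disagree with the rater.
    """
    margin = reward_chosen - reward_rejected
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the loss over (reward_chosen, reward_rejected) pairs."""
    if not pairs:
        raise ValueError("at least one comparison is required")
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)
```

Calling `preference_loss(2.0, 0.5)` gives a smaller value than `preference_loss(0.5, 2.0)`: a scoring model is penalized more when it ranks the pair the opposite way from the human rater.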

Automated labeling has made this process viable at scale. Tools that pre-annotate data using existing models – routing only uncertain cases to human reviewers – have dramatically reduced the time and cost of building large training sets. Without this, the datasets that power modern AI would simply be out of reach. The tradeoff is that automated pre-labeling inherits whatever biases the labeling model carries, and edge cases – the examples most important for robustness – are the ones most likely to slip through.
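
The routing step described here can be sketched as a simple confidence threshold. Everything in the example is an assumption made for illustration: the `Prediction` record, the `route` function, and the 0.9 cutoff are hypothetical, and real tools may measure uncertainty differently.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model's probability for its top label, in [0, 1]


def route(
    predictions: list[Prediction], threshold: float = 0.9
) -> tuple[dict[str, str], list[str]]:
    """Split pre-labels into auto-accepted labels and a human review queue."""
    auto_accepted: dict[str, str] = {}
    review_queue: list[str] = []
    for p in predictions:
        if p.confidence >= threshold:
            # Confident pre-label: keep it without human review.
            auto_accepted[p.item_id] = p.label
        else:
            # Uncertain pre-label: send the item to a human annotator.
            review_queue.append(p.item_id)
    return auto_accepted, review_queue
```

The sketch also shows where the tradeoff bites: a prediction that is confident but wrong never reaches the review queue, which is how a labeling model's biases can pass into the training set unnoticed.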

This is far from hypothetical. A landmark 2019 study in Science found that a widely used healthcare algorithm systematically underestimated the needs of Black patients – not because of a coding error, but because its training data reflected historical inequalities in who received care. A 2025 LSE study found the same pattern in a different domain: Google’s Gemma produced inequitable summaries of social care case notes depending on patient gender. Human review remains the most dependable means of solving, or at least minimizing, such problems.

Human judgment doesn’t start at labeling

Labeling is the most discussed form of human involvement in data preparation, but the need for expert input doesn’t begin or end there.

Before the data gathering process begins, someone has to determine which data to collect – and this determines everything that follows. A training dataset built on a narrow range of sources will produce a system with a narrow range of behaviors.

For instance, models trained on web data collected without attention to geographic or demographic coverage will perform unevenly across populations. A review of radiology and biomedical research found that 71% of patient data used to train deep-learning diagnostic models came from just three US states. The same review found that recidivism prediction models trained on data from one state produced skewed results when deployed in others. Catching these flaws calls for someone who understands the use case well enough to ask whether the sample actually reflects the world the system will operate in.
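
A basic coverage check of this kind is easy to automate, even though interpreting the result still takes domain knowledge. The sketch below assumes each record carries a single group tag (a state, region, or demographic category); the function name and the choice to report the top three groups are hypothetical.

```python
from collections import Counter


def coverage_report(
    groups: list[str], top_n: int = 3
) -> tuple[list[tuple[str, float]], float]:
    """Return each group's share of the data and the combined share of the top_n groups."""
    if not groups:
        raise ValueError("at least one record is required")
    counts = Counter(groups)
    total = sum(counts.values())
    ranked = counts.most_common()  # largest groups first
    shares = [(group, count / total) for group, count in ranked]
    top_share = sum(count for _, count in ranked[:top_n]) / total
    return shares, top_share
```

A high `top_share` does not prove the dataset is unfit, but it is a prompt for the question raised above: does the sample reflect the population the system will actually serve?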

Once data collection has begun, curation and validation present their own challenges. Synthetic data can fill shortfalls where real-world examples are scarce or sensitive. Whether it’s realistic enough to train on, though, is something a subject-matter expert has to assess – there’s no automated test that can settle the question.

After the model goes live, the data feeding it keeps changing – and AI systems encounter inputs they were never prepared for. For agentic AI, the consequences are quite specific: an agent working from outdated or incomplete data takes actions, not just guesses. Human scrutiny is still the most reliable way to catch this early.

The combination that works

Automation is what makes AI development possible at the scale at which it now operates. The volumes of data involved – and the speed at which training sets need to be built, tested, and revised – exceed what any human-driven process could handle. By some estimates, more than 70% of meaningful improvements in model performance are attributable to data quality rather than architectural changes, and reaching that quality threshold depends on automated collection and processing as much as anything else. Human annotators, for their part, are slow, expensive, and inconsistent in ways that don’t disappear with better management.

What neither can do alone is produce AI models that are both capable and reliable. Automation delivers the scale; human judgment shapes what that scale is used for – in deciding what to collect, how to label it, how to validate it, and how to catch the failures that algorithmic review misses. The organizations that treat either side of this as a solved problem tend to build systems that perform well in controlled evaluations and disappoint in production. The gap between the two is, almost always, a data problem.




Amelia Frost
