Experiment Reproducibility in Neural Network Development
Reproducibility is not a nice-to-have in production ML; it is a prerequisite for making correct decisions. An experiment you cannot reproduce is not evidence; it is a rumor.
Why Neural Network Experiments Are Hard to Reproduce
Neural network training is stochastic at multiple levels. Random weight initialization produces different starting points. Data shuffle order affects gradient estimates. Dropout randomizes the computation graph at each step. Floating-point non-determinism in GPU operations can produce slightly different results even with identical seeds. These sources of randomness compound across thousands of training iterations, often producing meaningfully different models from identical code and identical data.
This intrinsic stochasticity is compounded by the complexity of the software stack. PyTorch or TensorFlow versions can change operation behavior in subtle ways. CUDA versions affect the numerical behavior of GPU operations. OS and driver versions affect timing and context switching. The dependency graph of a typical ML training environment includes dozens of packages, each of which can introduce version-specific behavior. Reproducing a result from six months ago, even with the same code, can be surprisingly difficult if the environment has changed.
Finally, there is the documentation problem. ML experiments are typically run interactively, with notebooks and ad hoc scripts that capture the final configuration but not the exploration path that led there. The experiment that produced a good result may have required a dozen previous experiments to set up correctly, but only the final one is documented. Reproducing not just the result but the reasoning process that produced it is nearly impossible without deliberate documentation practices.
The Infrastructure Layer: Seeds, Environments, and Artifacts
Reproducibility starts at the infrastructure layer, with systematic management of the three things that determine whether a training run can be reproduced: random seeds, execution environments, and training artifacts.
Seed management goes beyond the commonly known practice of setting torch.manual_seed or numpy.random.seed. For full reproducibility, you also need to enable deterministic cuDNN kernels (torch.backends.cudnn.deterministic = True), which may have a small performance cost but removes most GPU-level non-determinism; torch.use_deterministic_algorithms(True) goes further by raising an error on remaining non-deterministic operations. You need to set the PYTHONHASHSEED environment variable before the interpreter starts, so that string hashing, and therefore set iteration order, is deterministic across runs. And you need to record and log the seeds actually used, not just set them, so a run can still be reproduced when no seed was explicitly provided.
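The practices above can be sketched as a single seeding helper. This is a minimal sketch, not a complete determinism guarantee: the function name is hypothetical, numpy and torch are seeded only when installed, and PYTHONHASHSEED is only fully effective if exported before the interpreter starts.

```python
import logging
import os
import random

def seed_everything(seed=None):
    """Seed common randomness sources and log the seed actually used.

    If no seed is given, one is generated and logged, so the run can
    still be reproduced after the fact.
    """
    if seed is None:
        seed = int.from_bytes(os.urandom(4), "big")  # record it, don't lose it
    logging.info("Using random seed: %d", seed)

    # Only fully effective if set before the interpreter starts;
    # recorded here so the launcher can export it on reruns.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:  # numpy and torch are optional here; seed them when present
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds CPU and all CUDA devices
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass
    return seed
```

Returning the seed matters: the caller can write it into the experiment record alongside the rest of the configuration.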
Environment reproducibility requires containerization. Docker images that pin the Python version, framework versions, CUDA version, and all dependency versions provide the only reliable way to ensure that code runs the same way across different machines and different times. For production ML systems, maintaining a registry of environment images associated with model versions is as important as maintaining the model weights themselves.
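Containerization itself lives in the Dockerfile, but it helps to record an environment fingerprint with every run so that mismatches are detectable later. A hedged sketch using only the standard library; the function name and the choice of fields are illustrative, not a fixed schema:

```python
import hashlib
import json
import platform
import sys
from importlib import metadata

def environment_fingerprint():
    """Capture the software environment of the current run as a dict
    plus a short hash that can be compared across runs."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )
    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": packages,
    }
    digest = hashlib.sha256(
        json.dumps(env, sort_keys=True).encode()
    ).hexdigest()[:16]
    return env, digest
```

Storing the digest with the run makes "was this reproduced in the same environment?" a string comparison rather than an archaeology project.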
Artifact management means treating all inputs and outputs of training runs as versioned, addressable objects. Training data, preprocessing configurations, model checkpoints, evaluation results, and hyperparameter configurations should all be stored with content-addressed identifiers (hashes) that uniquely identify their exact content. This makes it possible to verify that a reproduction used exactly the same data and configuration as the original, not just files with the same names.
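Content addressing reduces to hashing bytes and using the hash as the identifier. A minimal sketch (the store layout and function names are assumptions, not a specific tool's API):

```python
import hashlib
from pathlib import Path

def content_address(path, chunk_size=1 << 20):
    """Return a content-addressed identifier for a file: the SHA-256 of
    its bytes. Two files with the same hash have the same content,
    regardless of filename or location."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def store_artifact(path, store_dir):
    """Copy a file into a flat content-addressed store, keyed by hash.
    Verifying a reproduction later is just re-hashing and comparing."""
    digest = content_address(path)
    dest = Path(store_dir) / digest
    if not dest.exists():  # identical content is stored exactly once
        dest.write_bytes(Path(path).read_bytes())
    return digest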
Experiment Tracking: The Operations Log
Seed and environment management addresses the technical prerequisites for reproducibility. Experiment tracking addresses the organizational challenge of recording enough context to understand and reproduce not just a single run but the entire history of experimentation that led to a decision.
Effective experiment tracking captures four categories of information. Configuration: the complete set of hyperparameters, architecture specifications, and data processing settings, preferably in a serialized format that can be deserialized to reproduce the exact run. Metrics: evaluation measurements at each checkpoint, with enough granularity to reconstruct training curves and understand convergence behavior. Artifacts: links to the model checkpoints, training data versions, and environment images associated with the run. Provenance: the connection between this run and the previous experiments that motivated it.
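The four categories map naturally onto a small record type. A sketch of what a tracked run might look like, assuming a flat JSON-serializable schema; real trackers like MLflow or Weights & Biases define their own:

```python
import json
import time
from dataclasses import asdict, dataclass, field

@dataclass
class ExperimentRecord:
    """One tracked run: configuration, metrics, artifacts, provenance."""
    run_id: str
    config: dict                                     # full hyperparameter / data settings
    metrics: dict = field(default_factory=dict)      # metric name -> list of (step, value)
    artifacts: dict = field(default_factory=dict)    # artifact name -> content hash or URI
    parent_runs: list = field(default_factory=list)  # provenance: runs that motivated this one
    created_at: float = field(default_factory=time.time)

    def log_metric(self, name, step, value):
        self.metrics.setdefault(name, []).append((step, value))

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)
```

The parent_runs field is the piece most tools leave to free-text notes; making provenance a first-class field is what lets you later reconstruct the chain of experiments behind a decision.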
Tools like MLflow, Weights & Biases, and NeurFly's built-in tracking provide the infrastructure for recording this information automatically. The challenge is not tool selection; it is discipline in using whatever tool is available consistently across all experiments, including the exploratory ones that "might not go anywhere." In retrospect, the exploratory experiments are often the most important to reproduce, because they provide the context for understanding why certain approaches were abandoned.
Statistical Reproducibility vs. Exact Reproducibility
In the presence of irreducible stochasticity, exact bit-for-bit reproducibility is sometimes impossible and always unnecessary. What matters for practical purposes is statistical reproducibility: the ability to produce results that are statistically consistent with the original, even if not numerically identical.
Assessing statistical reproducibility requires running multiple independent experiments and comparing their distributions. A result is statistically reproducible if a new set of runs with the same configuration produces results within the confidence interval of the original. A result that requires many runs to find, or that is statistically inconsistent across reruns, is not a reliable basis for production decisions regardless of how good the single best run looks.
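This comparison can be made concrete with a few lines of standard-library Python. A rough sketch: the function names are illustrative, and the normal-approximation interval is a simplification (a t-interval is more appropriate for very small numbers of runs):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean of a set
    of independent run results."""
    m = mean(values)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half = z * stdev(values) / sqrt(len(values))
    return m - half, m + half

def statistically_reproduced(original_runs, new_runs, confidence=0.95):
    """Treat a result as reproduced if the mean of the new runs falls
    inside the confidence interval of the original runs."""
    lo, hi = confidence_interval(original_runs, confidence)
    return lo <= mean(new_runs) <= hi
```

The same machinery also supports honest reporting: publish the interval, not the single best run.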
This is an important discipline that many teams neglect. Reporting the best result across many runs without acknowledging the variance is a common source of misleading comparisons in both academic papers and internal ML reports. Production decisions should be based on the expected performance of a model over multiple independent training runs, not the best-case outcome of a lucky initialization.
NAS and Reproducibility: Special Considerations
Neural Architecture Search introduces additional reproducibility challenges. The search process itself is a large-scale stochastic optimization, and the architectures it discovers can vary across runs. Reporting a single NAS result without characterizing the distribution of discovered architectures over multiple search runs can give a misleading impression of the technique's reliability.
NeurFly's platform addresses this by maintaining a complete record of the search history — every architecture sampled, its evaluated performance, and the decisions made by the search algorithm at each step. This makes the search process auditable and reproducible: given the same supernet weights, search configuration, and random seeds, the search will produce the same sequence of architectural decisions.
It also enables a richer analysis of search results: rather than reporting a single "best" architecture, you can characterize the distribution of good architectures found across multiple search runs, identify architectural choices that are consistently selected (strong priors) versus those that vary (weak priors), and understand the sensitivity of the results to search hyperparameters.
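The strong-versus-weak-priors analysis can be sketched in a few lines. This assumes a simplified representation in which each search run is summarized as a mapping from decision point to chosen operation; the function name and data shape are illustrative, not NeurFly's actual API:

```python
from collections import Counter

def choice_consistency(search_runs):
    """Given several NAS runs, each a dict mapping a decision point
    (e.g. a layer name) to the operation chosen there, report how often
    the most common choice wins at each decision point. Values near 1.0
    indicate strong priors; values near 1/num_options indicate weak ones."""
    consistency = {}
    for point in search_runs[0]:
        counts = Counter(run[point] for run in search_runs)
        consistency[point] = counts.most_common(1)[0][1] / len(search_runs)
    return consistency
```

Decision points that score near 1.0 across runs are the choices the search is genuinely confident about; low-scoring points are the ones worth examining for sensitivity to search hyperparameters.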
Key Takeaways
- Neural network training is stochastic at multiple levels; reproducibility requires explicit management of all randomness sources.
- Environment containerization is the only reliable way to ensure code runs consistently across time and machines.
- Experiment tracking should capture configuration, metrics, artifacts, and provenance for every experiment, including exploratory ones.
- Statistical reproducibility — consistency over multiple runs — matters more than exact numerical reproducibility.
- NAS introduces additional reproducibility challenges; search history logging and multi-run analysis are essential.
Conclusion
Reproducibility is an investment that pays back through faster debugging, more reliable decision-making, and reduced organizational risk when team composition changes. ML systems that cannot be reproduced cannot be trusted, maintained, or improved in a principled way.
The technical infrastructure for reproducibility is well-established and increasingly automated. The remaining challenge is cultural: building teams that treat every experiment as a record to be preserved and every result as a claim to be validated, not just a number to be reported. That culture is what separates teams that compound their learning over time from teams that repeatedly rediscover the same insights.
Our platform builds reproducibility in by default. If you want to talk about implementing reproducibility practices in your own ML organization, reach out through our contact page.