Neural Net Optimization, Part 1 — The Computational Graph

The first in a three-part series exploring neural network optimization, starting from scratch with a toy computational graph and building up to scalable, practical ML systems.

In the Beginning

Welcome to the first post in a three-part series on neural network optimization! My goal is to share a practical perspective on how the scale and complexity of optimizing neural networks evolve, especially for those who, like me, started without much formal training. Over the years, I’ve helped several early-stage companies move from a handful of Jupyter notebooks to more robust, scalable ML workflows. This series chronicles that journey and the lessons learned along the way.

Three Levels of Neural Network Implementation

  1. Computational Graph:
    A simple computational graph, built from scratch, supporting a forward pass (post-order traversal) and auto-differentiation (a backward pass), which is enough to fit any network it can express to a set of data points.

  2. PyTorch Example:
    A minimal PyTorch implementation, capable of more complex learning on a basic multilayer perceptron. This demonstrates how frameworks can simplify and scale up what’s possible.

  3. Scaling Up:
    The real-world challenge: distributing training and tuning across multiple GPUs, or running many experiments in parallel on a single GPU.


The Repository

All code for this post is available on GitHub.


The Computational Graph

This is the third time in my life I’ve built this example from scratch, and I promise myself it’ll be the last! But it’s a fantastic way to understand the basics of neural networks, how they’re optimized, and why frameworks like PyTorch exist.

If you’ve ever wondered what’s happening “under the hood” in PyTorch or TensorFlow, this is it: at their core, these frameworks build and manipulate computational graphs just like this one—except they do it automatically, at scale, and with hardware acceleration. Every neural network you define in PyTorch is internally represented as a computational graph, where each operation (like addition, multiplication, or activation functions) becomes a node, and the framework handles the forward and backward passes for you.
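For instance, this tiny snippet of standard PyTorch autograd usage (not code from this post’s repository) shows exactly that: the graph is recorded while the forward expression is evaluated, and backward() traverses it in reverse to fill in gradients.

import torch

# Two scalar parameters; requires_grad=True tells autograd to record
# every operation on them into a computational graph.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

# Forward pass: each *, + and ** becomes a node in the graph.
y = a * b + b ** 2

# Backward pass: autograd walks the graph in reverse, accumulating
# dy/da and dy/db into the .grad attributes.
y.backward()

print(a.grad)  # dy/da = b        -> 3
print(b.grad)  # dy/db = a + 2*b  -> 8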

The computational graph here is made from a few simple nodes (constants, inputs, addition, multiplication, and power). A graph object manages these nodes, letting us interact with the network as a whole rather than node-by-node.
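To make that concrete, here is a minimal sketch of the same idea (deliberately simplified, not the repository’s exact classes): each node stores its value, the nodes it was computed from, and a rule for pushing gradients back to them; the forward pass happens as the expression is built, and backward() does a post-order traversal followed by a reverse sweep that applies the chain rule. Constants and inputs are simply leaf nodes with no inputs of their own.

class Node:
    """A scalar value in the graph: records its inputs and a local gradient rule."""
    def __init__(self, value, inputs=()):
        self.value = value
        self.grad = 0.0
        self.inputs = inputs
        self._backward = lambda: None  # pushes self.grad back to the input nodes

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other))
        def _backward():               # d(out)/d(self) = d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other))
        def _backward():               # product rule
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out._backward = _backward
        return out

    def __pow__(self, k):              # k is a plain number, not a Node
        out = Node(self.value ** k, (self,))
        def _backward():               # d(x^k)/dx = k * x^(k-1)
            self.grad += k * self.value ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Post-order traversal: each node appears after the nodes it depends on.
        order, seen = [], set()
        def visit(n):
            if n not in seen:
                seen.add(n)
                for p in n.inputs:
                    visit(p)
                order.append(n)
        visit(self)
        self.grad = 1.0                # d(output)/d(output)
        for n in reversed(order):      # reverse sweep = chain rule
            n._backward()

With only these three operations you can already express the polynomial used later in this post. The graph object described above plays the bookkeeping role that is folded into backward() here, letting you run forward and backward passes over the whole network at once.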


Why This Is Hard to Scale

While building a computational graph from scratch is a great learning exercise, it quickly becomes impractical for real-world ML tasks, for two key reasons:

1. Complexity and Scalability:
    Every new operation needs hand-written forward and backward rules, and a node-by-node Python graph has none of the vectorization, hardware acceleration, or engineering that frameworks provide, so it breaks down long before real-world model sizes.

2. The Hyperparameter Trap:
    Even this tiny example only converges when the learning rate, its decay schedule, and the gradient cap are chosen carefully, and the space of combinations to search grows multiplicatively with every new knob (as the sketch after this list illustrates).
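To put a rough number on that second point, here is a back-of-the-envelope sketch; the grid values are illustrative, loosely based on the constants used later in this post.

import itertools

# A small, illustrative search grid over the knobs this post uses.
grid = {
    "LR":       [0.001, 0.01, 0.1],
    "LR_DECAY": [0.99, 0.999, 1.0],
    "MAX_GRAD": [10, 1000, 1e10],
    "EPOCHS":   [100, 300, 1000],
}
combos = list(itertools.product(*grid.values()))
print(len(combos))  # 3 * 3 * 3 * 3 = 81 runs for just four hyperparameters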


Example: Fitting a Polynomial

Take the example in my repository, example_polynomial_fit.py.
It fits a dataset of the form:

\[y = c_0 x_1^2 + c_1 x_1 + c_2 x_2^2 + c_3 x_2 + c_4\]

using a hand-built computational graph.
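In terms of the Node sketch from earlier, the model is nothing more than this expression with the five coefficients as trainable leaves (an illustrative reconstruction, not the repository’s exact code):

import random

# Coefficients c0..c4 are leaf nodes whose values we will adjust;
# x1 and x2 are input leaves created fresh for each data point.
coeffs = [Node(random.uniform(-1.0, 1.0)) for _ in range(5)]

def predict(x1, x2):
    c0, c1, c2, c3, c4 = coeffs
    x1, x2 = Node(x1), Node(x2)
    return c0 * x1 ** 2 + c1 * x1 + c2 * x2 ** 2 + c3 * x2 + c4

Fitting then means repeatedly building a squared-error loss node on top of predict, calling backward(), and nudging each coefficient against its gradient, which is exactly where the hyperparameters below come in.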


The Good: When It Works

With reasonable hyperparameters, the model converges smoothly:

MAX_GRAD = 1000    # gradient clipping threshold
MIN_LR = 1e-3      # lower bound for the decayed learning rate
LR_DECAY = 0.999   # multiplicative learning-rate decay
EPOCHS = 300       # training epochs
LR = 0.01          # initial learning rate
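Here is a sketch of how constants like these are typically wired into the training loop (an assumption about the loop’s shape, not the repository’s exact code), reusing predict and coeffs from the polynomial sketch above and a data list of (x1, x2, y) triples:

lr = LR
for epoch in range(EPOCHS):
    for x1, x2, y in data:
        loss = (predict(x1, x2) + Node(-y)) ** 2        # squared error
        for c in coeffs:
            c.grad = 0.0                                # reset accumulated gradients
        loss.backward()
        for c in coeffs:
            g = max(-MAX_GRAD, min(MAX_GRAD, c.grad))   # clip the gradient to MAX_GRAD
            c.value -= lr * g                           # gradient descent step
    lr = max(MIN_LR, lr * LR_DECAY)                     # decay the LR, but never below MIN_LR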

The Bad: When It Doesn’t

But with a bad learning rate (and uncapped gradients), the solution becomes unstable and never converges:

MAX_GRAD = 1e10    # so large that gradients are effectively uncapped
MIN_LR = 1e-3      # lower bound for the decayed learning rate
LR_DECAY = 1       # no decay: the learning rate never shrinks
EPOCHS = 300       # training epochs
LR = 0.1           # ten times the stable learning rate above
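The mechanism behind that blow-up is easy to see on a deliberately stripped-down problem (unrelated to the repository, just the arithmetic): minimizing f(w) = 10 * w^2 by gradient descent. Once the step size is too large for the curvature, every update overshoots the minimum by more than it corrects, so the iterates oscillate with growing magnitude; the gradient cap is what normally keeps such a step from running away.

# Gradient descent on f(w) = 10 * w**2 with too large a step size.
w, lr = 1.0, 0.11
for step in range(5):
    grad = 20 * w        # f'(w)
    w -= lr * grad
    print(step, w)
# w: -1.2, 1.44, -1.728, 2.0736, -2.48832  (oscillating and growing)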

Key Takeaways

1. Frameworks like PyTorch are, at their core, computational graphs with the forward and backward passes automated; building one by hand is the clearest way to see what they actually do.

2. A hand-built graph is a learning tool, not a production tool: it lacks the vectorization, hardware acceleration, and engineering needed to scale to real workloads.

3. Convergence hinges on hyperparameters: the same model fits smoothly with a sensible learning rate, decay, and gradient cap, and never converges without them.

In Part 2, we’ll see how PyTorch implements neurons and composes them into practical neural networks, building on the computational-graph concepts introduced here.