Let me tell you about a paper that fundamentally changed how we think about artificial intelligence. It’s 1958, and a psychologist named Frank Rosenblatt publishes something in Psychological Review – not a math or computer science journal – that would lay the groundwork for the neural networks powering today’s AI revolution.
Why a Psychologist Was Building Electronic Brains
Here’s what I find fascinating about Rosenblatt’s story. He wasn’t approaching this from pure computer science – he worked at Cornell Aeronautical Laboratory and was deeply interested in how actual brains work. He lays out three big questions:
(1) how sensory info is detected,
(2) how it’s stored, and
(3) how stored info influences behavior.

At that time, people knew a bit about detection, but storage and recall were still pretty mysterious.
The Biology Behind the Idea
Let’s think about how we would model the human brain if we were just starting from scratch. What are some ideas that we would want to consider?
Well, first, it would be valuable to know how the human brain works. Unfortunately, no one really knows exactly how the human brain works, but there are some things that we have observed. A neuron has these branch-like structures called dendrites that receive signals from other neurons. On the opposite end, there’s an axon that transmits signals out. The really interesting part is that neurons aren’t physically wired together – there’s a tiny gap called a synapse where electrochemical signals jump between cells.

Neurons are commonly grouped into three classes:
- sensory neurons
- motor neurons
- interneurons (the relay cells that sit between the other two)
Now think about synapses and the way cells communicate. It has been observed that the more often one cell’s axon signals another cell’s dendrite, the stronger the bond between them becomes. That matches everyday experience: practice something over and over and it becomes muscle memory; study for an exam repeatedly and the material becomes second nature. The more we do something, the better we get at it. So the synapse between two cells strengthens with use, and over time more cells may join the circuit. Setting all of that biological complexity aside, this is not literally how artificial neural networks work, but there are ideas here we can borrow.
Rosenblatt contrasts two ideas: one is that memory is like a “code” or a stored image, where you could theoretically reconstruct the original input if you knew the code. The other is the “connectionist” view: that memory is just about changing the connections between neurons, not storing a literal image.
Here’s the key insight that Rosenblatt latched onto: the more two neurons communicate, the stronger their connection becomes. You’ve experienced this yourself – when you practice something over and over, it becomes muscle memory. When you study for an exam repeatedly, the information becomes second nature. The brain physically strengthens connections that get used frequently.

The brain is a network of neurons wired together in enormously intricate ways. We can’t replicate that in computer science, at least not yet: neural networks aren’t smart enough to decide for themselves which neurons should be wired to which. Instead, we wire groups of cells to other groups of cells, and we call those groupings layers. The cells in layer one are then wired to the cells in layer two. Incidentally, if every single cell in layer one is connected to every cell in layer two, that’s called a fully connected network.
We might wire every single one just to see what happens, or we might use other configurations, called architectures, each with a different design; we’re learning more about this all the time. So how many cells do you put in a layer? We have no idea. We don’t even have a mathematical model for it. We take the kitchen-sink approach.
We’ll try some number, say 100 neurons in layer one, and an arbitrary five layers. Or maybe we’ll start from an intuition based on someone else’s research. Then we’ll try different combinations: 100, 1,000, 10,000, a million, and see which one gives the best result.
But at that time they couldn’t try every combination. More nodes means more compute, because think about it: if you have 100,000 cells in one layer fully connected to 100,000 cells in the next layer, how many connections do you have? 100,000 × 100,000, or ten billion.
This gave Rosenblatt an idea: What if we could model this mathematically?
Let’s talk about this idea that the connection between two cells grows stronger the more it is used. That has been observed in the natural world, but it’s hard to emulate directly in software. Hebbian theory, introduced by Donald Hebb in his 1949 book The Organization of Behavior, suggests a way. To emulate the output of one cell feeding the input of another, with the connection between them strengthening over time, we can model the link as a probability that the two will fire together. As that probability goes up, we’re emulating a stronger connection. Imagine a knob between the two cells: turn it all the way down to zero and there’s zero probability they fire together; turn it all the way up to one and they fire together 100% of the time. By adjusting that knob, you adjust the strength of the connection between the two cells.
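To make the knob concrete, here is a minimal sketch of the idea (my own illustration, not code from Rosenblatt or Hebb): the connection between two cells is a single number between 0 and 1, read as the probability that the downstream cell fires when the upstream cell does, and it is nudged upward every time the two fire together.

```python
import random

class Connection:
    """A 'knob' between two cells: a firing probability in [0, 1]."""
    def __init__(self, strength=0.1):
        self.strength = strength  # 0 = never fire together, 1 = always

    def downstream_fires(self, upstream_fired):
        """If the upstream cell fired, the downstream cell fires with
        probability equal to the current connection strength."""
        return upstream_fired and random.random() < self.strength

    def hebbian_update(self, upstream_fired, downstream_fired, step=0.05):
        """Turn the knob up a little whenever the two cells fire together."""
        if upstream_fired and downstream_fired:
            self.strength = min(1.0, self.strength + step)

# Repeated co-activation strengthens the connection over time.
conn = Connection()
for _ in range(200):
    upstream = True                      # pretend the upstream cell keeps firing
    downstream = conn.downstream_fires(upstream)
    conn.hebbian_update(upstream, downstream)
print(round(conn.strength, 2))           # drifts upward from 0.1 toward 1.0
```

Run it and the strength climbs toward 1.0: the more the two cells fire together, the more likely they are to fire together the next time.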
So putting this into practical terms: if we had 100 neurons in one layer and wanted to fully connect them to 100 neurons in another layer, we would have 100 × 100 = 10,000 connections, which means we need 10,000 little knobs. How do you tune 10,000 knobs to find the optimal output? Through trial and error, which is effectively what we do: we try different values, adjust them randomly, and hope to stumble on the best combination. That’s effectively how we train a neural network.
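As a toy illustration of that trial-and-error tuning (the task and error measure here are invented purely for the sketch), we can nudge all 10,000 knobs at random and keep a change only when it helps:

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 100, 100                  # 100 x 100 = 10,000 "knobs"
weights = rng.random((n_in, n_out))     # start from random knob settings

# A made-up "ideal" setting so the sketch has something to measure against;
# in a real network the error would come from task performance instead.
target = rng.random((n_in, n_out))

def error(w):
    """How far the current knob settings are from the made-up ideal."""
    return float(np.mean((w - target) ** 2))

best_err = error(weights)
for _ in range(5000):
    # Nudge every knob by a small random amount and keep the change
    # only if it lowered the error -- pure trial and error.
    candidate = weights + rng.normal(scale=0.01, size=weights.shape)
    err = error(candidate)
    if err < best_err:
        weights, best_err = candidate, err

print(round(best_err, 4))               # drifts downward as lucky nudges accumulate
```

Even in this small sketch progress is slow, which hints at why networks with hundreds of thousands of connections were out of reach at the time.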
The Perceptron: A Probabilistic Brain Model
Reading through Rosenblatt’s paper today, I’m struck by how he frames everything in terms of probability theory rather than the symbolic logic that was popular at the time. He explicitly criticizes the discrete, binary approaches others were using, saying they’re “less well suited” for modeling systems where we don’t know the precise structure.
His perceptron model rested on five key principles that still resonate today:
- Random initialization: The initial connections in the network are largely random, subject to minimal genetic constraints. (Sound familiar? We still randomly initialize neural network weights today!)
- Plasticity through learning: After exposure to stimuli, the probability that one set of cells will trigger another changes due to “relatively long-lasting changes in the neurons themselves.”
- Similarity creates pathways: Similar stimuli tend to activate the same responding cells, while dissimilar ones activate different cells.
- Reinforcement shapes connections: Positive and negative feedback can strengthen or weaken the connections being formed.
- Similarity is relative: What counts as “similar” depends entirely on the physical organization of the perceiving system and how it evolved through interaction with its environment.
That last point particularly strikes me – Rosenblatt understood that similarity isn’t some objective mathematical property. It emerges from the architecture of the learning system itself.
The layers of the perceptron were described as follows:
- you have the input layer (the “retina” of sensory units)
- one or more layers of “association units” that receive inputs from the sensory units
- output layer of response units

These connections are essentially random at the start. The perceptron learns by adjusting the strength of these connections based on experience.
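To picture the layout, here is a small structural sketch (the sizes, connection densities, and threshold are invented for illustration): a sensory “retina”, a layer of association units wired to it at random, and a pair of response units wired to the association units. The wiring is fixed at random; only the connection strengths would later change.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented sizes: a 20x20 "retina", 100 association units, 2 response units.
n_sensory, n_assoc, n_response = 400, 100, 2

# Random wiring: 1.0 where a connection happens to exist, 0.0 elsewhere.
s_to_a = (rng.random((n_sensory, n_assoc)) < 0.05).astype(float)
a_to_r = (rng.random((n_assoc, n_response)) < 0.20).astype(float)

stimulus = (rng.random(n_sensory) < 0.3).astype(float)   # a random retinal pattern
a_active = (stimulus @ s_to_a > 3).astype(float)          # A-units over a small threshold
r_input = a_active @ a_to_r                               # input reaching each R-unit

print(int(a_active.sum()), "association units responded")
print("response unit inputs:", r_input)
```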
The Mathematics of Learning
What made the perceptron special wasn’t just the biological inspiration – it was that Rosenblatt worked out the math. He showed how you could model neurons as units that sum up excitatory and inhibitory inputs, and if that sum exceeds a threshold, the unit “fires” in an all-or-nothing fashion. This is essentially the step function activation we still teach in intro neural network courses.
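A minimal sketch of such an all-or-nothing unit (the inputs and threshold here are made up): excitatory inputs add, inhibitory inputs subtract, and the unit emits 1 only when the net input reaches the threshold.

```python
def threshold_unit(excitatory, inhibitory, threshold):
    """All-or-nothing unit: fire (1) if net input reaches the threshold, else 0."""
    net_input = sum(excitatory) - sum(inhibitory)
    return 1 if net_input >= threshold else 0

# Example: three excitatory signals, one inhibitory signal, threshold of 2.
print(threshold_unit([1, 1, 1], [1], threshold=2))  # net = 2 -> fires (1)
print(threshold_unit([1, 0, 0], [1], threshold=2))  # net = 0 -> silent (0)
```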
He was modeling learning as a fundamentally probabilistic process, not a deterministic one. This is why the perceptron can generalize what it has learned: it isn’t just memorizing specific inputs, it can extend from the examples it has seen to new, similar inputs.
Rosenblatt splits the network’s reaction to a stimulus into:
- Predominant phase: right after a stimulus hits the “retina,” lots of scattered A-units light up across the association layer. Multiple response pools (source-sets) receive input simultaneously. Think of this as a brief, messy many-candidates state.
- Postdominant phase: very quickly, one response pool wins and suppresses the others via inhibitory feedback (the classic winner-take-all dynamic). Activity collapses to a single response unit (R-unit) and its source-set; rivals are shut down.
How the “winner” is picked: two selection rules
Rosenblatt analyzes two idealized policies for deciding which response takes over in that postdominant phase:
- Mean-discriminating system (μ-system)
  - The system computes (implicitly) the average input strength per active connection for each response pool—i.e., the mean value across the inputs that particular response is receiving.
  - The response with the highest mean spikes first, gains a head start, and becomes dominant.
  - Intuition: this normalizes for how many A-units happen to be active. A pool with a few strong signals can beat a pool with many weak signals.
- Sum-discriminating system (S-system)
  - The system compares total input (i.e., the sum of values over all active inputs) into each response pool.
  - The response with the largest sum gains the advantage and wins.
  - Intuition: this doesn’t normalize by count, so a pool with more active inputs can win even if each input is only moderately strong.
The μ-system is often more stable and robust because it’s less sensitive to random fluctuations.
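Here is a small sketch contrasting the two selection rules (the pools and numbers are invented): each response pool receives a list of input values from its currently active A-units; the S-system ranks pools by total input, the μ-system by average input per active unit.

```python
def pick_winner_sum(pools):
    """S-system: the response whose active inputs sum highest wins."""
    return max(pools, key=lambda name: sum(pools[name]))

def pick_winner_mean(pools):
    """μ-system: the response with the highest average input per active unit wins."""
    return max(pools, key=lambda name: sum(pools[name]) / len(pools[name]))

# Response A gets a few strong signals; response B gets many weak ones.
pools = {
    "A": [0.9, 0.8, 0.85],                                            # total 2.55, mean 0.85
    "B": [0.3, 0.25, 0.3, 0.35, 0.3, 0.3, 0.3, 0.3, 0.25, 0.3],       # total 2.95, mean ~0.30
}
print(pick_winner_sum(pools))    # "B": more active inputs carry the day
print(pick_winner_mean(pools))   # "A": a few strong signals beat many weak ones
```

The same inputs can produce different winners: many weak inputs win under the sum rule, while a few strong ones win under the mean rule.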
How the perceptron stores information (“learning”)
Each A-unit carries a value that changes with reinforcement when the unit is active. This is how the perceptron learns.
Why this amounts to memory:
- On first presentation, the winning response is essentially random; but if the active A-units are reinforced (values increase), the same stimulus later more strongly drives that response—i.e., the system has stored the association in the value configuration of the A-layer/source-set.
- Because values are distributed over many A-units, storage is distributed: many units share the trace, not a single location (implied by the source-set/value formalism).
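A simplified sketch of that storage mechanism (the sizes are invented, and Rosenblatt’s source-set formalism is flattened here so that every A-unit carries a value per response): reinforcing the active A-units for whichever response won means the same stimulus drives that response much harder the next time it appears.

```python
import numpy as np

rng = np.random.default_rng(2)

n_assoc = 50
# Each response keeps a value for every A-unit (all zero before any learning).
values = {"R1": np.zeros(n_assoc), "R2": np.zeros(n_assoc)}

def response_strengths(active, values):
    """Total value each response receives from the currently active A-units."""
    return {r: float(v[active].sum()) for r, v in values.items()}

active = rng.random(n_assoc) < 0.3          # A-units activated by some fixed stimulus

print(response_strengths(active, values))   # before learning: both responses at 0.0
winner = "R1"                               # first time around, the winner is essentially arbitrary
values[winner][active] += 1.0               # reinforce the active A-units for that response
print(response_strengths(active, values))   # the same stimulus now clearly favours R1
```

Because the trace is spread across all the active A-units’ values, the memory is distributed rather than stored in any single location.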
Bivalent Systems
Up to this point, all the learning rules we’ve looked at are monovalent: whenever an A-unit (association unit) is active during reinforcement, its “value” (its influence on the response it connects to) goes up. Even in the γ-system, any active unit gains.
In a bivalent system, reinforcement can be positive or negative. Crucially, an active A-unit might either gain or lose value depending on the kind of reinforcement currently applied. If an experimenter can apply these two kinds of feedback externally (think: “reward” vs “punishment”), the perceptron can learn by trial and error: try a response, reward it if correct (strengthen what fired), punish it if wrong (weaken what fired).
Two ways he instantiates bivalence

- Reward/punishment as external signals.
  - If the emitted response is correct → apply positive reinforcement → all active A-units in the winning response’s source-set increase value.
  - If the emitted response is wrong → apply negative reinforcement → all active A-units decrease value (the network learns to disfavor the configuration that just led it astray).
- Binary-coded responses (no explicit reward needed).
  - Organize outputs as a set of bits (feature detectors). For each bit:
    - if the bit should be ON for the current stimulus, give positive feedback to the bit’s source-set;
    - if the bit should be OFF, give negative feedback to the active A-units in that bit’s source-set.
  - This still yields bivalent learning, because active units can be pushed up (when the bit is on) or down (when it’s off). It’s essentially teaching the network to carve out binary attributes, not just pick one class.
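A toy sketch of the reward/punishment loop (the stimuli, sizes, and step size are mine): after the network emits a response, a correct answer strengthens the values of the A-units that just fired for that response, and a wrong answer weakens them.

```python
import numpy as np

rng = np.random.default_rng(3)

n_assoc, n_response = 30, 2
values = rng.normal(scale=0.01, size=(n_assoc, n_response))  # A-unit value per response

def respond(active):
    """Emit whichever response receives the most total value from active A-units."""
    return int(np.argmax(active @ values))

# Two made-up stimuli (patterns of active A-units) with desired responses 0 and 1.
stimuli = [(rng.random(n_assoc) < 0.4).astype(float) for _ in range(2)]
targets = [0, 1]

for _ in range(50):
    for x, target in zip(stimuli, targets):
        r = respond(x)
        if r == target:
            values[x > 0, r] += 0.1      # reward: strengthen what just fired
        else:
            values[x > 0, r] -= 0.1      # punish: weaken the configuration that erred

print([respond(x) for x in stimuli])     # settles on [0, 1] after a few dozen passes
```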
Click here for followup and further reading: Blog