- Introduction
- Repo Structure
- Setup
- Implemented Architectures
- Manual Calculations
- Running the Tests
- Experiments & Results
- Conclusions & Insights
## Introduction

NeuralBytes is a pure Python package demonstrating how to implement several neural architectures from scratch, without using any external deep learning frameworks. In this repo, you will find:
- A simple Multi-Layer Perceptron (MLP) for binary classification.
- A minimalistic Convolutional Neural Network (CNN) that performs a single convolution layer.
- A character-level and word-level Recurrent Neural Network (RNN) with backpropagation through time (BPTT).
- A character-level and word-level Long Short-Term Memory (LSTM) network with gate-by-gate backpropagation.
Each model is implemented using plain Python lists and loops. The low-level matrix operations (multiplication, transpose, elementwise arithmetic) and activation functions (sigmoid, ReLU, tanh, softmax, and their derivatives) are defined in `utils.py` and `activations.py`.
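For a flavor of those helpers, here is a minimal sketch of a pure-Python matrix multiply and sigmoid pair. The names and signatures are illustrative, not necessarily those used in `utils.py`/`activations.py`.

```python
import math

def matmul(A, B):
    # (n x m) @ (m x p) matrix product over nested lists.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)
```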
## Repo Structure

- `neural_bytes/`: Python package containing the MLP, CNN, RNN, and LSTM implementations, plus utility modules.
- `mlp_by_hand.pdf` & `rnn_by_hand.pdf`: Manual calculations for the MLP and RNN, used to verify the backpropagation steps and the logic behind each approach.
- `plots_mlp/`, `plots_rnn/`, `plots_lstm/`: Loss vs. epochs curves for the various experiments.
- `test_*.py`: Example scripts demonstrating how to train and evaluate each model.
- `predictions.txt`: Log of MLP predictions and accuracies for different hyperparameters.
- `requirements.txt`: Contains `matplotlib` (optional, for visualization only).
## Setup

- Clone the repository:

  ```bash
  git clone git@github.com:ramosv/NeuralBytes.git
  ```

- Go into the project directory:

  ```bash
  cd NeuralBytes
  ```

That's it: the package has no further dependencies to install.

Note: if you would like to plot the loss per epoch, you will need to install `matplotlib` via `pip install matplotlib`.

No other dependencies are required. All code is built on Python standard library functions.
## Implemented Architectures

### Multi-Layer Perceptron (MLP)

- Implements a 2-layer feedforward neural network (up to two neurons per hidden layer, configurable up to N hidden layers).
- Uses sigmoid activations and mean squared error (MSE) loss.
- Forward pass:
  - Compute $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$.
  - Apply $\sigma(Z^{[l]})$ to get $A^{[l]}$.
- Backward pass: elementwise derivatives of the sigmoid, computing $\delta^{[l]}$, $dW^{[l]}$, $db^{[l]}$.
- Parameter update: gradient descent with learning rate $\alpha$.
- Hyperparameters:
  - Number of hidden layers (`hidden_layer`)
  - Learning rate (`lr`)
  - Number of epochs (`epochs`)
- Matrix operations: all matrix multiplications, transposes, and elementwise additions/subtractions are implemented by hand in `utils.py` and reused across all architectures.
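To make the notation above concrete, here is a minimal sketch of one layer's forward pass, the output-layer delta for MSE, and a gradient-descent update, written with plain lists for a single sample. The function names are illustrative and not necessarily those used in the package.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_layer(W, b, A_prev):
    # Z = W A_prev + b, then A = sigma(Z), for a single sample.
    Z = [sum(W[i][j] * A_prev[j] for j in range(len(A_prev))) + b[i]
         for i in range(len(b))]
    return Z, [sigmoid(z) for z in Z]

def output_delta(A, Y):
    # MSE output layer: delta = (A - Y) * sigma'(Z), with sigma'(Z) = A * (1 - A).
    return [(A[i] - Y[i]) * A[i] * (1.0 - A[i]) for i in range(len(A))]

def gradient_step(W, b, delta, A_prev, lr):
    # dW = delta * A_prev^T, db = delta, then plain gradient descent with rate lr.
    for i in range(len(W)):
        for j in range(len(W[i])):
            W[i][j] -= lr * delta[i] * A_prev[j]
        b[i] -= lr * delta[i]
```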
### Convolutional Neural Network (CNN)

- Single 2D convolutional layer with `num_filters` filters of size `(filter_height × filter_width)`.
- ReLU activation on the convolution outputs.
- Flattening of all activated feature maps into a single vector.
- Fully connected layer mapping to `num_classes` outputs (sigmoid for binary, softmax for multi-class).
- Loss:
  - Binary (`num_classes=1`): MSE
  - Multi-class: cross-entropy
- Backpropagation:
  - Compute gradients of the FC weights/biases.
  - Backpropagate through the flattened ReLU maps, zeroing gradients where the activation ≤ 0.
  - Accumulate filter gradients by convolving each $d\text{Conv}$ map with the original input.
- Hyperparameters: `filter_size`, `num_filters`, `num_classes`, `epochs`, `lr`.
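The convolution, ReLU, and flatten steps above can be sketched with a few plain-Python functions ("valid" convolution with stride 1; the names are illustrative, not the repo's):

```python
def conv2d(image, kernel):
    # "Valid" 2D convolution (really cross-correlation) of one feature map with one filter.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    s += image[i + a][j + b] * kernel[a][b]
            row.append(s)
        out.append(row)
    return out

def relu_map(feature_map):
    # Elementwise ReLU over a 2D feature map.
    return [[v if v > 0 else 0.0 for v in row] for row in feature_map]

def flatten(maps):
    # Concatenate all activated feature maps into one vector for the FC layer.
    return [v for m in maps for row in m for v in row]
```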
### Recurrent Neural Network (RNN)

- Character-level and word-level RNN with a single hidden layer (size `hidden_size`) and BPTT.
- Input: one-hot vectors of dimension `vocab_size`; hidden state $h_t = \tanh(U x_t + W h_{t-1} + b)$.
- Output: $z_t = V h_t + c$, $p_t = \text{softmax}(z_t)$.
- Loss: cross-entropy over the sequence.
- Gradients:
  - Compute $dy_t = p_t - y_t$.
  - Backpropagate through $V$, propagate $dh_t = V^\top dy_t + dh_{t+1}$.
  - Apply $\tanh'$ to get $dh_{\text{raw}}$, accumulate into $dU$, $dW$, $db$.
  - Clip all gradients to ±`clip_value`.
- Parameter update via Adagrad.
- Sampling: given a seed index and hidden state, repeatedly sample the next character/word from the softmax distribution to generate sequences of length `n` (see the sketch below).
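A minimal sketch of the hidden-state update and softmax sampling described above, using plain lists (illustrative names; the repo's code may differ):

```python
import math
import random

def rnn_step(x, h_prev, U, W, b):
    # h_t = tanh(U x_t + W h_{t-1} + b), computed row by row.
    h = []
    for i in range(len(b)):
        s = b[i]
        s += sum(U[i][j] * x[j] for j in range(len(x)))
        s += sum(W[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(s))
    return h

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def sample_index(p):
    # Draw the next character/word index from the softmax distribution p.
    r, acc = random.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if r < acc:
            return i
    return len(p) - 1
```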
### Long Short-Term Memory (LSTM)

- Character-level and word-level LSTM with gates: forget ($f_t$), input ($i_t$), output ($o_t$), and cell candidate ($g_t$).
- Equations per time step $t$:
  - $f_t = \sigma(W_f h_{t-1} + U_f x_t + b_f)$
  - $i_t = \sigma(W_i h_{t-1} + U_i x_t + b_i)$
  - $g_t = \tanh(W_c h_{t-1} + U_c x_t + b_c)$
  - $c_t = f_t \odot c_{t-1} + i_t \odot g_t$
  - $o_t = \sigma(W_o h_{t-1} + U_o x_t + b_o)$
  - $h_t = o_t \odot \tanh(c_t)$
  - $z_t = V h_t + c$, $p_t = \text{softmax}(z_t)$
- Loss: cross-entropy over the sequence.
- Backpropagation through time:
  - Compute $dy_t = p_t - y_t$, accumulate $dV$, $dc$.
  - Backprop through $h_t$, $c_t$, and each gate:
    - $do = dh \odot \tanh(c_t)$; $dW_o$, $dU_o$, $db_o$ via $\sigma'(o_t)$.
    - $dc_t = dh \odot o_t \odot \tanh'(c_t) + dc_{t+1}$.
    - $df = dc_t \odot c_{t-1}$; $dW_f$, $dU_f$, $db_f$ via $\sigma'(f_t)$.
    - $di = dc_t \odot g_t$; $dW_i$, $dU_i$, $db_i$ via $\sigma'(i_t)$.
    - $dg = dc_t \odot i_t$; $dW_c$, $dU_c$, $db_c$ via $\tanh'(g_t)$.
    - Propagate $dh_{t-1}$ and $dc_{t-1}$.
  - Clip all gradients, then update via Adagrad.
- Sampling: similar to the RNN, but with the cell state $c_t$ carried along.
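One LSTM time step following the gate equations above might look like the sketch below (plain lists, illustrative names; not the repo's exact implementation):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev,
              Wf, Uf, bf, Wi, Ui, bi, Wc, Uc, bc, Wo, Uo, bo):
    def affine(W, U, b):
        # W h_{t-1} + U x_t + b, one output row at a time.
        return [b[r]
                + sum(W[r][j] * h_prev[j] for j in range(len(h_prev)))
                + sum(U[r][j] * x[j] for j in range(len(x)))
                for r in range(len(b))]

    f = [sigmoid(v) for v in affine(Wf, Uf, bf)]    # forget gate
    i = [sigmoid(v) for v in affine(Wi, Ui, bi)]    # input gate
    g = [math.tanh(v) for v in affine(Wc, Uc, bc)]  # cell candidate
    o = [sigmoid(v) for v in affine(Wo, Uo, bo)]    # output gate
    c = [f[k] * c_prev[k] + i[k] * g[k] for k in range(len(c_prev))]
    h = [o[k] * math.tanh(c[k]) for k in range(len(c))]
    return h, c
```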
## Manual Calculations

Two PDF files walk through the math by hand. These were instrumental for debugging, for verifying the backpropagation steps, and for understanding each algorithm.

- MLP by hand: `mlp_by_hand.pdf`
- RNN by hand: `rnn_by_hand.pdf`
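For a flavor of these hand calculations, a single sigmoid neuron with made-up weights $w = [0.5, -0.25]$, bias $b = 0.1$, and input $x = [1, 0]$ (numbers chosen for illustration, not taken from the PDFs) gives:

$$z = 0.5 \cdot 1 + (-0.25) \cdot 0 + 0.1 = 0.6, \qquad a = \sigma(z) = \frac{1}{1 + e^{-0.6}} \approx 0.646$$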
## Running the Tests

Each architecture has a corresponding test script.

**MLP**

```bash
python test_mlp.py
```

- Trains a tiny MLP on a toy dataset (`X = [[0,0,1,1],[0,1,0,1]]`, `Y = [[0,1,1,1]]`).
- Outputs predictions and accuracy to `predictions.txt`.
- You can adjust `epochs`, `lr`, and `hidden_layer` directly in the script.

**CNN**

```bash
python test_cnn.py
```

- Expects CIFAR-10 batches unzipped in `./cifar-10-batches-py/` (see the loading sketch after this list).
- Loads 100 images from each batch, converts them to grayscale via `rgb_to_grey()`, then trains a CNN (3×3 filters, 4 filters, 10 classes).
- Prints test accuracy (70/30 train-test split).

**RNN**

```bash
python test_rnn.py
```

- Uses the `./JulesVerde/` folder at the repository root, which contains example Jules Verne books (“The Mysterious Island.txt”, “Around the World in Eighty Days.txt”).
- Runs both character-level and word-level experiments with an RNN (hidden size 128, seq_length 50, lr 0.01, epochs 100).
- Displays a loss vs. epochs plot (requires `matplotlib`).
- Prints sampled text at epochs {20, 40, 60, 80, 100}.

**LSTM**

```bash
python test_lstm.py
```

- Same folder requirements (`./JulesVerde/`).
- Runs both character-level and word-level LSTM experiments (hidden size 128, seq_length 50, lr 0.01, epochs 100).
- Displays a loss vs. epochs plot (requires `matplotlib`).
- Prints sampled text at epochs {20, 40, 60, 80, 100}.
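The CIFAR-10 "python version" batches are standard pickled dictionaries, so loading and grayscaling one of them might look roughly like the sketch below. The helper names and luminosity weights are assumptions for illustration; the repo's actual `rgb_to_grey()` may do this differently.

```python
import pickle

def load_batch(path):
    # Official CIFAR-10 python batches are pickled dicts with b"data" and b"labels".
    # (Unpickling them yields NumPy arrays, so NumPy must be importable.)
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    return batch[b"data"], batch[b"labels"]

def to_grey(row):
    # Each row stores 1024 red, then 1024 green, then 1024 blue values (row-major 32x32).
    # Luminosity weights are illustrative; dividing by 255 rescales to [0, 1].
    return [[(0.299 * row[i * 32 + j]
              + 0.587 * row[1024 + i * 32 + j]
              + 0.114 * row[2048 + i * 32 + j]) / 255.0
             for j in range(32)]
            for i in range(32)]

data, labels = load_batch("./cifar-10-batches-py/data_batch_1")
first_image = to_grey(data[0])   # 32x32 grid of floats in [0, 1]
```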
## Experiments & Results

### MLP

For the MLP, I performed an extensive hyperparameter sweep over combinations of epochs and learning rates. The most illustrative loss curves, showing how training progresses over epochs, are summarized below.
The loss-curve plots (see `plots_mlp/`) cover the following configurations:

| Epochs | Learning rates |
|---|---|
| 10,000 | 0.0001, 0.1 |
| 8,000 | 0.001, 0.1 |
| 6,000 | 1e-6, 0.0001 |
| 4,000 | 0.0001 |
| 2,000 | 0.1, 0.0001 |
| 1,000 | 0.0001 |
| 500 | 0.001, 0.1 |
| 200 | 0.01, 0.1 |
| 100 | 0.01, 0.1 |
**Observations:**
- Smaller learning rates (1e-4) tend to converge steadily over thousands of epochs, whereas larger rates (0.1) cause rapid initial drops but may oscillate or diverge if too large.
- Epochs in the range 1,000–10,000 were needed for stable convergence at small lr.
### CNN

**Dataset:** CIFAR-10 (500 images total, 5 batches × 100 images).

**Preprocessing:** RGB images converted to grayscale and normalized to [0, 1] via `rgb_to_grey()`.

**Model:** a single convolutional layer (4 filters of size 3×3, stride 1, no pooling), ReLU activation, flattened to a 1×((32-3+1)×(32-3+1)×4) vector, then a softmax output layer for 10 classes.

**Hyperparameters:** `filter_size=(3,3)`, `num_filters=4`, `epochs=50`, `lr=1e-3`, `num_classes=10`.
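As a quick check on the flattened vector size stated above:

$$(32 - 3 + 1) \times (32 - 3 + 1) \times 4 = 30 \times 30 \times 4 = 3600$$

so the fully connected layer receives 3,600 input features per image.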
**Result:**

- Test accuracy: typically in the low 20–30% range on 150–200 test images (70/30 split).
- This demonstrates correct convolution and backprop, although accuracy stays low due to the minimal architecture and small sample size.

Sample training log:

```text
Epoch 10/50, Cost: 2.2791
Epoch 20/50, Cost: 2.0504
Epoch 30/50, Cost: 1.8229
Epoch 40/50, Cost: 1.5603
Epoch 50/50, Cost: 1.2269
Test Accuracy: 17.33%
```
### RNN

**Data:** full-length novels by Jules Verne (in `./JulesVerde/`).

**Char-level RNN:**

- Vocabulary: 256 ASCII characters (one-hot).
- Hyperparameters: `hidden_size=128`, `seq_length=50`, `learning_rate=0.01`, `epochs=100`, `sample_len=40`.
- Loss curve: rapid decline from ~275 to ~150 in the first 20 epochs, then a plateau around 140–160.
- Sampled text (epoch 100): `tmdil ttmml ...` (mostly gibberish with occasional character patterns).

**Word-level RNN:**

- Vocabulary: top 5,000 words (one-hot).
- Hyperparameters: `hidden_size=128`, `seq_length=50`, `learning_rate=0.01`, `epochs=100`, `sample_len=200`.
- Loss curve: starts around ~425 and descends to ~320–360 with noisy fluctuations.
- Sampled text (epoch 100): `is should steam such to responded or ages, tempted doubtless anyway.” xi. and after` (valid English words appear, though the grammar is rough).

**Total running times:**

- Char-level RNN: ~139 seconds
- Word-level RNN: ~2,457 seconds (≈ 41 minutes)
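For reference, the two vocabulary schemes above can be built with a few lines of standard-library Python; the helper names here are illustrative, not the repo's:

```python
from collections import Counter

def char_vocab():
    # Character-level: index each of the 256 ASCII/byte values.
    return {chr(i): i for i in range(256)}

def word_vocab(text, size=5000):
    # Word-level: keep only the `size` most frequent words.
    counts = Counter(text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(size))}

def one_hot(index, size):
    # Inputs to the RNN/LSTM are one-hot vectors of dimension `size`.
    vec = [0.0] * size
    vec[index] = 1.0
    return vec
```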
### LSTM

**Char-level LSTM:**

- Vocabulary: 256 ASCII characters.
- Hyperparameters: `hidden_size=128`, `seq_length=50`, `learning_rate=0.01`, `epochs=100`, `sample_len=40`.
- Loss curve: similar to the RNN, a rapid drop to ~150 by epoch 20, then fluctuations around 140–180.
- Sampled text (epoch 100): `s a ...` (gibberish with occasional real-character fragments).

**Word-level LSTM:**

- Vocabulary: top 5,000 words.
- Hyperparameters: `hidden_size=128`, `seq_length=50`, `learning_rate=0.01`, `epochs=100`, `sample_len=50`.
- Loss curve: drops from ~425 to ~280 by epoch 100 (better than the RNN's ~320).
- Sampled text (epoch 100): `hearing. carbines, most peak tom the sea the of one to the moon seventy-eight as` (more coherent phrases and article usage).

**Total running times:**

- Char-level LSTM: ~407 seconds
- Word-level LSTM: ~5,776 seconds (≈ 96 minutes)
## Conclusions & Insights

- **Educational value:**
  - Implementing each architecture with plain Python lists and loops highlights every step of forward and backward propagation.
  - The manual derivations (`mlp_by_hand.pdf`, `rnn_by_hand.pdf`) were crucial for ensuring correctness, especially for BPTT in the RNN/LSTM.
- **MLP observations:**
  - Small learning rates (1e-4 to 1e-6) require thousands of epochs to converge on the example dataset.
  - Larger learning rates (0.1) converge faster initially but can oscillate or diverge if too large.
- **CNN observations:**
  - Without pooling or multiple layers, accuracy on CIFAR-10 remains low (20–30%).
  - The core convolution/backprop logic is demonstrated correctly; deeper architectures and more data augmentation would be needed for higher accuracy.
- **RNN vs. LSTM (char-level):**
  - Both the char-level RNN and LSTM show similar loss curves and sample quality under these settings.
  - LSTM gating does not yield a marked advantage over the RNN for short sequences (50 chars) and only 100 epochs.
- **RNN vs. LSTM (word-level):**
  - The word-level LSTM outperforms the word-level RNN quantitatively (loss ~280 vs. ~320 at epoch 100) and qualitatively (more coherent sampled phrases).
  - The larger vocabulary (5,000 words) makes learning harder, and the LSTM's memory helps capture longer dependencies.
- **Char vs. word:**
  - Char-level models learn spelling patterns but struggle to assemble full words or syntax within the given epochs.
  - Word-level models generate valid English words from the start, producing more readable (though still choppy) text.
- **Hyperparameter notes:**
  - Most of the loss reduction for the sequence models happens by epochs 20–40, with diminishing returns beyond epoch 60.
  - A hidden size of 128 is sufficient for these toy experiments; increasing to 256 or 512 may improve results at greater compute cost.
  - A sequence length of 50 words captures multi-sentence context, whereas 50 chars spans only a few words. Consider longer char sequences (100–200) for richer structure.



















