My first ML project was building a tool that could alert drivers if their eyes were closed for too long. The idea came from a news article about how Amazon monitors its drivers through in-truck cameras, which made me want to turn that kind of monitoring into something actually useful. This post breaks down the design decisions I made, explains the tools I used in depth, and reflects on what I’ve learned from the experience.

Project Overview

There were essentially four steps:

  1. Stream camera footage of the face
  2. Detect bounding boxes of eyes to get an image crop
  3. Evaluate Open/Closed-ness of crop using a DL model
  4. Signal to driver if eyes are closed for a long time

Camera Stream

I first tested my software using my laptop’s webcam, which was easy to do with OpenCV. To get a continuous stream of footage, I simply wrapped the frame capture in a while loop.

import cv2

capture = cv2.VideoCapture(0)

while(True):
    # get frame data and store as img (frame is an image type)
    ret, frame = capture.read()
    ...
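
For reference, here is a minimal standalone version of that capture loop with display and cleanup added; the window title and the 'q' quit key are arbitrary choices of mine, not anything required by OpenCV.

import cv2

capture = cv2.VideoCapture(0)   # 0 = default webcam

while True:
    ret, frame = capture.read()
    if not ret:                  # no frame returned (camera disconnected, etc.)
        break

    # ... detection and classification happen here ...

    cv2.imshow('drowsiness-detector', frame)   # show the live (annotated) frame
    if cv2.waitKey(1) & 0xFF == ord('q'):      # press 'q' to stop the stream
        break

capture.release()                # free the camera
cv2.destroyAllWindows()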

Finding the Eyes

The first thing I noticed was that the video stream itself took a lot of compute, especially since I was drawing the live “computer vision” overlay back onto my screen. This prompted me to find a low-compute algorithm for locating the eyes in a given image.

While there are existing deep learning models for face detection, I chose to use OpenCV’s Haar cascades. Haar cascades are classical object detectors that can find objects in an image regardless of their scale and location.

How Haar Cascades Work

Sample Haar features are evaluated inside a window that slides across the picture, computing and matching features as it goes. The algorithm works in two basic steps:

  1. Calculating Haar Features with an Integral Image
  2. Using Adaboost and Cascading

Calculating Haar Features

Similar to how convolutional filters work to detect edges/patterns within the filter window, Haar features are calculations that are performed on adjacent rectangular regions within the detection window. The calculation involves summing the pixel intensities in each region and calculating the differences between the sums (i.e. the differences between the black and white regions below).

Fig 1. Examples of Haar features. (Image source: Viola et al).

This is costly to do at every window position, so the algorithm takes advantage of something called an integral image. This is an intermediate representation of the image, defined as:

$$ ii(x, y) = \sum_{x^{\prime} \leq x, y^{\prime} \leq y} i(x^{\prime}, y^{\prime}) $$

which says the integral image at location $(x, y)$ contains the sum of the pixels above and to the left of $(x, y)$, inclusive. In practice this requires only one pass through the entire image, making it $O(n)$ in the number of pixels.

These precomputed values make the Haar feature operations much quicker, since any rectangle sum can be recovered from a handful of previously calculated values. It’s analogous to how prefix sum arrays speed up range-sum queries.
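
To make the analogy concrete, here’s a small NumPy sketch of the idea (my own illustration, not OpenCV’s internals): build the integral image with two cumulative sums, then recover any rectangle sum from at most four lookups.

import numpy as np

img = np.random.randint(0, 256, size=(480, 640)).astype(np.int64)   # fake grayscale image

# Integral image: ii[y, x] = sum of img over every pixel above and to the left, inclusive
ii = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1+1, x0:x1+1] using at most four integral-image lookups."""
    total = ii[y1, x1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

# A two-rectangle Haar feature is then just a difference of two such sums,
# e.g. (sum of the left half of a window) - (sum of the right half).
assert rect_sum(ii, 10, 10, 19, 19) == img[10:20, 10:20].sum()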

You can find the original research paper here.

Adaboost and Cascading

AdaBoost (like other boosting algorithms) “chooses” the best Haar features and aggregates them into a stronger classifier. This is akin to ensemble modelling.

Fig 2. Boosting visualized: the first 3 features are “boosted” into a better classifier. (Image source: Akash Desarda).

The best weak classifiers are then trained on a feature set and a training set of positive and negative images, and the trained stages are chained into a cascade, so that windows which clearly contain no face or eye are rejected cheaply in the early stages.
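
This isn’t how OpenCV trains its cascades, but as a toy illustration of the same boosting idea, scikit-learn’s AdaBoostClassifier combines many weak decision stumps (each roughly analogous to a single Haar feature test) into one stronger classifier:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for "feature values + positive/negative labels"
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The default weak learner is a depth-1 decision stump; boosting reweights the
# training examples and combines 50 stumps into a stronger ensemble.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy of the boosted ensemble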

Advantages and Disadvantages

The advantages aligned well with my objectives:

  • Haar cascades are fast and can work well in real time (primary objective)
  • Simple to implement, less computing power required

The key drawback is that they are not as accurate as SOTA object detection algorithms (YOLO, Fast R-CNN, etc.), and they throw a lot of false positives. However, I decided this was an acceptable trade-off.

Here’s the code segment outlining the haar-cascades in use. I downloaded the pre-trained classifiers from OpenCV.

face_classifier = cv2.CascadeClassifier('haar_cascade_files/haarcascade_frontalface_alt.xml')
lefteye_classifier = cv2.CascadeClassifier('haar_cascade_files/haarcascade_lefteye_2splits.xml')
righteye_classifier = cv2.CascadeClassifier('haar_cascade_files/haarcascade_righteye_2splits.xml')

...

while (True):

    ...

    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # convert cam data to RGB
    img = transforms.ToTensor()(img) # now looks like C X H X W
    #print(img.shape)            # img : tensor : 3 x 480 x 640 (C X H X W)
    #print(frame.shape)          # frame: image : H X W X C

    # Detect faces/eyes ROI
    # (25px,25px) set as minimum face size - reduce chance of false positives
    faces = face_classifier.detectMultiScale(frame, minSize = (25,25))
    left_eye = lefteye_classifier.detectMultiScale(frame)
    right_eye = righteye_classifier.detectMultiScale(frame)
    ...

You don’t actually need to find the face, but I chose to detect it and display its bounding box for UI purposes, as sketched below.
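
Drawing the face box is one extra call inside the main loop; a minimal sketch using the faces returned by detectMultiScale:

while (True):
    ...
    # Draw a rectangle around each detected face (purely for the on-screen UI)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)   # green box, 2 px thick
    ...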

Evaluate Eye Crop

I now had bounding boxes of the eyes. I passed these smaller crop images into my DL model for evaluation.

Model

For model engineering, I used PyTorch. I figured this wouldn’t be a complicated task, so I stuck with a simple CNN model:

  • 3 Conv layers that encode the image into a vector which is fed into 2 FC layers
  • Dropout: 25% in the FC layers
  • Max pooling at each convolution step
  • ReLU on hidden layers

The model class is below:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # conv layer
        # sees 24x24 x3 (RGB)
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1) # in depth = 3, out depth = 16, ksize = 3, padding 1, stride 1 (default)
        # sees 12x12 x16
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1) # in depth = 16, out depth = 32, ksize = 3, padding 1, stride 1
        # sees 6x6 x32
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        # final out of conv: 3x3 x64

        self.fc1 = nn.Linear(3*3*64, 100, bias=True)
        self.fc2 = nn.Linear(100, 2, bias=True)
        self.dropout = nn.Dropout(p=0.25)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 3*3*64)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.fc2(x)

        return F.log_softmax(x, dim=1)   # log-probabilities over the 2 classes (closed, open)

Fig 3. Train vs Dev Set Loss.

Heeding Andrew Ng’s advice, I tried something simple and saw how it did before overcomplicating things.

Training Details:

I saved the model every time the dev set loss decreased, so I would always have a checkpoint from before the model started overfitting. I trained for 50 epochs using mini-batch sizes of 20 images. I used a 75-25 train/test split on 2000 eye images, and carved a cross-validation dev set out of 20% of the train set. After 50 epochs it turned out pretty well: 98% accuracy on the test set and 99% on the train set. The model trained on an Intel i7-8550U CPU in under 3 minutes; no GPU was used.

Fig 4. Test set accuracy %.
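
The full training loop isn’t included in this post; here’s a rough sketch of the checkpointing logic, where the optimizer choice, learning rate, and the names train_loader and valid_loader are assumptions standing in for my actual setup.

import torch
import torch.nn.functional as F
import torch.optim as optim

model = Net()
optimizer = optim.Adam(model.parameters(), lr=1e-3)    # optimizer/lr assumed, not my exact setting
best_valid_loss = float('inf')

for epoch in range(50):                                # 50 epochs, mini-batches of 20
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = F.nll_loss(model(images), labels)       # model outputs log-probabilities
        loss.backward()
        optimizer.step()

    # Evaluate on the dev set and checkpoint whenever the dev loss improves
    model.eval()
    valid_loss = 0.0
    with torch.no_grad():
        for images, labels in valid_loader:
            valid_loss += F.nll_loss(model(images), labels).item()

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model_best.pt')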

In practice, a false positive meant more than one right or left eye was detected. I used a band-aid solution of just taking the first eye returned by each Haar classifier.

The code for decision making can be found below:

while(True):
    ...
    # Draw rectangle around right eye and classify eye as open/closed
    for (x, y, w, h) in right_eye:
        cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 0, 255), 1)
        r_eye = frame[y:y+h, x:x+w]
        guh = Image.fromarray(r_eye)
        guh.save('test_righteye.jpg')
        #print(r_eye.shape)          # r_eye is an IMG H X W X C
        r_eye = eyetrans(Image.fromarray(r_eye))
        #print(r_eye.shape)          # r_eye is NOW a tensor: C x H x W

        rpred = Model(r_eye.unsqueeze(0))   # forward pass on a batch of 1
        ps = torch.exp(rpred)
        #print(ps)
        top_p, top_class = ps.topk(1, dim=1)    # first index is closed true, second is open true
        r_labels = 'closed' if top_class == 0 else 'open'
        print('Right eye: ', r_labels)
        break

    for (x,y,w,h) in left_eye:
        cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 0, 255), 1)
        l_eye = frame[y:y+h, x:x+w]
        guh = Image.fromarray(l_eye)
        guh.save('test_lefteye.jpg')
        #print(l_eye.shape)          # l_eye is an IMG H X W X C
        l_eye = eyetrans(Image.fromarray(l_eye))
        #print(l_eye.shape)          # l_eye is NOW a tensor: C x H x W

        lpred = Model(l_eye.unsqueeze(0))   # forward pass on a batch of 1
        ps = torch.exp(lpred)   # exponentiate the log-probabilities to get probabilities
        #print(ps)
        top_p, top_class = ps.topk(1, dim=1)    # first index is closed true, second is open true
        l_labels = 'closed' if top_class == 0 else 'open'
        print('Left eye: ', l_labels)
        break
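
One thing not shown in these excerpts is the eyetrans transform. A plausible definition, consistent with the 24x24 RGB input the model expects (the exact resize and the lack of normalization are assumptions on my part):

from torchvision import transforms

# Assumed preprocessing for each eye crop: PIL image -> 24x24 RGB tensor
eyetrans = transforms.Compose([
    transforms.Resize((24, 24)),   # the conv stack above expects 24x24 inputs
    transforms.ToTensor(),         # H x W x C uint8 -> C x H x W float in [0, 1]
])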

Driver Signal

I used a “score” to distinguish blinks from actual drowsiness. The score climbs the longer the driver’s eyes stay closed and falls back down while they stay open. If the score rises above an arbitrary threshold, an alarm sounds.

while(True):
    ...

    if(l_labels == 'closed' and r_labels == 'closed'):
        score += 1
        cv2.putText(frame, "Eyes Closed", (10, height - 20), font, 1, (255, 255, 255), 1, cv2.LINE_AA)
    else:
        score -= 1
        cv2.putText(frame, "Eyes Open", (10, height - 20), font, 1, (255, 255, 255), 1, cv2.LINE_AA)

    # Keep the score from going negative while the eyes stay open
    if(score < 0):
        score = 0

    # Display score
    cv2.putText(frame, 'Drowsy Score:'+str(score), (200, height - 20), font, 1, (255, 255, 255), 1, cv2.LINE_AA)

    # Alarm control
    if(score > 150):
        #person is feeling sleepy so we beep the alarm
        cv2.imwrite(os.path.join(path, 'image.jpg'), frame)
        try:
            sound.play()             # beep the alarm
        except Exception:
            pass                     # don't crash the loop if audio playback fails
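
The sound object isn’t defined in the excerpt above; assuming pygame’s mixer (the alarm file name here is just a placeholder), the setup looks roughly like:

import pygame

pygame.mixer.init()                          # initialize audio before the main loop
sound = pygame.mixer.Sound('alarm.wav')      # placeholder path to an alarm clip
# sound.play() is then called whenever the drowsy score crosses the threshold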

Reflections

Haar-cascade issues

The lack of accuracy in the OpenCV Haar cascade classifiers was problematic. For some reason, they struggled to identify both of my eyes at once. At times the detected left and right eye regions would overlap, and the classifiers would occasionally detect my nostrils as eyes.

Model issues

Assuming near-100% “human level” accuracy on this task, the small gaps between that, my 99% train accuracy, and my 98% test accuracy suggest there were minimal avoidable bias and variance issues in my model, meaning it was a good model! Something I could experiment with is whether even smaller models can perform comparably.

However, despite the strong performance on the model’s metrics, in practice it didn’t work too well on me. It had trouble detecting my eyes unless I opened them really wide. After inspecting the data, I realized there weren’t many samples of Asian eyes on the smaller side (mine lol). So despite performing well on the dataset, the dataset didn’t contain enough examples from each ethnicity to generalize well to the real world. To fix this data-mismatch problem, in the future I would diversify the data and change the split percentages: put more towards training (perhaps a 90-10 split), and find more representative data to add to the overall dataset.

Takeaways

  1. Programming knowledge is best consolidated by doing. This was my first project after taking Udacity’s Intro to Deep Learning course, and building something from scratch definitely helped solidify my understanding of CNN architectures.
  2. OpenCV has a lot more features than I thought! At first I was under the impression that it was just a data-extraction tool, but it turns out it also ships with a lot of machine learning modules.
  3. Take time to explore your dataset first to see if it is appropriate for the task at hand…

Future steps

As of now, there aren’t planned future steps for this project. But…

  1. An obvious area for improvement is the 89% classification accuracy of the Haar-cascade classifier. I could look into existing literature and find lightweight detection models used today; a tiny YOLO variant comes to mind right away.
  2. It would also be cool to include further sensors to make a more accurate classification of driver drowsiness. For example, pressure pads placed in the steering wheel could provide the driver’s grip strength, which could be used in conjunction with eye-openness tracking to see if their grip is also faltering. Something my program also fails to consider is that people can “zone out” with their eyes open. While acquiring data to train a network that could identify this might be difficult, I’m sure it might be possible with enough digging.
  3. Reaching out to Amazon/truck companies to see if there’s a value proposition in building this into a scalable product.

Full project on my GitHub here.