About Serving Predictions on Google Kubernetes Engine


Serving RepNet on Google Kubernetes Engine

Often, prediction processing is lightweight enough to be done locally. But when hardware capabilities beyond what is available locally are required, it is necessary to move this processing to a remote platform such as Google Cloud Platform.

There are many ways to serve predictions on GCP, such as the AI Platform Prediction Service. Although the AI Platform Prediction Service is quite simple to use and handles model versioning automatically, it has disadvantages such as requiring communication via the HTTP Prediction API.

In this case, heavy prediction processing needs to be handled by a remote service, and an efficient communication protocol, such as GRPC, is needed to handle multiple requests with a lot of data.

We will be using code from an existing Colab Notebook demoing the RepNet Deep Learning Network. RepNet is a brilliant Neural Network that predicts periodicity and counts occurrences of actions within videos. Periodicity is the likelihood of a video frame being part of a repeating sequence of actions. We will be using RepNet to experiment with hosting Deep Learning on GCP.

We will adapt the RepNet Notebook code so that we can:
  • create a model for saving to a format compatible with TensorFlow Serving
  • run our client, do pre-processing and post-processing of data, and make GRPC requests to our prediction server on Kubernetes
  • you can find most of the code referred to here, in the links in this article

Back by popular demand, my talented feline assistant, Chamo

RepNet counts occurrences within a movie. For this work we will be using a 1.1 MB movie of my cat, Chamo, wagging his tail:

If you run the movie, you will see that Chamo wags his tail just over 6 times. If our RepNet model on Kubernetes can output this information we will have succeeded.

Task Overview:

  1. Create and save the RepNet model in the SavedModel format

  2. Create a Docker image to serve predictions

  3. Create the Client

  4. Create the Kubernetes cluster

  5. Connect the client and Test


Create and save the RepNet model in the SavedModel format

Download the model in the check-point format
import os

PATH_TO_CKPT = './tmp/repnet_ckpt/'
os.system('mkdir {}'.format(PATH_TO_CKPT))
os.system('wget -nc -P {} https://storage.googleapis.com/repnet_ckpt/checkpoint'.format(PATH_TO_CKPT))
os.system('wget -nc -P {} https://storage.googleapis.com/repnet_ckpt/ckpt-88.data-00000-of-00002'.format(PATH_TO_CKPT))
os.system('wget -nc -P {} https://storage.googleapis.com/repnet_ckpt/ckpt-88.data-00001-of-00002'.format(PATH_TO_CKPT))
os.system('wget -nc -P {} https://storage.googleapis.com/repnet_ckpt/ckpt-88.index'.format(PATH_TO_CKPT))

This code downloads the Neural Network checkpoint data from Google Storage. Note, checkpoint data is the model parameter data only; it does not include the model architecture. We will use code from the RepNet Notebook to instantiate the model with the correct architecture and then save all of that information in the SavedModel format that is used by TensorFlow serving. This is shown below.

Extract the code that handles model creation
def get_repnet_model(logdir):
  """Returns a trained RepNet model.

    logdir (string): Path to directory where checkpoint will be downloaded.

    model (Keras model): Trained RepNet model.
  # Check if we are in eager mode.
  assert tf.executing_eagerly()

  # Models will be called in eval mode.

  # Define RepNet model.
  model = ResnetPeriodEstimator()
  # tf.function for speed.
  model.call = tf.function(model.call)

  # Define checkpoint and checkpoint manager.
  ckpt = tf.train.Checkpoint(model=model)
  ckpt_manager = tf.train.CheckpointManager(
      ckpt, directory=logdir, max_to_keep=10)
  latest_ckpt = ckpt_manager.latest_checkpoint
  print('Loading from: ', latest_ckpt)
  if not latest_ckpt:
    raise ValueError('Path does not have a checkpoint to load.')
  # Restore weights.

  # Pass dummy frames to build graph.
  model(tf.random.uniform((1, 64, 112, 112, 3)))
  return model
PATH_TO_CKPT = './tmp/repnet_ckpt/'
PATH_TO_SAVED_MODEL = './tmp/repnet_model/1/'

model = get_repnet_model(PATH_TO_CKPT)

The above code combines the checkpoint data with the RepNet model layer architecture. You can see more of the construction of the network in the save_model.py python script.

Save the model

model = get_repnet_model(PATH_TO_CKPT)

The real goal of the save_model.py program we just saw is to save the model in the SavedModel format so that TensorFlow serving understands it.

Create a Docker image to serve predictions

FROM tensorflow/serving:1.15.0-gpu

COPY ./repnet_model /models/repnet/

ENTRYPOINT ["/usr/bin/tf_serving_entrypoint.sh"]

We use the tensorflow/serving:1.15.0-gpu docker image as a base for our image. Note that we use the GPU enhanced image. This is because a GPU is a requirement of the RepNet Colab Notebook. Because we are using our own docker image we are able to use a GPU architecture on GCP if we choose to.

But don’t worry, if you don’t have a GPU on your machine, you can specify tensorflow/serving:1.15.0 and everything will still work.

This should be saved in a file called Dockerfile-repnet, we will use this in the next step.

See https://www.tensorflow.org/tfx/serving/docker for examples running the standard CPU image directly from docker.

Build the image
# Define some useful constants
export MODEL=repnet
export PROJECT=my-project
export REGION=us-central1
export REPO=$MODEL
export IMAGE=$REGION-docker.pkg.dev/$PROJECT/$REPO/serve
export CLUSTER_NAME=repnet-serving-cluster

docker build -f Dockerfile-repnet --tag=$IMAGE .

Before building the image, I also set some useful variables that I will be using in the next few commands. You will need to customize these variables for your environment.

Create the Client

There are many useful client examples are in the tensorflow_serving git project. They use standard Google client libraries.

Break out the necessary parts of the Notebook
def served_model(curr_frames):
  request = predict_pb2.PredictRequest()
  request.model_spec.name = 'repnet'
  request.model_spec.signature_name = 'serving_default'

  stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

  # Call the GRPC predict service
  result = stub.Predict(request, 100.0)  # 100 secs timeout

  raw_scores = tf.make_ndarray(result.outputs["output_1"])
  within_period_scores = tf.make_ndarray(result.outputs["output_2"])
  return raw_scores, within_period_scores
def get_counts(model, frames, strides, batch_size,
  """Pass frames through model and conver period predictions to count."""
  seq_len = len(frames)
  raw_scores_list = []
  scores = []
  within_period_scores_list = []

  if fully_periodic:
    within_period_threshold = 0.0

  frames = model.preprocess(frames)

  for stride in strides:
    num_batches = int(np.ceil(seq_len/model.num_frames/stride/batch_size))
    raw_scores_per_stride = []
    within_period_score_stride = []
    for batch_idx in range(num_batches):
      idxes = tf.range(batch_idx*batch_size*model.num_frames*stride,
      idxes = tf.clip_by_value(idxes, 0, seq_len-1)
      curr_frames = tf.gather(frames, idxes)
      curr_frames = tf.reshape(
          [batch_size, model.num_frames, model.image_size, model.image_size, 3])

      # raw_scores, within_period_scores, _ = model(curr_frames)
      raw_scores, within_period_scores = served_model(curr_frames)
      # Extra code deleted for clarity
      # ... 

server = sys.argv[2]
channel = grpc.insecure_channel(server)

video_file = sys.argv[1]
imgs, vid_fps = read_video(video_file)

print('Running RepNet...') 
(pred_period, pred_score, within_period, per_frame_counts, chosen_stride) = get_counts(

# imgs, vid_fps = read_video(PATH_TO_VIDEO_ON_YOUR_DRIVE)

print('pred_period: {}, pred_score: {}, within_period: {}, per_frame_counts: {}, chosen_stride: {}, total_count: {}'.format(pred_period, pred_score, within_period, per_frame_counts, chosen_stride, np.sum(per_frame_counts)))

Here is a link to the full client that connects to our TensorFlow server using GRPC and does pre-processing and post-processing of the video data: main_cli.py

The last output is the sum of the per_frame_counts, this is the actual count of repetitions for the whole video.

Note: this does not exclude per_frame_counts on frames with a low likelihood of periodicity. To do this I would also need to take into account the within_period RepNet output variable. But because I know that my cat, Chamo is wagging his tail throughout the whole video, I don’t need to worry about this step.

Test with a local Docker image

Run the Docker image we already built

docker run -p 8501:8501 -p 8500:8500 ${IMAGE}
Run the client connecting to localhost:8500
❯ python main_cli.py ./videos/tail_wag.mov localhost:8500

video_filename ./videos/tail_wag.mov
Running RepNet...
pred_period: 30.589987653692887, pred_score: 0.8790081739425659, within_period: [0.57247233 0.5955407  0.5723103  0.5649485  0.82702166 0.8717602
 0.8789652  0.88820255 0.9324563  0.9262594  0.93020535 0.94194055

0.03125 0.03125 0.03125 0.03125 0. 0.
0. 0. ], chosen_stride: 2, total_count: 6.41271935403347

RepNet counted correctly, total_count is 6.4 tail wags, it works!

Let’s push our Docker image to our Docker Repository

Create a docker repository
gcloud beta artifacts repositories create $REPO --repository-format=docker --location=$REGION
Push the image
docker push $IMAGE

Remember that the $IMAGE variable was created using the $REPO variable, so docker knows which repository it should push this image to:


Create the Kubernetes cluster

Very simply put Kubernetes is a clustering tool for Docker containers. For a slightly more detailed summary of the architecture see Kubernetes in 5 Minutes

The linked-to video explains the relationship of Pods to Worker Nodes. Pods are the servers required for your application. If your application was a CRUD application your Pod may have a DB, web server, and application server. You can define multiple types of Pods. You can have multiple replicas of your Pods for redundancy and load-balancing. Kubernetes handles distributing all your Pods across the Worker Nodes.

Our Kubernetes cluster configuration simply consists of one Pod with one image for TensorFlow serving, we have 3 replicas for redundancy.
Kubernetes config file
apiVersion: apps/v1
kind: Deployment
  name: repnet-deployment
      app: repnet-server
  replicas: 3
        app: repnet-server
      - name: repnet-container
        image: us-central1-docker.pkg.dev/jiken-dev-tokyo-research/repnet/serve
            memory: "6Gi"
            memory: "3Gi"
        - containerPort: 8500
apiVersion: v1
kind: Service
    run: repnet-service
  name: repnet-service
  - port: 8500
    targetPort: 8500
    app: repnet-server
  type: LoadBalancer
  • requests are checked when scheduling; if a container requests more than is available on a node it won’t be scheduled on that node
  • limits are checked when the container is running; if a container exceeds the limit it will be restricted
For a fuller explanation of resource requests vs limits see: Kubernetes Best Practices: Resource Requests and Limits
Create the cluster from the command line
gcloud container clusters create $CLUSTER --num-nodes 5 --region=$REGION --machine-type n2-standard-4
  • The standard Kubernetes node machine type e2-standard-2 doesn’t have enough memory for our containers, so we use n2-standard-4
Tell the kubectl command about GKE
gcloud config set container/cluster $CLUSTER
gcloud container clusters get-credentials $CLUSTER --region $REGION
Use the kubectl command to configure our cluster using the above Kubernetes config file
kubectl create -f repnet_k8s.yaml
View the status
kubectl get deployments
kubectl get nodes
kubectl get pods
kubectl get services
kubectl describe service repnet-service
kubectl describe pod <Pod ID>

The commands above all check the current status of our Kubernetes deployment on GKE. The describe pod <Pod ID> command is useful to check the status of a Pod. Get the Pod ID with get pods. The pod will not be deployed to a Worker Node if it requests too many resources, you can check if this is happening with describe pod.

Below we show how to get the IP to use for our GRPC client:

$ kubectl describe service repnet-service
Name:                     repnet-service
Namespace:                default
Labels:                   run=repnet-service
Annotations:              <none>
Selector:                 app=repnet-server
Type:                     LoadBalancer
LoadBalancer Ingress:
Port:                     <unset>  8500/TCP
TargetPort:               8500/TCP
NodePort:                 <unset>  31267/TCP
Endpoints:      ,,
Session Affinity:         None
External Traffic Policy:  Cluster
  Type    Reason                Age    From                Message
  ----    ------                ----   ----                -------
  Normal  EnsuringLoadBalancer  5m21s  service-controller  Ensuring load balancer
  Normal  EnsuredLoadBalancer   4m30s  service-controller  Ensured load balancer
Looking at the output above we can see the LoadBalancer Ingress IP,, which we will connect to from our client. And we also see there are 3 Endpoints,,,, which correspond to our 3 replicas specified in the original cluster configuration file, repnet_k8s.yaml.

Run the client and Test

❯ python main_cli.py ./videos/tail_wag.mov
0.03125 0.03125 0.03125 0.03125 0. 0.
0. 0. ], chosen_stride: 2, total_count: 6.41271935403347

If we look at the total_count we see that RepNet counted just over 6, which is the correct amount.

We have successfully built a Kubernetes cluster on Google Kubernetes Engine, running a TensorFlow Server, serving a RepNet model adapted from the RepNet demo Notebook!