Building Scalable Data Science Applications using Containers – Part 6
http://approjects.co.za/?big=en-gb/industry/blog/technetuk/2021/10/14/building-scalable-data-science-applications-using-containers-part-6/
An illustration representing a data warehouse, next to an illustration of Bit the Raccoon.

Welcome to the sixth part of this blog series around using containers for Data Science. In parts one, two, three, four, and five, we provided a number of building blocks that we’ll use here. If this is the first blog you’ve seen, it may be worth skimming the first five parts, or going back and progressing through them. We make a number of assumptions about your familiarity with Docker, storage, and multi-container applications, which were covered previously.

In this article we convert the previous docker-compose application (part five) to one that capitalises on a Kubernetes approach – scalability, resilience, predefined configuration packages with Helm etc.

Reviewing the previous Docker approach’s structure, almost everything sits in a container mounting shared storage.

A diagram of the Docker model created in the previous article

Kubernetes brings a different dimension to how you might consider a solution, and our approach builds on this. In this article, we won’t stretch the bounds of what Kubernetes can do, but we will show how to take an existing containers-based application, and slowly migrate that capability to cloud services with Kubernetes used as an orchestration engine.

This is the revised architecture.

A diagram showing the revised structure of the Docker model

Things to note about this:

  • The processing capability sits in Kubernetes, so you can scale out and up as needed.
  • This is a very simple scenario, so we won’t need capabilities such as load balancers, private networking, GPUs or sophisticated security approaches.
  • We replaced the database container with a Postgres PaaS service. This allows us to take advantage of default scaling and resiliency patterns built into Azure.
  • We use Blob storage instead of container volumes. For a Docker Compose application, or local container approach, volumes make sense. However, blob storage is performant, resilient, flexible in terms of security patterns, and we can share that resource across multiple components.

We won't hard-code passwords, host names and so on within our source as we did in the previous instalment; instead we'll use configurable variables. This is still less secure than it could be, as the values of those environment variables are still visible within Kubernetes configuration files. A more secure approach might use Azure Key Vault with, say, CSI Secrets. However, we want to minimise the length of this blog rather than be distracted by container security; the CSI Secrets link should clarify how to apply this yourself if needed.

For the purposes of this blog, we assume that:

  • You have the Azure CLI installed in your environment.
  • You have the Kubernetes CLI installed in your environment.
  • You have Helm installed in your environment.
  • You have azcopy installed in your environment.

 

Let’s Begin

All the code for this tutorial can be downloaded here.

We’ll hold our application under a single directory tree. Create the aks directory, and then beneath that, create sub-directories called containers/iload, and containers/worker.

As with the previous instalment, we will use the same classic CIFAR image set for our testing. There is a GitHub source that has them in jpg form, which can be downloaded here.

Go into your aks directory and clone the repo. You should see something like the following:

$ cd aks

$ git clone https://github.com/YoongiKim/CIFAR-10-images.git
Cloning into 'CIFAR-10-images'...
remote: Enumerating objects: 60027, done.
remote: Total 60027 (delta 0), reused 0 (delta 0), pack-reused 60027
Receiving objects: 100% (60027/60027), 19.94 MiB | 16.28 MiB/s, done.
Resolving deltas: 100% (59990/59990), done.
Updating files: 100% (60001/60001), done.

$ tree -L 1 .
aks
├── CIFAR-10-images
└── containers

2 directories

 

Blob Storage and your image

Previously, we used container volumes. In this case, we'll use blob storage and all containers will reference the same content. Copy the script below into a file called initialprep.sh. Modify the first three lines to refer to the names of an Azure resource group, storage account, and blob container (note that the Python scripts later in this article expect the blob container to be named cifar). The script will create those resources and upload all the CIFAR images to the storage container. If you already have a resource group, storage account and storage container, feel free to remove the lines that create them.

RGNAME="rg-cifar"
STG="cifarimages"
CON="cifar"   # must match the container_name referenced in iload.py and worker.py
EXPIRES=$(date --date='1 days' "+%Y-%m-%d")
IMAGEDIR="CIFAR-10-images"

# Create environment
az group create -l uksouth -n $RGNAME
az storage account create --name $STG --resource-group $RGNAME --location uksouth --sku Standard_ZRS # Create Storage Account
az storage container create --account-name $STG --name $CON --auth-mode login # Create your storage container

ACCOUNTKEY=$(az storage account keys list --resource-group $RGNAME --account-name $STG | grep -i value | head -1 | cut -d':' -f2 | tr -d [\ \"])

# Generate a temporary SAS key
SAS=$(az storage container generate-sas --account-key $ACCOUNTKEY --account-name $STG --expiry $EXPIRES --name $CON --permissions acldrw | tr -d [\"])

# Determine your URL endpoint
STGURL=$(az storage account show --name $STG --query primaryEndpoints.blob | tr -d [\"])
CONURL="$STGURL$CON"

# Copy the files to your storage container
azcopy cp "$IMAGEDIR" "$CONURL?$SAS" --recursive

When you run this, you should see the resource creation followed by the upload to the storage container.

$ initialprep.sh
{
    "id": "/subscriptions/f14bca45-bd2d-42f2-8a45-1248ab77ba72/resourceGroups/rg-cifar2",
    "location": "uksouth",
    "managedBy": null,
    "name": "rg-cifar2",
    "properties": {
        "provis

Job 8b0ccc36-2050-0a44-496e-c09d979f3169 summary
Elapsed Time (Minutes): 0.8001
Number of File Transfers: 60025
Number of Folder Property Transfers: 0
Total Number of Transfers: 60025
Number of Transfers Completed: 60025
Number of Transfers Failed: 0
Number of Transfers Skipped: 0
TotalBytesTransferred: 83127418
Final Job Status: Completed
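As an aside, the grep/cut/tr pipeline used above to extract the account key works, but it is fragile. A tidier alternative (a sketch, assuming the same variable names as in initialprep.sh) is to let the CLI do the filtering with a JMESPath query:

# Retrieve the first storage account key directly, without grep/cut/tr
ACCOUNTKEY=$(az storage account keys list --resource-group $RGNAME --account-name $STG --query "[0].value" -o tsv)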

 

Database Storage

The previous container-only approach used a Postgres container to record results. Azure provides resilient, scalable services, which are easily configurable, so there’s no need to build our own. Let’s provision one of those services and refer to it later.

Below you can see how to list available Postgres SKU types, where the format is (Model_Generation_Cores), so a Basic single core Gen 5 server would be “B_Gen5_1”.

$ az postgres server list-skus -l uksouth | grep -i id

    "id": "Basic",
        "id": "B_Gen5_1",
        "id": "B_Gen5_2",
    "id": "GeneralPurpose",
        "id": "GP_Gen5_2",
        "id": "GP_Gen5_4",
        "id": "GP_Gen5_8",
        "id": "GP_Gen5_16",
        "id": "GP_Gen5_32",
        "id": "GP_Gen5_64",
    "id": "MemoryOptimized",
        "id": "MO_Gen5_2",
        "id": "MO_Gen5_4",
        "id": "MO_Gen5_8",
        "id": "MO_Gen5_16",
        "id": "MO_Gen5_32",

Choose the smallest server available. We'll allocate a basic single-core server with 50GB of storage. At the time of writing, this cost around £25/month. We could also have chosen a much less expensive Azure SQL Database server for around £5/month with 2GB of storage, but we'd need to change our SQL slightly. We've changed as little as necessary from the previous instalment of this blog, but feel free to make your own optimisations.

Here you can see that I'm provisioning a Postgres server called cifardb with an administrator name of 'jon' and a password of 'P@ssw0rd123'. The command also returns the fully qualified domain name of the server (cifardb.postgres.database.azure.com).

By default, Postgres denies access to all services. You can define private networks to ensure very granular access within and from outside Azure. In this case, we'll provide default access to any Azure service (e.g. Kubernetes). Note that this does not provide access to any external public endpoint.

$ az postgres server create --resource-group rg-cifar --name cifardb --location uksouth --admin-user jon --admin-password "P@ssw0rd123" --sku-name B_Gen5_1 --storage-size 51200
Checking the existence of the resource group 'rg-cifar'...
{
.
.
    "administratorLogin": "jon",
    "password": "P@ssw0rd123",
.
    "fullyQualifiedDomainName": "cifardb.postgres.database.azure.com",
.
}
$

# Allow Azure services (e.g. Kubernetes) to access this
$ az postgres server firewall-rule create --resource-group rg-cifar --server-name cifardb --name "AllowAllLinuxAzureIps" --start-ip-address "0.0.0.0" --end-ip-address "0.0.0.0"

{
    "endIpAddress": "0.0.0.0",
.
    "startIpAddress": "0.0.0.0",
    "type": "Microsoft.DBforPostgreSQL/servers/firewallRules"
}

 

The Kubernetes Cluster

We're now at the stage where the components to be added are containers. Where we previously used Docker, we'll now run them on a Kubernetes cluster. The purpose of this article is not to cover everything Kubernetes can do, but rather to give a simple example of running data science services on Azure Kubernetes Service.

There are many publicly available guides to understanding the fundamentals of Kubernetes, as well as the Azure approach to implementing it. Microsoft has a set of modules that will introduce you to many of the concepts here.

Create a file called aks.sh containing the following, and place this within the aks directory. Replace the resource group, AKS cluster name and Azure Container Registry name with your choices.

RGNAME=rg-cifar
AKSNAME=cifarcluster
ACRNAME=jmcifaracr

# Create an AKS cluster with default settings
az aks create -g $RGNAME -n $AKSNAME --kubernetes-version 1.19.11

# Create an Azure Container Registry
az acr create --resource-group $RGNAME --name $ACRNAME --sku Basic

# Attach the ACR to the AKS cluster
az aks update -n $AKSNAME -g $RGNAME --attach-acr $ACRNAME

This creates a Kubernetes cluster and an Azure Container Registry, and then gives the cluster permission to pull images from the registry. Execute that script.

$ aks.sh
{
.
    "kubernetesVersion": "1.19.11",
.
    "networkProfile": {
        "dnsServiceIp": "10.0.0.10",
.
}

Now we’ll let our local Kubernetes CLI environment (e.g. laptop / desktop) connect to our Azure Kubernetes cluster and confirm that we can see services running.

$ az aks get-credentials --name cifarcluster --resource-group rg-cifar

$ kubectl get services -A

NAMESPACE   NAME                           TYPE        CLUSTER-IP    EXTERNAL-IP PORT(S)       AGE
default     kubernetes                     ClusterIP   10.0.0.1      <none>      443/TCP       27d
kube-system healthmodel-replicaset-service ClusterIP   10.0.243.143  <none>      25227/TCP     27d
kube-system kube-dns                       ClusterIP   10.0.0.10     <none>      53/UDP,53/TCP 27d
kube-system metrics-server                 ClusterIP   10.0.133.242  <none>      443/TCP       27d

This shows the cluster running and that we can control it from our local environment.
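As an optional extra check, recent versions of the Azure CLI include a command that validates the link between the cluster and the registry. This is a sketch, assuming your CLI version supports az aks check-acr and using the names chosen above:

$ az aks check-acr --resource-group rg-cifar --name cifarcluster --acr jmcifaracr.azurecr.io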

 

Sense Check

Let’s confirm where we are in the overall process.

  1. We created an Azure environment to run our application.
  2. We allocated some Azure storage and uploaded 60,000 images.
  3. We created an Azure (Postgres) database.
  4. We set up a Kubernetes environment to run our application.

The final part is to add the application. The key thing to consider with this new approach is that where previously we built all the services, a cloud platform allows us to take advantage of commodity capabilities that are already designed to be scalable and resilient. Looking at the diagram of the cloud version of this application, there are three components outstanding, and each of these uses containers.

  1. The RabbitMQ service to queue requests.
  2. A process to add new image requests to the queue.
  3. A process to take a request off the queue, categorise it and record the result.

 

The Queue Process

The next component in our solution is the queueing mechanism. Previously, we built a RabbitMQ container to manage our requests. We’ll do the same here, but not with a Dockerfile. We could, but let’s show you an alternative approach using Helm. Helm is a Kubernetes package manager that allows you to install and configure applications very easily. We could achieve the same by building our own container, but Helm makes the process trivial, and there are many ready-made applications available. The documentation for installing RabbitMQ using Helm can be found here, but the two lines below are all I needed to get RabbitMQ installed and running in my environment.

$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm install rabbitmq bitnami/rabbitmq

.
.
Credentials:
    echo "Username : user"
    echo "Password : $(kubectl get secret --namespace default rabbitmq -o jsonpath="{.data.rabbitmq-password}" | base64 --decode)"
    echo "ErLang Cookie : $(kubectl get secret --namespace default rabbitmq -o jsonpath="{.data.rabbitmq-erlang-cookie}" | base64 --decode)"
.
.
.
To Access the RabbitMQ AMQP port:
    echo "URL : amqp://127.0.0.1:5672/"
    kubectl port-forward --namespace default svc/rabbitmq 5672:5672
To Access the RabbitMQ Management interface:
    echo "URL : http://127.0.0.1:15672/"
    kubectl port-forward --namespace default svc/rabbitmq 15672:15672

There is some interesting information to note here:

  1. You can delete the deployment later using 'helm delete rabbitmq'.
  2. It provides a means of finding out the default credentials if you didn't supply them as part of the initial configuration.
  3. The 'port-forward' commands shown here allow you to access the RabbitMQ service running in your AKS cluster from your local browser via a local IP address. You will see later that there is actually no external IP exposed in this environment, so this elegantly provides a means of interacting with your service.
$ echo "Username : user"
Username : user
$ echo "Password : $(kubectl get secret --namespace default rabbitmq -o jsonpath="{.data.rabbitmq-password}" | base64 --decode)"
Password : 7TrP8KOVdC

We’ll need these credentials in a minute. In the meantime, let’s see what was deployed in our environment:

$ kubectl get services
NAME                TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                 AGE
kubernetes          ClusterIP   10.0.0.1       <none>        443/TCP                                 27d
rabbitmq            ClusterIP   10.0.180.137   <none>        5672/TCP,4369/TCP,25672/TCP,15672/TCP   15m
rabbitmq-headless   ClusterIP   None           <none>        4369/TCP,5672/TCP,25672/TCP,15672/TCP   15m

$ kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
rabbitmq-0   1/1     Running   0          16m

As there is no external IP address, use the port-forward command so we can interact with RabbitMQ.

$ kubectl port-forward --namespace default svc/rabbitmq 15672:15672 &
[1] 88032
Forwarding from 127.0.0.1:15672 -> 15672
Forwarding from [::1]:15672 -> 15672

The RabbitMQ login screen

If we now add the credentials extracted earlier, we can see our running RabbitMQ environment.

The newly setup RabbitMQ environment

 

The Initial Load

This process performs two functions. First, it connects to our Postgres environment and creates the CATEGORY_RESULTS table if it doesn’t already exist, and then it queues all the images that were uploaded to the storage account earlier so they can be classified. In this example we’re running this as a one-off, but you could also take a more sophisticated approach using a location argument for daily, or ad-hoc batches of images.

Go into the containers/iload directory and create a file called iload.py containing the following:

#!/usr/bin/env python
import sys, os, json, pika
import psycopg2
from azure.storage.blob import ContainerClient

# Get Environment Vars
RMQ_USER=os.environ["RMQ_USER"] # RabbitMQ Username
RMQ_PASS=os.environ["RMQ_PASS"] # RabbitMQ Password
RMQ_HOST=os.environ["RMQ_HOST"] # RabbitMQ Hostname
SQL_HOST=os.environ["SQL_HOST"] # SQL Hostname
SQL_DB=os.environ["SQL_DB"] # SQL Database
SQL_USER=os.environ["SQL_USER"] # SQL Username
SQL_PASS=os.environ["SQL_PASS"] # SQL Password
STG_ACNAME=os.environ["STG_ACNAME"] # Storage Account Name
STG_ACKEY=os.environ["STG_ACKEY"] # Storage Account Key

# Set up database table if needed
cmd = """
                CREATE TABLE IF NOT EXISTS CATEGORY_RESULTS (
                FNAME VARCHAR(1024) NOT NULL,
                CATEGORY NUMERIC(2) NOT NULL,
                PREDICTION NUMERIC(2) NOT NULL,
                CONFIDENCE REAL);
      """
pgconn = psycopg2.connect(user=SQL_USER, password=SQL_PASS,
                host=SQL_HOST, port="5432", database=SQL_DB)
cur = pgconn.cursor()
cur.execute(cmd)
cur.close()
pgconn.commit()

# Load all images in defined storage account
CONNECTION_STRING="DefaultEndpointsProtocol=https" + \
    ";EndpointSuffix=core.windows.net" + \
    ";AccountName="+STG_ACNAME+";AccountKey="+STG_ACKEY
ROOT="/CIFAR-10-images" # This is where the images are held
container = ContainerClient.from_connection_string(CONNECTION_STRING, container_name="cifar")

rLen = len(ROOT)
classes = ('airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Determine the expected category by parsing the directory (after the root path)
def fnameToCategory(fname):
    for c in classes:
        if (fname.find(c) > rLen):
            return (classes.index(c))
    return -1 # This should never happen

IMGS=[]
blob_list = container.list_blobs()
for blob in blob_list:
    if blob.name.endswith(('.png', '.jpg', '.jpeg')):
        cat = fnameToCategory(blob.name)
        data = {"image" : blob.name, "category": cat, "catName": classes[cat]}
        message = json.dumps(data)
        IMGS.append(message)
print("Number of Images to add to queue = ", len(IMGS))

# Now write them into the queue
credentials = pika.PlainCredentials(RMQ_USER, RMQ_PASS)
parameters = pika.ConnectionParameters(RMQ_HOST, 5672, '/', credentials)
connection = pika.BlockingConnection(parameters)
channel = connection.channel()
channel.queue_declare(queue='image_queue', durable=True)

for i in IMGS:
    channel.basic_publish( exchange='', routing_key='image_queue', body=i,
        properties=pika.BasicProperties(delivery_mode=2,)
    )
    print("Queued ", i)

connection.close()

As with the previous version of this application, the script extracts all image names in our storage location and adds them to a queue to be classified. The first key difference with this version is that our images aren’t stored in a container’s local disk, but in an Azure storage account so we’ll need our blob storage credentials.

The second thing to note is that we’re using environment variables within the code. This means that the script can refer to customised and changing services without a need to continually modify the code. You can use the same code against different data sources, queues, or storage accounts.

In the containers/iload directory create a file called Dockerfile containing the following.

FROM ubuntu

RUN apt-get update
RUN apt-get install -y python3 python3-pip

RUN apt-get update && apt-get install -y poppler-utils net-tools vim
RUN pip install azureml-sdk
RUN pip install azureml-sdk[notebooks]
RUN pip install azure.ai.formrecognizer
RUN pip install azure.storage.blob
RUN pip install jsonify
RUN pip install pika
RUN pip install psycopg2-binary

ADD iload.py /

CMD ["python3", "./iload.py" ]

This simply defines a container with Python installed, and relevant libraries to access Azure storage, Postgres, and RabbitMQ.

Within that directory, build the container, and then we’ll then move it to our Azure Container Registry.

$ docker build -t iload .
.
.
=> writing image sha256:4ef19e469755572da900ec15514a4a205953a457c4f06f2795b150db3f2b11eb 
=> naming to docker.io/library/iload

Now we’ll log in to our Azure Container Registry, tag our local image against a target image in the remote repository, and then push it to Azure. We’ll also confirm that it is there, by doing an Azure equivalent of a docker images (az acr repository list…). Note that we are prefixing the image tag with the name of the Azure Container Registry (jmcifaracr.azurecr.io).

# Log in to the Azure Container Registry
$ az acr login -n jmcifaracr
Login Succeeded

$ docker tag iload jmcifaracr.azurecr.io/iload:1.0

$ docker images
REPOSITORY                    TAG      IMAGE ID       CREATED          SIZE
iload                         latest   4ef19e469755   32 minutes ago   1.23GB
jmcifaracr.azurecr.io/iload   1.0      4ef19e469755   32 minutes ago   1.23GB

$ docker push jmcifaracr.azurecr.io/iload:1.0
The push refers to repository [jmcifaracr.azurecr.io/iload]
6dfdee2e824f: Pushed
e35525d1f4bf: Pushed
.
.
4942a1abcbfa: Pushed
1.0: digest: sha256:e9d606e50f08c682969afe4f59501936ad0706c4a81e43d281d66073a9d4ef28 size: 2847

$ az acr repository list --name jmcifaracr --output table
Result
--------
iload

We’re almost there.

Kubernetes has a number of ways of executing workloads. The two we're interested in specifically are deployments and jobs. The key difference is that a job is executed once, whereas a deployment is expected to remain operational; if anything happens to the process, Kubernetes will attempt to keep that resource operational. In other words, if a container dies, it will be restarted.

For the iload process, we only want it to load our 60,000 images and then terminate. We don't want to load the images and then have the container restart, only to load them again and again. To run this job, we'll provide a configuration file containing the job details and submit it to Kubernetes.

In the containers/iload directory, create a file called iload-job.yml with the following:

apiVersion: batch/v1
kind: Job
metadata:
    name: iload
spec:
    template:
        spec:
            containers:
            - name: iload
              image: jmcifaracr.azurecr.io/iload:1.0
              imagePullPolicy: Always
              env:
                  - name: RMQ_USER
                    value: "user"
                  - name: RMQ_PASS
                    value: "7TrP8KOVdC"
                  - name: RMQ_HOST
                    value: "rabbitmq"
                  - name: SQL_HOST
                    value: "cifardb.postgres.database.azure.com"
                  - name: SQL_DB
                    value: "postgres"
                  - name: SQL_USER
                    value: "jon@cifardb.postgres.database.azure.com"
                  - name: SQL_PASS
                    value: "P@ssw0rd123"
                  - name: STG_ACNAME
                    value: "cifarimages"
                  - name: STG_ACKEY
                    value: "xxxxxxxxxxxxxxxx"
              resources:
                  requests:
                      cpu: 500m
                      memory: 512Mi
                  limits:
                      cpu: 500m
                      memory: 512Mi
            restartPolicy: Never

Let’s spend some time looking at this.

The job is going to process the images just uploaded to the storage container. All the environment variables used by the script are defined here, so we could run it with different values while keeping our source code stable. We are using the RabbitMQ and Postgres credentials shown earlier and, in addition, referencing the blob storage key and container derived earlier.

Note that the passwords are shown here in clear text. Ideally, we would use something like Azure Key Vault, where none of this information is visible, or a more secure approach using CSI Secrets, where nothing sensitive is exposed outside of the container.
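As a small step in that direction, the sensitive values could be held in a Kubernetes Secret and referenced from the job rather than embedded in it. The sketch below assumes a secret named cifar-secrets and uses the example credentials from this article; the corresponding env entries would then use valueFrom.secretKeyRef rather than literal values:

$ kubectl create secret generic cifar-secrets \
    --from-literal=RMQ_PASS='7TrP8KOVdC' \
    --from-literal=SQL_PASS='P@ssw0rd123' \
    --from-literal=STG_ACKEY='xxxxxxxxxxxxxxxx'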

If we kick off that job using kubectl, you will see it being deployed, and a pod created. Once the job completes, you can also see that the container logs show the job’s progress.

$ kubectl apply -f iload-job.yml
job.batch/iload created

$ kubectl get pods
NAME          READY   STATUS    RESTARTS   AGE
iload-gpgqg   1/1     Running   0          41s
rabbitmq-0    1/1     Running   0          159m

$ kubectl get jobs
NAME    COMPLETIONS   DURATION   AGE
iload   1/1           62s        17m

$ kubectl logs iload-gpgqg
.
.
.
Queued {"image": "CIFAR-10-images/train/truck/4992.jpg", "category": 9, "catName": "truck"}
Queued {"image": "CIFAR-10-images/train/truck/4993.jpg", "category": 9, "catName": "truck"}
Queued {"image": "CIFAR-10-images/train/truck/4994.jpg", "category": 9, "catName": "truck"}
Queued {"image": "CIFAR-10-images/train/truck/4995.jpg", "category": 9, "catName": "truck"}
Queued {"image": "CIFAR-10-images/train/truck/4996.jpg", "category": 9, "catName": "truck"}
Queued {"image": "CIFAR-10-images/train/truck/4997.jpg", "category": 9, "catName": "truck"}
Queued {"image": "CIFAR-10-images/train/truck/4998.jpg", "category": 9, "catName": "truck"}
Queued {"image": "CIFAR-10-images/train/truck/4999.jpg", "category": 9, "catName": "truck"}

If you return to the RabbitMQ dashboard, you will see the queue contents increase from zero to 60,000 items. At its peak, the job added around 3,500 requests per second.

Showing the queue contents increased to 60,000

The final component in our application is the worker process. Its role is to take an item off the queue, classify it, and then record the accuracy of the prediction.

Go into the containers/worker directory and create a file called worker.py containing the following:

#!/usr/bin/env python

from mxnet import gluon, nd, image
import mxnet as mx
from mxnet.gluon.data.vision import transforms
from gluoncv import utils
from gluoncv.model_zoo import get_model
import psycopg2
import pika, time, os, json
from azure.storage.blob import ContainerClient

import cv2
import numpy as np

# Get Environment Vars
RMQ_USER=os.environ["RMQ_USER"] # RabbitMQ Username
RMQ_PASS=os.environ["RMQ_PASS"] # RabbitMQ Password
RMQ_HOST=os.environ["RMQ_HOST"] # RabbitMQ Hostname
SQL_HOST=os.environ["SQL_HOST"] # SQL Hostname
SQL_DB=os.environ["SQL_DB"] # SQL Database
SQL_USER=os.environ["SQL_USER"] # SQL Username
SQL_PASS=os.environ["SQL_PASS"] # SQL Password
STG_ACNAME=os.environ["STG_ACNAME"] # Storage Account Name
STG_ACKEY=os.environ["STG_ACKEY"] # Storage Account Key
LOGTODB=int(os.environ["LOGTODB"]) # Log data to Database? (1 = yes, 0 = no)

# Location of Images on blob storage
CONNECTION_STRING="DefaultEndpointsProtocol=https" + \
    ";EndpointSuffix=core.windows.net" + \
    ";AccountName="+STG_ACNAME+";AccountKey="+STG_ACKEY

container = ContainerClient.from_connection_string(CONNECTION_STRING, container_name="cifar")

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
net = get_model('cifar_resnet110_v1', classes=10, pretrained=True)

transform_fn = transforms.Compose([
        transforms.Resize(32), transforms.CenterCrop(32), transforms.ToTensor(),
        transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])
    ])

def predictCategory(fname):
    blob_client = container.get_blob_client(fname)
    imgStream = blob_client.download_blob().readall()
    img = mx.ndarray.array(cv2.imdecode(np.frombuffer(imgStream, np.uint8), -1))
    img = transform_fn(img)
    
    pred = net(img.expand_dims(axis=0))
    ind = nd.argmax(pred, axis=1).astype('int')
    print('%s is classified as [%s], with probability %.3f.'%
        (fname, class_names[ind.asscalar()], nd.softmax(pred)[0][ind].asscalar()))
    return ind.asscalar(), nd.softmax(pred)[0][ind].asscalar()

def InsertResult(connection, fname, category, prediction, prob):
    count=0
    try:
        cursor = connection.cursor()
        qry = """ INSERT INTO CATEGORY_RESULTS (FNAME, CATEGORY, PREDICTION, CONFIDENCE) VALUES (%s,%s,%s,%s)"""
        record = (fname, category, prediction, prob)
        cursor.execute(qry, record)

        connection.commit()
        count = cursor.rowcount

    except (Exception, psycopg2.Error) as error :
        if(connection):
            print("Failed to insert record into category_results table", error)
    finally:
        cursor.close()
        return count

# Routine to pull message from queue, call classifier, and insert result to the DB
def callback(ch, method, properties, body):
    data = json.loads(body)
    fname = data['image']
    cat = data['category']
    pred, prob = predictCategory(fname)
    if (LOGTODB == 1):
        count = InsertResult(pgconn, fname, int(cat), int(pred), float(prob))
    else:
        count = 1 # Ensure the message is ack'd and removed from queue
    
    if (count > 0):
        ch.basic_ack(delivery_tag=method.delivery_tag)
    else:
        ch.basic_nack(delivery_tag=method.delivery_tag)

pgconn = psycopg2.connect(user=SQL_USER, password=SQL_PASS,
                          host=SQL_HOST, port="5432", database=SQL_DB)
credentials = pika.PlainCredentials(RMQ_USER, RMQ_PASS)
parameters = pika.ConnectionParameters(RMQ_HOST, 5672, '/', credentials)
connection = pika.BlockingConnection(parameters)

channel = connection.channel()

channel.queue_declare(queue='image_queue', durable=True)
print(' [*] Waiting for messages. To exit press CTRL+C')

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue='image_queue', on_message_callback=callback)

channel.start_consuming()

The main function of this hasn't changed since the previous instalment. It takes a request from the queue containing an image's location and its expected category, and returns a predicted category and a confidence value. It also stores these values in a database if desired.

Like the iload process, the key differences here are as follows:

  1. The configuration is based on environment variables, where previously they were hard coded.
  2. The images are stored in blob storage, and not on local disk.

We also added the ability to turn results logging on or off via the LOGTODB environment variable, so you might want to play with this to determine the performance impact of logging.
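For example, once the worker deployment described below is running, you could toggle logging without rebuilding the image. This is a sketch, assuming the deployment is named worker as in this article; changing an environment variable triggers a rolling restart of the pods:

$ kubectl set env deployment/worker LOGTODB=0    # disable database logging
$ kubectl set env deployment/worker LOGTODB=1    # switch it back on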

In the containers/worker directory create a file called Dockerfile containing the following.

FROM ubuntu

RUN apt-get update
RUN apt-get install -y python3 python3-pip

RUN pip3 install --upgrade mxnet gluoncv pika
RUN pip3 install psycopg2-binary

RUN pip install azureml-sdk
RUN pip install azureml-sdk[notebooks]
RUN pip install azure.ai.formrecognizer
RUN pip install azure.storage.blob
RUN pip install opencv-python

ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get install ffmpeg libsm6 libxext6 -y

# Add worker logic necessary to process queue items
ADD worker.py /

# Start the worker
CMD ["python3", "./worker.py" ]

Again, this is relatively straightforward. You build a container with the requisite Azure, Python, RabbitMQ, and machine learning libraries installed.

As with the iload process, you need to build a local container, tag it against a target image in the Azure Container Registry and then push it to Azure.

$ docker build -t worker .
.
.
=> [12/12] ADD worker.py /
=> exporting to image
=> => exporting layers
=> => writing image sha256:9716e1e98687cfc3dd5f66640e441e4aa24131ffb3b3bd4c5d0267a06abcc802
=> => naming to docker.io/library/worker

$ docker tag worker jmcifaracr.azurecr.io/worker:1.0
$ docker images
REPOSITORY                     TAG      IMAGE ID       CREATED              SIZE
worker                         latest   9716e1e98687   About a minute ago   2.24GB
jmcifaracr.azurecr.io/worker   1.0      9716e1e98687   About a minute ago   2.24GB
iload                          latest   4ef19e469755   3 hours ago          1.23GB
jmcifaracr.azurecr.io/iload    1.0      4ef19e469755   3 hours ago          1.23GB

$ docker push jmcifaracr.azurecr.io/worker:1.0
The push refers to repository [jmcifaracr.azurecr.io/worker]
.
.

$ az acr repository list --name jmcifaracr --output table
Result
--------
iload
worker

Now we need to provide a deployment file for the worker process. This defines how it is run within Kubernetes.

In the containers/worker directory, create a file called worker-deployment.yml containing the following:

apiVersion: apps/v1
kind: Deployment
metadata:
    name: worker
spec:
    replicas: 1
    selector:
        matchLabels:
            app: worker
    template:
        metadata:
            labels:
                app: worker
        spec:
            containers:
            - name: worker
              image: jmcifaracr.azurecr.io/worker:1.0
              imagePullPolicy: Always
              env:
                  - name: RMQ_USER
                    value: "user"
                  - name: RMQ_PASS
                    value: "7TrP8KOVdC"
                  - name: RMQ_HOST
                    value: "rabbitmq"
                  - name: SQL_HOST
                    value: "cifardb.postgres.database.azure.com"
                  - name: SQL_DB
                    value: "postgres"
                  - name: SQL_USER
                    value: "jon@cifardb.postgres.database.azure.com"
                  - name: SQL_PASS
                    value: "P@ssw0rd123"
                  - name: STG_ACNAME
                    value: "cifarimages"
                  - name: STG_ACKEY
                    value: "xxxxxxxx"
                  - name: LOGTODB
                    value: "1"
              resources:
                  requests:
                      cpu: 100m
                      memory: 128Mi
                  limits:
                      cpu: 150m
                      memory: 128Mi

Let’s spend a bit of time going through this as well.

First, this is a deployment, and it ensures that there is always a defined number of replicas (or pods in this case) running. This configuration uses a single pod, but when we increase this number later, you’ll see how it affects the environment and performance. Second, each pod is allocated an amount of memory and CPU. Some processes are memory intensive, and others compute centric. You can decide how much to dedicate to each pod type.

Let’s deploy that container and evaluate the performance.

$ kubectl apply -f worker-deployment.yml
deployment.apps/worker created

$ kubectl get deployments
NAME     READY   UP-TO-DATE   AVAILABLE   AGE
worker   1/1     1            1           52s

$ kubectl get pods
NAME                      READY   STATUS      RESTARTS   AGE
iload-gpgqg               0/1     Completed   0          110m
rabbitmq-0                1/1     Running     0          4h29m
worker-5df6cb8cb7-qnwtq   1/1     Running     0          54s

You can see that there is an active deployment and a single worker running. This is the view from the RabbitMQ dashboard – 1.8 requests on average per second.

The RabbitMQ dashboard showing 1.8 requests per second

Increase the number of parallel workers to 5 by modifying the replica count in the worker-deployment.yml file and redeploying it. You will then have 5 pods. Each worker takes a request from the queue, performs the image classification, and writes the content to Postgres.

$ kubectl apply -f worker-deployment.yml
deployment.apps/worker configured

$ kubectl get deployments
NAME     READY   UP-TO-DATE   AVAILABLE   AGE
worker   1/1     1            1           52s

$ kubectl get pods
NAME                      READY   STATUS      RESTARTS   AGE
iload-gpgqg               0/1     Completed   0          112m
rabbitmq-0                1/1     Running     0          4h32m
worker-5df6cb8cb7-flqp4   1/1     Running     0          51s
worker-5df6cb8cb7-hsl2p   1/1     Running     0          51s
worker-5df6cb8cb7-qnwtq   1/1     Running     0          3m32s
worker-5df6cb8cb7-v9t6p   1/1     Running     0          51s
worker-5df6cb8cb7-x4dt4   1/1     Running     0          51s

Performance has now increased to an average of 8.8 requests per second.

Showing 8.8 requests per second in RabbitMQ

Here is a view of performance after increasing the replica count even further to 20 (35 requests per second).

Showing the increased performance after improving replica count

And then 35 workers (55 requests per second).

Showing the increased performance after improving worker count

This isn't linear scalability, nor is it an invitation to simply increase the number of workers to 500. Each Kubernetes node has a limited amount of physical resource. During our tests, we achieved 70 requests per second after playing with how much memory and CPU were allocated to each pod. This is an exercise for you to consider with your own workloads. What should be understood, though, is that you can scale your service as needed, with the underlying Kubernetes architecture supporting that: more pods, nodes, or clusters as required.
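Rather than editing the deployment file for every experiment, you can also adjust the replica count imperatively and watch the effect on pods and nodes. This is a sketch, assuming the app=worker label used above and that metrics-server is available (it is deployed by default on AKS, as seen earlier):

$ kubectl scale deployment/worker --replicas=20   # scale out the workers
$ kubectl get pods -l app=worker                  # confirm the new pods are running
$ kubectl top pods                                # compare usage against the requests/limits
$ kubectl top nodes                               # check overall node headroom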

 

Conclusions and Considerations

This article showed how to take an existing multi-container Docker application and migrate it to Azure Kubernetes Service. Where possible, commodity PaaS capabilities were used (database, storage etc.). We also showed how to deploy a publicly available package using Helm.

The previous instalment of this blog used only containers, writing the results to Postgres. We did the same here, but there's nothing to suggest a need to query the results immediately. If this were performance critical, we might consider writing the results to a file and then batch-uploading them to a database later for analysis, which would be much more efficient.
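To illustrate that idea (we don't implement it here), a worker could append its results to a CSV file and a separate process could bulk-load the file into Postgres. A sketch, assuming a hypothetical results.csv whose columns match the category_results table:

$ psql "host=cifardb.postgres.database.azure.com port=5432 dbname=postgres user=jon@cifardb.postgres.database.azure.com sslmode=require" \
    -c "\copy category_results FROM 'results.csv' WITH (FORMAT csv)"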

Our application is tiny, and arguably too small to justify an entire Kubernetes environment. However, a Kubernetes environment normally runs many different applications simultaneously within private networks, using well-defined security and performance monitoring, and with much more flexibility in terms of scalability and cost optimisation. Since you are only charged for the Kubernetes environment, not the number of pods, you can run as many or as few applications as you like in that environment, subject to capacity.

You might also want to consider adding a node pool of GPU nodes, which can dramatically change your performance where your applications are able to use the underlying GPUs. More information can be found here.

The articles in this series have focused on the basics of containers on Azure to address some common data science patterns, assuming a current interest in using on-premises containers to deliver data science solutions.

We haven't considered the use of MLOps, where you might approach machine learning and data science with the same rigour, governance, and outcome transparency offered to software development. Nor have we considered the use of Azure Machine Learning, where you might want to replace some of your historical code with PaaS machine learning capabilities and optimised compute.

Future instalments may look at these, incorporating your containers with these prebuilt Azure capabilities.

Note: If you've finished this tutorial and created a specific resource group to try it, you may want to remove that group to ensure you're not being charged for resources you no longer need.
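For example, if everything in this article lives in a dedicated resource group called rg-cifar, a single command removes it all (make sure nothing else you care about is in that group):

$ az group delete --name rg-cifar --yes --no-wait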

 

About the authors

Jon Machtynger

Jon is a Microsoft Cloud Solution Architect specialising in Advanced Analytics & Artificial Intelligence with over 30 years of experience in understanding, translating and delivering leading technology to the market. He currently focuses on a small number of global accounts helping align AI and Machine Learning capabilities with strategic initiatives. He moved to Microsoft from IBM where he was Cloud & Cognitive Technical Leader and an Executive IT Specialist.

Jon has been the Royal Academy of Engineering Visiting Professor for Artificial Intelligence and Cloud Innovation at Surrey University since 2016, where he lectures on various topics from machine learning, and design thinking to architectural thinking.

A photo of Mark Whitby

Mark has worked at Microsoft for five and a half years with a focus on helping customers adopt cloud native technologies. Before Microsoft, he spent around twenty years in the financial services industry, primarily at major UK banks, where he worked in various roles across operations, engineering and architecture. He loves discovering new technologies, learning them in depth and teaching others.

Building Scalable Data Science Applications using Containers – Part 5
http://approjects.co.za/?big=en-gb/industry/blog/technetuk/2021/08/27/building-scalable-data-science-applications-using-containers-part-5/
An illustration representing a data warehouse, next to an illustration of Bit the Raccoon.

Welcome to the fifth part of this blog series around using containers for Data Science. In parts one, two, three, and four, I covered a number of building blocks that we'll use here. If this is the first blog you've seen, it's worth skimming the first four parts, or even going back and progressing through them. I make a number of assumptions about your familiarity with Docker, storage, and multi-container applications, covered in these previous blogs.

The objective of this blog is to:

  • Build a common data science pattern using multiple components that will be held in containers.
  • Provide some considerations for scalability and resilience.
  • Use this as the foundation for an Azure Kubernetes Service deployment in a subsequent post.

This blog will not always demonstrate good data science practice; I'd rather focus on exposing patterns that are worth being aware of and that provide a catalyst for learning. There are many other sources for performance optimisation and architectural robustness, but these require a broader level of understanding than I assume in this article. However, I will usually point out when poor practice is being demonstrated.

For example, some of the performance patterns here aren't ideal – the database in the diagram below is a bottleneck and will constrain performance. The remit of this blog is to build slowly on core principles, show how to work with them, and use these as a basis for further understanding.

This will be a two-part post. In the first part, we will build the environment locally using docker-compose and make some observations about limitations. In the second, we will migrate the functionality across to Azure Kubernetes Service.

We’ll use a simple image classification scenario requiring a number of technical capabilities. These form a simple process that classifies a pipeline of images into one of 10 categories.

The scenario we’ll be building assumes many typical project constraints:

  1. We have no control of how many or how fast the images arrive.
  2. The classification model has been pretrained.
  3. Every image must be classified. In other words, we cannot just ignore errors or crashes.
  4. We need to record classification results in a resilient data store.
  5. As we can’t control our incoming workload, we need to scale our classification as required – up to accommodate throughput, or down to control costs.
  6. We will monitor throughput, performance, and accuracy to allow us to scale our resources and potentially detect statistical drift.

A diagram showing how the project will work

The overall application is outlined in the diagram above. We want to provide some resilience; for example, a failed categorisation due to a crashed process will be retried. At this stage we won't provide a highly available database, or an event queue with automatic failover, but we may consider this when we move it to Kubernetes.

 

Let’s begin

We'll hold our application under a single directory tree. Create the containers directory, and then beneath that, create four sub-directories named postgres, python, rabbitmq and worker.

$ mkdir -p containers/postgres/ containers/python/ containers/rabbitmq/ containers/worker
$ tree containers

containers/
├── postgres
├── python
├── rabbitmq
└── worker

 

Create Your Persistent storage

Containers are designed to be disposable, stateless processes. We'll need to ensure that whenever a container terminates, its state can be remembered. We'll do that using Docker volumes for persistent storage. As our overall architecture diagram shows, we'll need this for Postgres, RabbitMQ, and for holding our images.

Create the docker volumes and then confirm they’re there.

$ docker volume create scalable-app_db_data         # for Postgres
scalable-app_db_data
$ docker volume create scalable-app_image_data      # to hold our images
scalable-app_image_data
$ docker volume create scalable-app_mq_data         # for Rabbit data
scalable-app_mq_data
$ docker volume create scalable-app_mq_log          # for Rabbit logs
scalable-app_mq_log
$ docker volume ls
DRIVER              VOLUME NAME
local               scalable-app_db_data
local               scalable-app_image_data
local               scalable-app_mq_data
local               scalable-app_mq_log
$

 

Load the Source Images

I’ve used a publicly available set of images for classification – the classic CIFAR data set. Data sets are often already post-processed to allow for easy inclusion into machine learning code. I found a source that has them in jpg form, which can be downloaded here.

We’ll first clone the CIFAR image repository, then load those images into a volume using a tiny alpine container and show that they have been copied to the persistent volume.  We’ll also use this volume as part of the process to queue and categorise each image.  Note that in the text below, you can refer to a running container by the prefix of its identity if it is unique.  Hence ‘343’ below refers to the container with an ID uniquely beginning with ‘343’.

$ mkdir images
$ cd images
$ git clone https://github.com/YoongiKim/CIFAR-10-images.git
Cloning into 'CIFAR-10-images'...
remote: Enumerating objects: 60027, done.
remote: Counting objects: 100% (60027/60027), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 60027 (delta 59990), reused 60024 (delta 59990), pack-reused 0
Receiving objects: 100% (60027/60027), 19.94 MiB | 2.75 MiB/s, done.
Resolving deltas: 100% (59990/59990), done.
Checking out files: 100% (60001/60001), done.
$
$ docker run --rm -itd -v scalable-app_image_data:/images alpine
343b5e3ad95a272810e51ada368c1c6e070f83df1c974e88a583c17462941337
$
$ docker cp CIFAR-10-images 343:/images
$ docker exec -it 343 ls -lr /images/CIFAR-10-images/test/cat | head
total 4000
-rw-r--r--    1 501      dialout        954 Dec 22 12:50 0999.jpg
-rw-r--r--    1 501      dialout        956 Dec 22 12:50 0998.jpg
-rw-r--r--    1 501      dialout        915 Dec 22 12:50 0997.jpg
-rw-r--r--    1 501      dialout        902 Dec 22 12:50 0996.jpg
-rw-r--r--    1 501      dialout        938 Dec 22 12:50 0995.jpg
-rw-r--r--    1 501      dialout        957 Dec 22 12:50 0994.jpg
-rw-r--r--    1 501      dialout        981 Dec 22 12:50 0993.jpg
-rw-r--r--    1 501      dialout        889 Dec 22 12:50 0992.jpg
-rw-r--r--    1 501      dialout        906 Dec 22 12:50 0991.jpg
$ docker stop 343
343

 

The Queueing Service

We’ll process images by adding them to a queue and letting worker processes simply take them from the queue. This allows us to scale our workers and ensure some resilience around the requests. I’ve chosen RabbitMQ as it’s very easy to use and accessible from many programming languages.

To create the RabbitMQ service, create a Dockerfile in the containers/rabbitmq directory with the following:

FROM rabbitmq:3-management

EXPOSE 5672
EXPOSE 15672

 

Now go into that directory and build it:

$ docker build -t rabbitmq .
Sending build context to Docker daemon  14.85kB
Step 1/3 : FROM rabbitmq:3-management
3-management: Pulling from library/rabbitmq
.
.
.
Digest: sha256:e1ddebdb52d770a6d1f9265543965615c86c23f705f67c44f0cef34e5dc2ba70
Status: Downloaded newer image for rabbitmq:3-management
---> db695e07d0d7
Step 2/3 : EXPOSE 5672
---> Running in 44098f35535c
Removing intermediate container 44098f35535c
---> 7406a95c39b3
Step 3/3 : EXPOSE 15672
---> Running in 388bcbf65e3f
Removing intermediate container 388bcbf65e3f
---> db76ef2233d1
Successfully built db76ef2233d1
Successfully tagged rabbitmq:latest
$

 

Now start a container based on that image:

$ docker run -itd  -v "scalable-app_mq_log:/var/log/rabbitmq" -v "scalable-app_mq_data:/var/lib/rabbitmq" --name "rabbitmq" --hostname rabbitmq -p 15672:15672 -p 5672:5672 rabbitmq
f02ae9d41778968ebcd2420fe5cfd281d9b5df84f27bd52bd23e1735db828e18
$

 

If you open up a browser and go to localhost:15672, you should see the following:

A screenshot of the RabbitMQ login screen

Log in with username guest and password guest, and you should see something like the following:

The default screen you will see after logging into RabbitMQ

This will allow us to monitor queues.
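If you prefer the command line, the same management plugin also exposes an HTTP API on the same port. A quick sketch using the default guest credentials and the port mapping above:

$ curl -s -u guest:guest http://localhost:15672/api/overview
$ curl -s -u guest:guest http://localhost:15672/api/queues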

Go to the containers/python directory and create a new file called fill_queue.py. The code below finds all the images to be categorised and adds them to our queue.

I start at the mounted directory of images and do a tree walk, finding every image (ending in png, jpg, or jpeg). I use the location in the full path to determine what the expected category is (fnameToCategory), and build up an array of JSON payloads.

I then connect to the RabbitMQ server. Note that HOSTNAME is defined as your Docker host's IP address rather than 'localhost', because the python container has a different localhost than the RabbitMQ container.

I declare a new channel and queue, and publish each IMGS entry as a separate message.

There is a debugging print to show the number of images.  If all goes well, you shouldn’t see this as it will scroll off the screen. Hopefully, you see thousands of messages showing progress.

#!/usr/bin/env python
import pika
import sys
import os
import json

ROOT="https://www.microsoft.com/images"
rLen = len(ROOT)
classes = ('airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
HOSTNAME="<Enter your Docker host’s IP address>"

# Determine the expected category by parsing the directory (after the root path)
def fnameToCategory(fname):
    for c in classes:
        if (fname.find(c) > rLen):
            return (classes.index(c))
    return -1 # This should never happen

IMGS=[]
for root, dirs, files in os.walk(ROOT):
    for filename in files:
        if filename.endswith(('.png', '.jpg', '.jpeg')):
            fullpath=os.path.join(root, filename)
            cat = fnameToCategory(fullpath)
            data = {
                "image" : fullpath,
                "category": cat,
                "catName": classes[cat]
            }
            message = json.dumps(data)
            IMGS.append(message)

connection = pika.BlockingConnection(pika.ConnectionParameters(host=HOSTNAME))
channel = connection.channel()

channel.queue_declare(queue='image_queue', durable=True)

print("Number of Images = ", len(IMGS))

for i in IMGS:
    channel.basic_publish( exchange='', routing_key='image_queue', body=i,
        properties=pika.BasicProperties( delivery_mode=2,  )
    )
    print("Queued ", i)

connection.close()

 

In the same containers/python directory, create a Dockerfile for your python engine:

FROM python:3.7-alpine

# Add core OS requirements
RUN apk update && apk add bash vim

# Add Python Libraries
RUN pip install pika

ADD  fill_queue.py /

 

Now build the Docker image:

$ docker build -t python .
Sending build context to Docker daemon   27.8MB
Step 1/4 : FROM python:3.7-alpine
---> 459651397c21
Step 2/4 : RUN apk update && apk add bash vim
---> Running in dc363417cf12
.
.
.
Successfully installed pika-1.1.0
Removing intermediate container b40f1782f0c1
---> 35891fccb860
Step 4/4 : ADD  fill_queue.py /
---> 17cd19050b21
Successfully built 17cd19050b21
Successfully tagged python:latest
$

 

Now, run the container, mounting the volume containing our images and executing our script:

$ docker run --rm -v scalable-app_image_data:/images  -it python python /fill_queue.py

Number of Images =  60000
Queued  {"image": "/images/CIFAR-10-images/test/dog/0754.jpg", "category": 5, "catName": "dog"}
Queued  {"image": "/images/CIFAR-10-images/test/dog/0985.jpg", "category": 5, "catName": "dog"}
.
.
.
.

 

While this is running, you should see the queued messages increase until it reaches 60,000.

The queued messages reading 60,000.

Now click on the ‘Queues’ link in the RabbitMQ management console, and you will see that those messages are now in the ‘image_queue‘ queue waiting to be requested.

A screenshot indicating the location of image_queue

If you now click on the image_queue link, you’ll get a more detailed view of activity within that queue.

A screenshot showing more details of image_queue

 

Providing a Database Store

Now provision the database environment, which will simply record categorisation results.

In the containers/postgres directory, create a Dockerfile containing the following:

FROM postgres:11.5

COPY pg-setup.sql /docker-entrypoint-initdb.d/

EXPOSE 5432

CMD ["postgres"]

 

In the same directory, create a file called pg-setup.sql containing the following:

CREATE TABLE CATEGORY_RESULTS (
    FNAME         VARCHAR(1024) NOT NULL,
    CATEGORY      NUMERIC(2) NOT NULL,
    PREDICTION    NUMERIC(2) NOT NULL,
    CONFIDENCE    REAL);

 

And build the Postgres container image:

$ docker build -t postgres .
Sending build context to Docker daemon  4.096kB
Step 1/4 : FROM postgres:11.5
---> 5f1485c70c9a
Step 2/4 : COPY pg-setup.sql /docker-entrypoint-initdb.d/
---> e84511216121
.
.
.
Removing intermediate container d600e2f45564
---> 128ad35a028b
Successfully built 128ad35a028b
Successfully tagged postgres:latest
$

 

Start the Postgres service. Note that here we’re mounting a docker volume to hold the persistent data when the container terminates.

$ docker run --name postgres --rm -v scalable-app_db_data:/var/lib/postgresql/data -p 5432:5432 -e POSTGRES_PASSWORD=password -d postgres
dfc9bbffd83de9bca35c54ed0d3f4afd47c0d03f351c87988f827da15385b4e6
$

 

If you now connect to the database, you should see that a table has been created for you. This will contain our categorisation results. Note that the password in this case is 'password', as specified in the POSTGRES_PASSWORD environment variable when starting the container.

$ psql -h localhost -p 5432 -U postgres
Password for user postgres:
psql (11.5)
Type "help" for help.

postgres=# \d
            List of relations
 Schema |       Name       | Type  |  Owner
--------+------------------+-------+----------
 public | category_results | table | postgres
(1 row)
 
postgres=# \d category_results
                Table "public.category_results"
   Column   |          Type           | Collation | Nullable | Default
------------+-------------------------+-----------+----------+---------
 fname      | character varying(1024) |           | not null |
 category   | numeric(2,0)            |           | not null |
 prediction | numeric(2,0)            |           | not null |
 confidence | real                    |           |          |

 

The Classification Process

The final function will request something off the queue, classify it, and record a result. This is the worker process, and it uses a pretrained CIFAR model from Gluon together with the pika library we used earlier to add messages to the RabbitMQ queue. One design principle for this application is that we should be able to scale up the number of classifiers to support demand. This is possible because the queue is accessible by many workers simultaneously. The workers receive messages in a round-robin fashion, meaning that the process can be parallelised to increase throughput.
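To make that concrete, here is a minimal sketch of how several workers could be started once the worker image described below has been built; the image tag (worker) and the volume mount are assumptions based on the rest of this article:

$ for i in 1 2 3; do
    docker run -d --name worker$i -v scalable-app_image_data:/images worker
  done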

In your containers/worker directory, create the following Dockerfile:

FROM ubuntu

RUN apt-get update
RUN apt-get install -y python3 python3-pip

RUN pip3 install --upgrade mxnet gluoncv pika
RUN pip3 install psycopg2-binary

# Add worker logic necessary to process queue items
ADD  worker.py /

# Start the worker
CMD ["python3", "./worker.py" ]

 

Also create a file called worker.py with the following content:

#!/usr/bin/env python

from mxnet import gluon, nd, image
from mxnet.gluon.data.vision import transforms
from gluoncv import utils
from gluoncv.model_zoo import get_model
import psycopg2
import pika
import time
import json

def predictCategory(fname):
    img = image.imread(fname)

    class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

    transform_fn = transforms.Compose([
        transforms.Resize(32), transforms.CenterCrop(32), transforms.ToTensor(),
        transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])
    ])
    img = transform_fn(img)
    net = get_model('cifar_resnet110_v1', classes=10, pretrained=True)

    pred = net(img.expand_dims(axis=0))
    ind = nd.argmax(pred, axis=1).astype('int')
    print('The input picture is classified as [%s], with probability %.3f.'%
        (class_names[ind.asscalar()], nd.softmax(pred)[0][ind].asscalar()))
    return ind.asscalar(), nd.softmax(pred)[0][ind].asscalar()

def InsertResult(connection, fname, category, prediction, prob):
    count=0
    try:
        cursor = connection.cursor()

        qry = """ INSERT INTO CATEGORY_RESULTS (FNAME, CATEGORY, PREDICTION, CONFIDENCE) VALUES (%s,%s,%s,%s)"""
        record = (fname, category, prediction, prob)
        cursor.execute(qry, record)

        connection.commit()
        count = cursor.rowcount

    except (Exception, psycopg2.Error) as error :
        if(connection):
            print("Failed to insert record into category_results table", error)

    finally:
        cursor.close()
        return count

#
# Routine to pull message from queue, call classifier, and insert result to the DB
#
def callback(ch, method, properties, body):
    data = json.loads(body)
    fname = data['image']
    cat = data['category']
    print("Processing", fname)
    pred, prob = predictCategory(fname)
    if (logToDB == 1):
        count = InsertResult(pgconn, fname, int(cat), int(pred), float(prob))
    else:
        count = 1  # Ensure the message is ack'd and removed from queue

    if (count > 0):
        ch.basic_ack(delivery_tag=method.delivery_tag)
    else:
        ch.basic_nack(delivery_tag=method.delivery_tag)

logToDB=1    # Set this to 0 to disable storing data in the database

pgconn = psycopg2.connect(user="postgres", password="password",
                host="<Your host IP>", port="5432", database="postgres")

connection = pika.BlockingConnection(pika.ConnectionParameters(host='<Your host IP>'))
channel = connection.channel()

channel.queue_declare(queue='image_queue', durable=True)
print(' [*] Waiting for messages. To exit press CTRL+C')

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue='image_queue', on_message_callback=callback)

channel.start_consuming()

 

Let’s pick this apart a little. After importing the required libraries, I define a function predictCategory, which takes a filename identifying the image to classify. It uses a pretrained model from the Gluon model zoo and returns a predicted class together with a confidence score.

The next function, InsertResult, writes a single record into the database containing the path of the image being processed, the category it should have been, the category it was predicted to be, and the prediction confidence.

The final function is a callback that pulls these together. It unpacks the message’s JSON payload, calls the function that categorises the image, and then calls the function that records the result. If there are no functional errors, we acknowledge the message (basic_ack) and it is removed from the queue. If there are functional errors, we issue a basic_nack and the message is placed back on the queue, where another worker can pick it up or we can retry it later. This ensures that if a worker process dies or is interrupted for some reason, everything in the queue can eventually be processed.
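
One caveat worth noting: a message that always fails (for example, a corrupt image) would be requeued and retried indefinitely. A minimal sketch of one way to bound this, purely as an illustration and not part of the original worker, is to check RabbitMQ’s redelivered flag and discard a message on its second failure:

# Sketch only: inside the callback, replace the simple basic_nack with
# something like this to avoid endlessly requeueing a poison message.
if count > 0:
    ch.basic_ack(delivery_tag=method.delivery_tag)
elif method.redelivered:
    # This message has already failed once before; drop it rather than loop forever.
    ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
else:
    # First failure; put it back on the queue for another attempt.
    ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)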

There is also a variable logToDB, which you can set to 0 or 1 to disable or enable logging to the database. It might be useful to see whether the database is a significant bottleneck by testing performance with and without logging.
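
If you do experiment with that toggle, you may prefer to read it (and the connection details) from environment variables rather than editing the code each time. A small sketch of that idea, with variable names of our own choosing rather than anything defined in this series:

import os

# Hypothetical variable names; default to the hard-coded values used above.
logToDB = int(os.environ.get("LOG_TO_DB", "1"))
pg_host = os.environ.get("POSTGRES_HOST", "<Your host IP>")
mq_host = os.environ.get("RABBITMQ_HOST", "<Your host IP>")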

I create a connection to the database, a connection to RabbitMQ using the host’s IP address, and a channel using the image_queue queue. Once again, be aware that the host’s IP address will reroute any message requests to the underlying container hosting our RabbitMQ service.

I then wait on messages to appear forever, processing queue items one by one.

$ docker build -t worker .
Sending build context to Docker daemon  5.632kB
Step 1/6 : FROM ubuntu
---> 94e814e2efa8
Step 2/6 : RUN apt-get update
---> Running in 3cbb2343f94f
.
.
.
Step 6/6 : ADD  worker.py /
---> bc96312e6352
Successfully built bc96312e6352
Successfully tagged worker:latest
$

 

We can start a worker to begin the process of categorising our images.

$ docker run --rm -itd -v scalable-app_image_data:/images worker
061acbfcf1fb4bdf43b90dd9b77c2aca67c4e1d012777f308c5f89aecad6aa00
$
$ docker logs 061a
[*] Waiting for messages. To exit press CTRL+C
Processing /images/CIFAR-10-images/test/dog/0573.jpg
Model file is not found. Downloading.
Downloading /root/.mxnet/models/cifar_resnet110_v1-a0e1f860.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/cifar_resnet110_v1-a0e1f860.zip...
6336KB [00:04, 1374.69KB/s]
The input picture is classified as [dog], with probability 0.998.
Processing /images/CIFAR-10-images/test/dog/0057.jpg
The input picture is classified as [dog], with probability 0.996.
Processing /images/CIFAR-10-images/test/dog/0443.jpg
The input picture is classified as [deer], with probability 0.953.
.
.

 

Clearly, it’s not great practice to use a training set as part of a testing process. However, we’re not measuring model effectiveness or accuracy here; we’re simply trying to understand how to categorise many thousands of images in a scalable way, so any images will do regardless of where they came from. The first thing the worker does is download a pretrained model; there’s no need to train one. In your own environment, you might do something similar by using the latest stable model for the data being tested. The worker then takes an item from the queue, categorises it, removes it from the queue, and moves on to the next item.
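
If the initial model download bothers you, one option is to bake the weights into the image at build time so each new worker starts classifying immediately. This is only a sketch of the idea, not part of the original worker; it assumes a small helper script such as the following is copied into the image and run from the Dockerfile (for example with RUN python3 /warm_model.py after the pip installs):

# warm_model.py - hypothetical build-time helper.
# Downloading the pretrained weights here caches them under /root/.mxnet/models,
# so the worker no longer downloads them when it processes its first message.
from gluoncv.model_zoo import get_model

get_model('cifar_resnet110_v1', classes=10, pretrained=True)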

 

If we now query the database, it’s clear that the worker has been busy:

$ psql -h localhost -p 5432 -U postgres
Password for user postgres:
psql (11.5)
Type "help" for help.

postgres=# select * from category_results ;
                   fname                   | category | prediction | confidence
-------------------------------------------+----------+------------+------------
 /images/CIFAR-10-images/test/dog/0826.jpg |        5 |          5 |   0.999194
 /images/CIFAR-10-images/test/dog/0333.jpg |        5 |          5 |   0.992484
.
.
.
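
If you want a quick feel for how well the classifier is doing, you can aggregate those rows directly. The following is a small sketch using psycopg2 with the same connection settings as above; the query itself is our own addition rather than anything from the series:

import psycopg2

conn = psycopg2.connect(user="postgres", password="password",
                        host="localhost", port="5432", database="postgres")
cur = conn.cursor()

# Per-category totals, how often the prediction matched, and the average confidence.
cur.execute("""
    SELECT category,
           COUNT(*) AS total,
           SUM(CASE WHEN category = prediction THEN 1 ELSE 0 END) AS correct,
           AVG(confidence) AS avg_confidence
    FROM category_results
    GROUP BY category
    ORDER BY category;
""")

for category, total, correct, avg_confidence in cur.fetchall():
    print(f"category {category}: {correct}/{total} correct, avg confidence {avg_confidence:.3f}")

cur.close()
conn.close()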

 

Let’s look at the queue itself:

A screenshot showing more details of image_queue

As you can see, there is a processing rate of 12 requests per second. Let’s kick off a couple more workers:

$ for w in 1 2 ; do docker run --rm -itd -v scalable-app_image_data:/images worker; done
ee1732dd3d4a1abcd8ab356262603d8a24523dca237ea1102c3a953c86a221bf
a26c14a28b5605345ed6d09cd4d21d2478d34a8ce22668d0aac37a227af21c3e
$

 

Look at the queue again:

A screenshot showing more details of image_queue

And now the ack rate has increased to 22 per second. You might at this point be thinking that adding more containers is the next logical step. However, you shouldn’t expect linear scalability: RabbitMQ has its own bottlenecks, as do the database and the Python code. There are many public resources that discuss improving RabbitMQ performance, including tuning prefetch counts, clustering, keeping queues short, using multiple queues, or using CPU affinity. For that matter, changing the code to use threads, parallelising certain functions, or even removing the durable flag are also likely to help. This article isn’t going to focus on any of those, so I’ll leave it to you to research what works for your code and scenarios. One other thing you might like to try at some point is a RabbitMQ cluster behind an HAProxy load balancer, which may improve performance. A non-Docker example can be found here.
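
As a concrete example of the sort of experiment you might run, raising the prefetch count lets each worker hold several unacknowledged messages at once, which can help when individual classifications are quick. This is just a sketch of the one-line change in worker.py; whether it helps depends on your workload:

# In worker.py, instead of prefetch_count=1:
channel.basic_qos(prefetch_count=10)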

In any case, let’s convert what we have into a multi-container application using docker-compose. We can then use that as the basis for a Kubernetes environment.

$ tree containers

containers/
├── postgres
│   ├── Dockerfile
│   ├── pg-setup.sql
├── python
│   ├── Dockerfile
│   ├── fill_queue.py
├── rabbitmq
│   └── Dockerfile
└── worker
    ├── Dockerfile
    └── worker.py

 

We can convert all the work done so far into fewer steps with docker-compose and a couple of scripts. In the directory holding the containers directory, create a new file called docker-compose.yml:

version: '3'

services:
    sa_postgres:
        build: containers/postgres
        ports:
            - "5432:5432"
        volumes:
            - scalable-app_db_data:/var/lib/postgresql/data
        environment:
            - POSTGRES_PASSWORD=password

    sa_rabbitmq:
        build: containers/rabbitmq
        hostname: rabbitmq
        ports:
            - 5672:5672
            - 15672:15672
        volumes:
            - scalable-app_mq_log:/var/log/rabbitmq
            - scalable-app_mq_data:/var/lib/rabbitmq

    sa_worker:
        build: containers/worker
        depends_on:
            - sa_postgres
            - sa_rabbitmq
        volumes:
            - scalable-app_image_data:/images
        restart: always
# number of containers?

volumes:
    scalable-app_db_data:
    scalable-app_image_data:
    scalable-app_mq_data:
    scalable-app_mq_log:

 

Now build the composite application:

$ docker-compose build
Building sa_postgres
Step 1/4 : FROM postgres:11.5
---> 5f1485c70c9a
Step 2/4 : COPY pg-setup.sql /docker-entrypoint-initdb.d/
---> 2e57fe31a9ab
Step 3/4 : EXPOSE 5432
---> Running in 6f02f7f92a19
Removing intermediate container 6f02f7f92a19
.
.
.

 

Before you start the composite application, make sure you do a ‘docker ps -a’ to see the currently running containers and stop/remove them. When that’s done, start the application and specify how many worker containers you want to service the queue.

$ docker-compose up -d --scale sa_worker=2
Creating network "scalable-app_default" with the default driver
Creating volume "scalable-app_scalable-app_db_data" with default driver
Creating volume "scalable-app_scalable-app_image_data" with default driver
Creating volume "scalable-app_scalable-app_mq_data" with default driver
Creating volume "scalable-app_scalable-app_mq_log" with default driver
Creating scalable-app_sa_python_1   ... done
Creating scalable-app_sa_postgres_1 ... done
Creating scalable-app_sa_rabbitmq_1 ... done
Creating scalable-app_sa_worker_1   ... done
Creating scalable-app_sa_worker_2   ... done
.
.

 

There are a few things to note here. First, there is now a network shared between all containers, so we no longer have to refer to our host network within the code; we can change our hostnames to refer to the other containers directly. Second, when we start and stop our application, everything is brought up and taken down together, and where needed in an order that respects dependencies. Lastly, the constituent volumes and images are created with names prefixed by the application name, which helps identify how they’re used and avoids conflicts with other resources.

Let’s bring the service down and make those changes.

$ docker-compose down
Stopping scalable-app_sa_worker_2   ... done
Stopping scalable-app_sa_worker_1   ... done
Stopping scalable-app_sa_rabbitmq_1 ... done
Stopping scalable-app_sa_postgres_1 ... done
Removing scalable-app_sa_worker_2   ... done
Removing scalable-app_sa_worker_1   ... done
Removing scalable-app_sa_rabbitmq_1 ... done
Removing scalable-app_sa_postgres_1 ... done
Removing scalable-app_sa_python_1   ... done
Removing network scalable-app_default
$

 

In the containers/worker/worker.py file, make the following changes to your host identifiers:

.
.
logToDB=1    # Set this to 0 to disable storing data in the database

pgconn = psycopg2.connect(user="postgres", password="password",
                          host="sa_postgres", port="5432", database="postgres")

connection = pika.BlockingConnection(pika.ConnectionParameters(host='sa_rabbitmq'))
channel = connection.channel()
.
.

 

In your containers/python/fill_queue.py file, change your hostname:

HOSTNAME="sa_rabbitmq"

 

And restart again:

$ docker-compose up -d --scale sa_worker=2
Creating network "scalable-app_default" with the default driver
Creating volume "scalable-app_scalable-app_db_data" with default driver
Creating volume "scalable-app_scalable-app_image_data" with default driver
Creating volume "scalable-app_scalable-app_mq_data" with default driver
Creating volume "scalable-app_scalable-app_mq_log" with default driver
Creating scalable-app_sa_python_1   ... done
Creating scalable-app_sa_postgres_1 ... done
Creating scalable-app_sa_rabbitmq_1 ... done
Creating scalable-app_sa_worker_1   ... done
Creating scalable-app_sa_worker_2   ... done
.
.

 

You can now populate the message queue with images to process. The following script mounts the image volume on a temporary container, copies the images to the volume, and then starts a process to populate the queue.

# clone the CIFAR images, if they're not already there
if [ ! -d "CIFAR-10-images" ]; then
    git clone https://github.com/YoongiKim/CIFAR-10-images.git
fi

# Start a small container to hold the images
CID=$(docker run --rm -itd -v scalable-app_scalable-app_image_data:/images alpine)
echo "Copying content to container $CID:/images"

# Copy the content
docker cp CIFAR-10-images $CID:/images
docker stop $CID

docker run --rm -v scalable-app_scalable-app_image_data:/images  -it python python /fill_queue.py

 

And as expected, we can see that the queue is both being populated and being processed by the worker containers running in the background.

A screenshot showing more details of image_queue, and that it is now processing

 

Conclusions

This post outlined how to containerise a multi-component application reflecting a typical data science classification process: it ingests images, provides a scalable mechanism for classifying them, and records the results. As mentioned, the focus here is not on good data science practice, or even good containerisation practice, but on the options available to support learning about containerisation with a data science frame of reference.

This post will be used as a foundation for the next part in this series, which will be to convert it to use Kubernetes and PaaS services.

 

About the author

Jon is a Microsoft Cloud Solution Architect specialising in Advanced Analytics & Artificial Intelligence with over 30 years of experience in understanding, translating and delivering leading technology to the market. He currently focuses on a small number of global accounts helping align AI and Machine Learning capabilities with strategic initiatives. He moved to Microsoft from IBM where he was Cloud & Cognitive Technical Leader and an Executive IT Specialist.

Jon has been the Royal Academy of Engineering Visiting Professor for Artificial Intelligence and Cloud Innovation at Surrey University since 2016, where he lectures on various topics, from machine learning and design thinking to architectural thinking.

The post Building Scalable Data Science Applications using Containers – Part 5 appeared first on Microsoft Industry Blogs - United Kingdom.

]]>
Using Containers to run R/Shiny workloads in Azure: Part 4 http://approjects.co.za/?big=en-gb/industry/blog/technetuk/2019/10/03/using-containers-to-run-r-shiny-workloads-in-azure-part-4/ http://approjects.co.za/?big=en-gb/industry/blog/technetuk/2019/10/03/using-containers-to-run-r-shiny-workloads-in-azure-part-4/#comments Thu, 03 Oct 2019 09:00:02 +0000 In the parts one, two and three of this series, I described how to build containers using Dockerfiles, then how to share and access them from Azure. In this part, we’ll move briefly away from Python to look at R together with Shiny as a dynamic reporting and visualisation capability pulling data from a Postgres database.

The post Using Containers to run R/Shiny workloads in Azure: Part 4 appeared first on Microsoft Industry Blogs - United Kingdom.

]]>
An image of a cloud, surrounded by images of different cloud services, with a picture of Bit the Raccoon to the right.

In parts one, two and three of this series, I described how to build containers using Dockerfiles, then how to share and access them from Azure. I then introduced data persistence using managed volumes and shared file systems, effectively developing locally with a globally accessible persistent state. Finally, I showed how to deploy multi-container applications using docker-compose.

In this part, we’ll move briefly away from Python to look at R together with Shiny as a dynamic reporting and visualisation capability pulling data from a Postgres database. R is very popular with a number of clients I work with, and they also have an interest in being able to move from on-premises environments to a containerised deployment.

I wrote this as a simple guide because I’ve had a number of requests for guidance on how to test this exact pattern within clients’ own Azure environments. As with previous parts of this series, I won’t be focusing on best practice at this point; I’m more concerned with showing more moving parts, each of which can act as a learning point.

By the end of this blog, you’ll be able to create your own R/Shiny container that pulls data from an Azure Postgres service, and then access that container in a remote Azure deployment from a browser.

The first thing we’re going to do is to create a resource group to host everything. I’m going to call this rshiny-rg and host it in eastus.

$ az group create --name rshiny-rg --location eastus
{
    "id": "/subscriptions//resourceGroups/rshiny-rg",
    "location": "eastus",
    "managedBy": null,
     "name": "rshiny-rg",
    "properties": {
        "provisioningState": "Succeeded"
    },
    "tags": null,
    "type": null
}

Next, we’ll need to create a postgres database to hold some candidate data. Rather than manage our own Postgres server, I’m going to use Azure’s own Postgres service. First, create the Postgres server, provide it with an admin username and password, a server size and a version.

$ az postgres server create --resource-group rshiny-rg --name shiny-pg --location eastus --admin-user 
jonadmin --admin-password [MyPassword] --sku-name B_Gen5_1 --version 9.6

{
    "administratorLogin": "jonadmin",
    .
    "fullyQualifiedDomainName": "shiny-pg.postgres.database.azure.com",
    .
    .
    "sslEnforcement": "Enabled",
    .
}
$

There are some initial security checks to ensure that you’re not using standard administrator names (e.g. admin) or trivial passwords. Keep a note of these credentials, as you’ll need them later in this exercise.

Note that I used a very small basic server (B_Gen5_1) to host this. With the size of data being stored, and our performance expectations for this exercise, we only need a single core. It’s also very inexpensive, at around $0.03/hour at the time of writing.

Note the fully qualified domain name for your server, because we’ll need it to connect later.

We also now need to create a firewall rule so we can interact with the server from an external environment. You will need to find out your IP address for this. This is the IP address that the internet thinks you have (not the one you might have on your machine). If you don’t know it, you can continue, wait for the error message later, and come back to this point.
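
If you’re unsure of that address, one quick way to check it is to ask a public “what is my IP” service. The snippet below is just a convenience sketch; it assumes Python with the requests library is available and that the ipify service is reachable from your machine:

import requests

# Prints the address the internet sees for you, which is what the firewall rule needs.
print(requests.get("https://api.ipify.org").text)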

$ az postgres server firewall-rule create --resource-group rshiny-rg --server shiny-pg --name AllowMyIP 
--start-ip-address x.x.x.x --end-ip-address x.x.x.x

It should come back with a confirmation that the rule has been set up:

{
    "endIpAddress": "x.x.x.x",
    "id": "/subscriptions//resourceGroups/rshiny-rg/providers/Microsoft.DBforPostgreSQL/servers/shiny-pg/firewallRules/AllowMyIP",
    "name": "AllowMyIP",
    "resourceGroup": "rshiny-rg",
    "startIpAddress": "x.x.x.x",
    "type": "Microsoft.DBforPostgreSQL/servers/firewallRules"
}
$

I happen to have a Postgres client on my Mac, but if you don’t, you can get one from here. When I connect, it’s clear that I can reach the Postgres server, and it prompts me for my password:

$ psql --host=shiny-pg.postgres.database.azure.com --port=5432 --username=jonadmin@shiny-pg --dbname=postgres

Password for user jonadmin@shiny-pg:
psql (11.5, server 9.6.14)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

postgres=>

There are a couple of things to note here:

  1. Your username will be the name you provided earlier, followed by an @ and then your server name. In my case, the username is jonadmin@shiny-pg.
  2. You’re connecting to the external host over SSL. Therefore, your Postgres client needs to support that.

We need to create a database to hold our data and we’re going to use the Northwind database, which has more typically been used for Access and SQL Server. In this case, we’re going to use the same content for Postgres. There are a number of places to find this, but I found two files that gave me what I needed.

Please look at the DDL and INSERT statements to ensure you’re happy to use them.

Let’s create the database:

postgres=> create database northwind;
CREATE DATABASE
postgres=>

You should be able to see that this exists by using the ‘\l’ command:

A screenshot showing that the database has been created.

Now let’s connect to it using ‘\c’ followed by the database name:

postgres=> \c northwind;
psql (11.5, server 9.6.14)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
You are now connected to database "northwind" as user "jonadmin@shiny-pg".
northwind=>

And now let’s create the schema and load the content. I placed the northwind_ddl.sql and northwind_data.sql files in the same directory where I executed my psql command. I can therefore import the DDL commands using the ‘\i’ command, and then import the data the same way. Note that the data import will take a bit more time to complete.

northwind=> \i northwind_ddl.sql
SET
.
.
SET
DROP TABLE
.
.
DROP TABLE
CREATE TABLE
.
.
CREATE TABLE
northwind=>

northwind=> \i northwind_data.sql
SET
.
.
SET
INSERT 0 1
.
.
INSERT 0 1
northwind=> \q
$

When it’s done, you can quit using ‘\q’.

Now we’ll create a containerised R/Shiny server. We’ll initially do this on our local machine and then move the entire container to the cloud. I’ll use a very simple application that takes input from some sliders and produces a tabular output and a histogram based on those values. This is going to be dynamic, so when the sliders are moved, the data on the screen will also update. Let’s look at the files we’re using:

r-shiny/
├── App
│    ├── global.R
│    ├── server.R
│    └── ui.R
└── Dockerfile

Within the r-shiny directory, we have a Dockerfile, and a sub-directory called App, which contains the actual Shiny application.

The application is composed of three parts:

1. A global.R file defines libraries, our database connection details, and some dynamic SQL queries based on slider values.

library(RPostgreSQL)
library(DT)
library(plotly)
library(rjson)
library(pool)

pool <- dbPool(
    drv = dbDriver("PostgreSQL", max.con = 100),
    dbname = "northwind",
    host = "shiny-pg.postgres.database.azure.com",
    user = "jonadmin@shiny-pg",
    password = "[Your Password Here]",
    idleTimeout = 3600000,
    minSize = 5
)

freight_maxmin <- c(round(dbGetQuery(pool, "SELECT MAX(freight), MIN(freight) from orders;"), 2))
order_id_maxmin <- c(dbGetQuery(pool, "SELECT MAX(order_id), MIN(order_id) from orders;"))

2. A server.R file refreshes the tabular data and the histogram whenever the sliders change.

function(input, output, session) {
    output$table <- DT::renderDataTable({
        SQL <- paste(
                     "SELECT order_id, customer_id, order_date, freight from orders ",
                     "WHERE order_id BETWEEN ?oid_min AND ?oid_max AND freight BETWEEN ?min and ?max;",
                     sep=" ")
        query <- sqlInterpolate(ANSI(), SQL,
                            oid_min = input$order_id_selector[1], oid_max = input$order_id_selector[2],
                            min = input$freight_selector[1], max = input$freight_selector[2])
        outp <- dbGetQuery(pool, query)
        ret <- DT::datatable(outp)
        return(ret)
    })

    output$distPlot <- renderPlot({
        SQL <- paste(
                     "SELECT freight from orders ",
                     "WHERE order_id BETWEEN ?oid_min and ?oid_max and freight BETWEEN ?min and ?max;",
                     sep=" ")

        histQry <- sqlInterpolate(ANSI(), SQL,
                              oid_min = input$order_id_selector[1], oid_max = input$order_id_selector[2],
                              min = input$freight_selector[1], max = input$freight_selector[2])
        histOp <- dbGetQuery(pool, histQry)

        freight_cost <- histOp$freight
        bins <- seq(min(freight_cost), max(freight_cost), length.out = 11) 
        hist(freight_cost, breaks = bins, col = 'darkgray', border = 'white')
   }) 
}

3. A ui.R file describes the user interface composed of two slider inputs, a panel for the tabular output, and a histogram showing the distribution of freight costs.

fluidPage( 
    sidebarLayout( 
        sidebarPanel( 
            sliderInput("order_id_selector","Select Order ID", 
                min = order_id_maxmin$min, 
                max = order_id_maxmin$max, 
                value = c(order_id_maxmin$min, order_id_maxmin$max), step = 1), 
        sliderInput("freight_selector","Select Freight Ranges", 
                min = freight_maxmin$min, 
                max = freight_maxmin$max, 
                value = c(freight_maxmin$min, freight_maxmin$max), step = 1) 
        , plotOutput("distPlot", height=250) 

    ), mainPanel( 
        DT::dataTableOutput("table") 
    ) 
  ) 
)

Now let’s look at our Dockerfile:

FROM rocker/shiny-verse:latest 

RUN apt-get update && apt-get install -y \ 
    sudo \ 
    pandoc \ 
    pandoc-citeproc \ 
    libcurl4-gnutls-dev \ 
    libcairo2-dev \ 
    libxt-dev \ 
    libssl-dev \ 
    libssh2-1-dev 

RUN R -e "install.packages(c('shinydashboard','shiny', 'plotly', 'dplyr', 'magrittr', 'RPostgreSQL', 'DT', 'rjson', 'pool'))" 
RUN R -e "install.packages('gpclib', type='source')" 
RUN R -e "install.packages('rgeos', type='source')" 
RUN R -e "install.packages('rgdal', type='source')" 

COPY ./App /srv/shiny-server/App 

EXPOSE 3838 

RUN sudo chown -R shiny:shiny /srv/shiny-server 

CMD ["/usr/bin/shiny-server.sh"]

We start from a base image of rocker/shiny-verse. After installing some software as well as some R libraries, we then copy our application code from the local App directory to the /srv/shiny-server/App directory. This means that we’ll be able to access this from the App directory within the browser. We expose port 3838 and then start our shiny server.

First build the container image.

$ docker build -t r-shiny r-shiny 

Sending build context to Docker daemon 6.656kB 
Step 1/10 : FROM rocker/shiny-verse:latest 
---> 1d686b061097
.
.
.
Step 9/10 : RUN sudo chown -R shiny:shiny /srv/shiny-server
---> Running in 1b84bf06dfd2
Removing intermediate container 1b84bf06dfd2
---> 474a71119ccf
Step 10/10 : CMD ["/usr/bin/shiny-server.sh"]
---> Running in 98f5a8d6240a
Removing intermediate container 98f5a8d6240a
---> 6032d43c7703
Successfully built 6032d43c7703
Successfully tagged r-shiny:latest
$

With the container image built, we can now start it.

$ docker run --rm -d -p 3838:3838 r-shiny
8485ff49a982f479acb279359c15311fc61c9e27b277e597f3a57220bdd856e4
$

If I now go to a browser page, pointing at http://localhost:3838/App, I see the following:

A table and graph that shows every order in the database, as well as ways of filtering the data.

This shows every order in our database, but we can also play with the sliders in the panel on the left, which dynamically changes the content of interest, showing both different orders and a different histogram of freight costs. Note that this is a very simple application with no error checking. So, if you play with the values, there will be times when the data returned is insufficient to provide histogram data. Ignore any errors and try other ranges.

A table and graph that shows every order in the database, as well as ways of filtering the data. The sliders on the left have been adjusted to change the content of interest.

But we’re not finished yet. While the data sits in Azure, the container is still running on my laptop and I’d like everything to be held in Azure so that it is accessible to multiple people.

Let’s first create a container registry. Note the loginServer value; you’ll need it later.

$ az acr create --resource-group rshiny-rg --name jmshinyreg --sku Basic --admin-enabled

{
"adminUserEnabled": true,
.
"loginServer": "jmshinyreg.azurecr.io",
.
"tags": {},
"type": "Microsoft.ContainerRegistry/registries"
}

Now tag the container image we just built against our Azure registry. This allows us to push it to the cloud.

$ docker tag r-shiny jmshinyreg.azurecr.io/shiny:1.0
$ docker images
REPOSITORY                    TAG      IMAGE ID       CREATED          SIZE
r-shiny                       latest   250764631cac   29 minutes ago   1.91GB
jmshinyreg.azurecr.io/shiny   1.0      250764631cac   29 minutes ago   1.91GB

Log in to Azure Container Registry and push the tagged image to it. Then confirm that the image is there by querying the repository

$ az acr login --name jmshinyreg
Login Succeeded
$
$ docker push jmshinyreg.azurecr.io/shiny:1.0
The push refers to repository [jmshinyreg.azurecr.io/shiny]
54a8dd859e33: Pushed
a43915702c3c: Pushed
.
.
.
1fe356c64d3b: Pushed
e2a8a00a83b2: Pushed
1.0: digest: sha256:df221659fee38be8930b7740a903c4cdc7c61173f2d65da8fe786e0af5497ca5 size: 3268
$

$ az acr repository list --name jmshinyreg --output table
Result
--------
shiny
$

In Part 1 of this series, I shared a containerised Linux host through ssh. Now we’re exposing a browser-based application. In order to allow containers to be created, we’ll need a service principal. The following script creates one for us and provides us with credentials. Modify the values of ACR_NAME and SERVICE_PRINCIPAL_NAME; I used jmshinyreg and jm-shiny-sp for mine.

#!/bin/bash

# Modify for your environment.
# ACR_NAME: The name of your Azure Container Registry
# SERVICE_PRINCIPAL_NAME: Must be unique within your AD tenant
ACR_NAME="jmshinyreg"
SERVICE_PRINCIPAL_NAME=jm-shiny-sp

# Obtain the full registry ID for subsequent command args
ACR_REGISTRY_ID=$(az acr show --name $ACR_NAME --query id --output tsv)

# Create the service principal with rights scoped to the registry.
# Default permissions are for docker pull access. Modify the '--role'
# argument value as desired:
# acrpull: pull only
# acrpush: push and pull
# owner: push, pull, and assign roles
SP_PASSWD=$(az ad sp create-for-rbac --name http://$SERVICE_PRINCIPAL_NAME --scopes $ACR_REGISTRY_ID --role acrpull --query password --output tsv)
SP_APP_ID=$(az ad sp show --id http://$SERVICE_PRINCIPAL_NAME --query appId --output tsv)

# Output the service principal's credentials; use these in your services and
# applications to authenticate to the container registry.
echo "Service principal ID: $SP_APP_ID"
echo "Service principal password: $SP_PASSWD"

When I run this, I’m returned a service principal ID and a password. I’ve obscured mine for obvious reasons, but you will need these later, so make a copy.

Service principal ID: 6xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx2
Service principal password: 6xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxf

Now let’s create and deploy a container, which I’ve called rshiny. I’ve given it a relatively unique DNS name label, jm-shiny-svr, which will form part of the fully qualified name later.

$
az container create --resource-group rshiny-rg --name rshiny --image jmshinyreg.azurecr.io/shiny:1.0 --cpu 1 --registry-login-server jmshinyreg.azurecr.io --dns-name-label jm-shiny-svr --ports 3838 --registry-username 4xxxxxxxx-xxxxx-xxxxx-xxxxx-xxxxxxxxxxxx1 --registry-password fxxxxxxxx-xxxxx-xxxxx-xxxx-xxxxxxxxxxx5
{
.

.
}
$

The container’s fully qualified domain name is part of that JSON response, but you can also find the fully qualified DNS name for your new container using the following:

$ az container show --resource-group rshiny-rg --name rshiny --query ipAddress.fqdn
"jm-shiny-svr.eastus.azurecontainer.io"
$

So, let’s go to the container, which is now running in the cloud. Remember that you need to reference port 3838, and that the shiny application is underneath the App directory.

A screenshot of an error message that says an error has occurred and the application failed to start.

What happened? Fortunately, we can understand this in a little more detail using the az container logs command.

$ az container logs --resource-group rshiny-rg --name rshiny

.
.
.

[2019-08-26T09:20:22.819] [INFO] shiny-server - Error getting worker: Error: The application exited during initialization.
Error in postgresqlNewConnection(drv, ...) :
RS-DBI driver: (could not connect jonadmin@shiny-pg@shiny-pg.postgres.database.azure.com:5432 on dbname "northwind": FATAL: no pg_hba.conf entry for host "40.76.197.161", user "jonadmin", database "northwind", SSL on
FATAL: SSL connection is required. Please specify SSL options and retry.
)
Calls: runApp ... -> -> postgresqlNewConnection
Execution halted

It seems that because the container can’t reach the Postgres server, the application failed. This is easy to address by simply providing a firewall rule to allow entry to the IP address referenced in the error log.

$ az postgres server firewall-rule create --resource-group rshiny-rg --server shiny-pg --name AllowMyIP --start-ip-address 40.76.197.161 --end-ip-address 40.76.197.161
{
"endIpAddress": "40.76.197.161",
"id": "/subscriptions//resourceGroups/rshiny-rg/providers/Microsoft.DBforPostgreSQL/servers/shiny-pg/firewallRules/AllowMyIP",
"name": "AllowMyIP",
"resourceGroup": "rshiny-rg",
"startIpAddress": "40.76.197.161",
"type": "Microsoft.DBforPostgreSQL/servers/firewallRules"
}

If you now refresh the page, you should see a working app.

A screenshot of the data in the database with different filtering options, with the URL of the page highlighted with a red arrow.
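
If you’d like to script a quick smoke test of the deployed app rather than checking it in a browser, a small sketch (assuming Python with the requests library; substitute your own FQDN) could look like this:

import requests

# The FQDN below is the one returned by 'az container show' earlier; use your own.
url = "http://jm-shiny-svr.eastus.azurecontainer.io:3838/App/"

response = requests.get(url, timeout=30)
print(response.status_code)   # expect 200 once the container and database are reachable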

Once you’ve finished playing with this, you may like to clean up your environment by removing the resource group holding your Postgres database and container registry.

$ az group delete --name rshiny-rg

 

Conclusion

This concludes the fourth part of our series. We diverted from a pure Python approach and touched on containerising R payloads. In addition, we started integrating with PaaS services such as Postgres and looked at some of the implications including security.

There is still more to focus on in this series including being able to scale containers out using capabilities such as Kubernetes. In addition, I’ll demonstrate how to integrate Cognitive services and start considering the operationalisation of the data science pipeline.

The post Using Containers to run R/Shiny workloads in Azure: Part 4 appeared first on Microsoft Industry Blogs - United Kingdom.

]]>
http://approjects.co.za/?big=en-gb/industry/blog/technetuk/2019/10/03/using-containers-to-run-r-shiny-workloads-in-azure-part-4/feed/ 1
How to use containers in data science with Docker and Azure: Part 3 http://approjects.co.za/?big=en-gb/industry/blog/cross-industry/2019/06/07/how-to-use-containers-in-data-science-with-docker-and-azure-part-3/ Fri, 07 Jun 2019 09:00:03 +0000 In part 3 of this series we will extend the container, persistence, and data science concept using multiple containers to create a more complex application.

The post How to use containers in data science with Docker and Azure: Part 3 appeared first on Microsoft Industry Blogs - United Kingdom.

]]>
An illustration depicting how containers work, with a picture of Bit the Raccoon on the right.

In the first two parts of this series, I described how to build containers using Dockerfiles, and then how to share and access them from Azure. I then introduced data persistence using managed volumes and shared file systems, effectively developing locally with a globally accessible persistent state.

In this part, we’ll extend the container, persistence, and data science concept using multiple containers to create a more complex application. We’ll combine Python, a database, and an external service (Twitter) as a basis for social analysis. We’ll package these components into a docker application and move this to Azure.

Multi-Container Applications

If you’re using containers, you are effectively using small, self-contained services that, when combined with other containers, provide greater flexibility than when all the services are held and managed within a single large virtual machine. How, then, do you ensure that these containers are treated as part of a single larger application? That they all start up together, in the right order, and are shut down as a single unit? This is where docker-compose is useful.

Docker Compose is a tool that manages multi-container applications. It uses a docker-compose.yml file to define constituent containers, services, storage volumes, container behaviour, Dockerfiles, common configuration and data files (among other things) – together encompassing a multi-service application. A benefit of this approach is that the application is built, started, stopped, and removed using a single command. Another benefit is that these containers get added to a common network (and local DNS service), so it is possible for each container service to refer to the others simply by their container name.

If you don’t have docker-compose in your docker environment, you’ll need to install it.

Before we start, let’s create a working directory for our application with some predefined directories and files. We’ll modify each of the files as we go.

$ mkdir -p config containers/jupyter
$ touch config/jupyter.env containers/jupyter/Dockerfile docker-compose.yml
$ tree
.
├── config
│   └── jupyter.env
├── containers
│   └── jupyter
│       └── Dockerfile
└── docker-compose.yml

Let’s start with the docker-compose.yml file:

version: '3'
services:
  jon_jupyter:
    build: containers/jupyter
    ports:
      - "8888:8888"
    volumes:
      - jupyter_home:/home/jovyan
    env_file:
      - config/jupyter.env
  jon_mongo:
    image: mongo
    ports:
      - "27017:27017"
    volumes:
      - mongo_data:/data/db
volumes:
  mongo_data:
  jupyter_home:

The version here relates to Docker Compose syntax. All versions are backward compatible. If no version is specified, then Version 1 is used.

This will build an application comprised of two containers:

  1. jon_jupyter will be built based on the Dockerfile contained under the local directory containers/jupyter. It exposes port 8888 and locally mounts a managed Docker volume called jupyter_home to an internal container directory called /home/jovyan. An environment file called jupyter.env contains a number of variables in the format VAR=VAL to allow us to parameterise scripts and reference variables within containers.
  2. jon_mongo is based on a standard mongo image. It mounts a Docker managed volume mongo_data onto an internal container directory called /data/db, which is mongodb’s default data file location. I’ve also exposed port 27017 on this so that I can use tools on my host machine to query the database.

Lastly, the volumes mongo_data and jupyter_home will be created automatically if they don’t already exist.

Here are the contents of containers/jupyter/Dockerfile:

FROM jupyter/scipy-notebook
USER root
RUN conda install --yes --name root python-twitter pymongo
RUN conda install --yes --name root spacy numpy
RUN python -m spacy download en_core_web_sm
RUN python -m spacy download en
USER jovyan
ENTRYPOINT ["jupyter", "notebook"]

We could build our Python environment from scratch, including the underlying operating system, library configurations, and then selected Python packages. What I’ve done here, however, is base mine on a pre-configured data science environment. For this, we’ll start with a SciPy environment using docker-stacks. Jupyter Docker Stacks provide ready-to-run Docker images containing Jupyter applications and interactive computing tools, where many of the necessary package and library combinations have already been thought through. We then install some additional Python packages and start our Jupyter notebook service.

We use docker-compose to build and start the constituent containers. The ‘-d’ flag starts this as a detached service.

$ docker-compose up -d --build
Building jon_jupyter
.
.
.
Successfully built e95ac5aefb45
Successfully tagged mc-1_jon_jupyter:latest
mc-1_jon_jupyter_1 is up-to-date
mc-1_jon_mongo_1 is up-to-date

 $ docker-compose ps
       Name                    Command           State   Ports
-------------------------------------------------------------------------------
mc-1_jon_jupyter_1  jupyter notebook             Up     0.0.0.0:8888->8888/tcp
mc-1_jon_mongo_1    docker-entrypoint.sh mongod  Up     0.0.0.0:27017->27017/tcp

 $ docker exec mc-1_jon_jupyter_1 jupyter notebook list
Currently running servers:
http://0.0.0.0:8888/?token=a5b519010b0a37d52129f4f5084210b18fb7e14798ea586b :: /home/jovyan
$

Once the application has been built, I then check the status of each container using the docker-compose ps command – each service has a status of up.

I also query the Jupyter container for the token associated with the Jupyter notebook. I can use the URL returned (http://0.0.0.0:8888/?token=a5b519010b0a37d52129f4f5084210b18fb7e14798ea586b) to access the service directly from a browser on my host machine. Clicking on this should also open a browser with the service running.

We’re going to extract some content from Twitter, so before you continue, you’ll need some API credentials to permit this. Here is a good step-by-step guide on how to get these. When you have these, place them in your config/jupyter.env file.

Here is my jupyter.env file. For obvious reasons, I’ve hidden the values. For the same reason, we’re going to use environment variables to reference these rather than hard-code them in our notebook. This approach assumes that your code might be stored in, say, a GitHub account, but that the values themselves are only available within your relatively secure container. If you pull that notebook, it can then automatically use your credentials. In a later part of this series, I’ll describe how to use Azure Key Vault to store and access sensitive data much more securely:

API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxx
API_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ACCESS_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ACCESS_SECRET=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The docker-compose.yml file defines these as part of the Jupyter environment, and not the Mongo environment. If you wanted to share common environment variables, you could reference a common file in an env_file section within each container service. Let’s also confirm that those environment variables are present for us to use:

$ docker ps
CONTAINER ID     IMAGE               COMMAND                  CREATED       STATUS          . . .
51e59897b9d5     mc-1_jon_jupyter    "jupyter notebook"     15 seconds ago   Up 7 seconds   . . .
a8df6b8d330a     mongo               "docker-entrypoint.s…" 15 seconds ago   Up 7 seconds   . . .
$
$ docker exec -it 51e env | grep -iE "api|access"
API_KEY= xxxxxxxxxxxxxxxxxxxxxxxxx
API_SECRET= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ACCESS_TOKEN= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
ACCESS_SECRET= xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
$

From the main Jupyter page, let’s create a new Python 3 notebook and we can start to work with all our services.

In my new notebook, I assign local notebook variables to the environment variables that were created as part of the container build process:

Assigning local notebook variables to the environment variables that were created as part of the container build process.

I can now use these credentials to connect to the Twitter service. VerifyCredentials() shows that I have successfully connected.

VerifyCredentials() showing that I have successfully connected.
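
The notebook cells themselves are only shown as screenshots above, but they amount to something like the following sketch, assuming the python-twitter package installed in our Dockerfile:

import os
import twitter   # the python-twitter package

# Pull the credentials from the environment variables baked into the container.
API_KEY = os.environ['API_KEY']
API_SECRET = os.environ['API_SECRET']
ACCESS_TOKEN = os.environ['ACCESS_TOKEN']
ACCESS_SECRET = os.environ['ACCESS_SECRET']

api = twitter.Api(consumer_key=API_KEY,
                  consumer_secret=API_SECRET,
                  access_token_key=ACCESS_TOKEN,
                  access_token_secret=ACCESS_SECRET)

print(api.VerifyCredentials())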

Let’s progress one step further. From the first container (hosting Jupyter/scipy etc.), I connect to the second container holding the Mongo service. I search Twitter for 100 tweets containing the word ‘Humous’ and insert them into the database.

Adding 100 tweets containing the word ‘Humous’ into the database

There are a few things to note here. Firstly, before I run this, there is no database called twitter_db nor a collection called tweets; these are created after the first call to insert_one(). Secondly, the Mongo client is connecting to the database service on ‘jon_mongo’. It knows how to find that network point because docker-compose placed both containers inside a local network, allowing each of them to refer to the other by its service name (from the docker-compose.yml file).
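
Again, the screenshot isn’t reproduced here, but the cell looks roughly like this sketch. It reuses the api object from the previous snippet and assumes python-twitter’s GetSearch call together with the pymongo client:

import pymongo

client = pymongo.MongoClient("mongodb://jon_mongo:27017/")   # service name, not an IP
db = client["twitter_db"]        # created on the first insert
tweets = db["tweets"]            # likewise created on first use

# Store the full payload of 100 matching tweets.
for status in api.GetSearch(term="Humous", count=100):
    tweets.insert_one(status.AsDict())

print(tweets.count_documents({}), "tweets stored")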

I’ll now pull those tweets from the database and apply some basic textual analyses. But first, I’ll create a simple function that identifies nouns, verbs, and entities within text.

I can now analyse each tweet as they’re read from the database:

Showing that the tweets are now being analysed as they're being read.
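
The analysis cells aren’t reproduced either, but a simple version of that helper function, using the spaCy model installed in the Dockerfile and the tweets collection from the sketch above, might look like this:

import spacy

nlp = spacy.load("en_core_web_sm")

def analyse(text):
    # Return the nouns, verbs and named entities spaCy finds in a piece of text.
    doc = nlp(text)
    nouns = [t.text for t in doc if t.pos_ == "NOUN"]
    verbs = [t.text for t in doc if t.pos_ == "VERB"]
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return nouns, verbs, entities

for tweet in tweets.find():
    print(analyse(tweet.get("text", "")))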

This is a very simple example, but you can see that from here we can extend the analysis. We stored the entire tweet payload, which includes a lot of additional metadata, not just the textual content. So, we may also want to consider what a network graph of humous eaters looks like. Geographically, where do people refer to humous in a positive or negative manner? Is there a Big-Five personality grouping for them? That is perhaps something for another blog, but I think you can see that the foundations for these sorts of questions are now in place, and we’ve been able to combine completely different services packaged in self-contained environments (containers).

Let’s now transfer this environment to a cloud-based Docker host. There are other ways of achieving this, for example using Kubernetes, but we’ll cover Kubernetes in a later part of this series. The idea here is to show a very simple way of making a cloud-based data science service available based on a pattern that you already know works well in-house.

If you have been following along to the first two parts of this series and already have a resource group set aside for this project, then feel free to use it. Otherwise, it’s advisable to create a new resource group so that you can ring-fence the activities for this series.

In my case, I created a resource group called docker-rg

$ az group create --name docker-rg --location eastus
{
    "id": "/subscriptions/<subscription-id>/resourceGroups/docker-rg",
    "location": "eastus",
    .
    .
}
$

Now create a file called docker-init.txt with a single line in it:

#include https://get.docker.com

This provides everything you need to build a Docker environment within a virtual machine. Now we’ll provision a virtual machine with az vm create to hold that Docker environment and open a port to allow you to access it remotely. You’ll need to note the value of publicIpAddress. I’m choosing a default VM configuration here, but you can customise this to add more memory, disk or CPU as you wish. At the time of writing this, I can create a VM of size ‘B1ls’ with 1 CPU and 512MB of RAM for under $4 per month.

One nice thing with standard Azure VMs is that they come with a number of pre-configured services such as ssh already installed and running.

$ az vm create --resource-group docker-rg --name jm-docker-vm --image UbuntuLTS --admin-username jon --generate-ssh-keys --custom-data docker-init.txt

{
    .
    .
    "publicIpAddress": "xxx.xxx.xxx.xxx",
    "resourceGroup": "docker-rg",
    .
}
$
$ az vm open-port --port 8888 --resource-group docker-rg --name jm-docker-vm
{
    .
    .
}
$

Now ssh into the remote machine using the publicIpAddress from earlier and then install compose into it. We also have the option of providing an externally available Fully Qualified Domain Name (FQDN). Note that if we shut down and restart our VM, the public IP address is likely to change and we’ll have to rediscover it. Giving the VM an FQDN means you can reference it irrespective of what its address is. More information about how to set this can be found here.

$ ssh jon@<your public IP address>
.
.
.
$ sudo apt install docker-compose
.
.
$

Before we can move our application to the cloud, we’ll need to back up the local environment. All we really need are the contents of the storage volumes and the configuration items. Remember that the containers, volumes, networks etc. will all be rebuilt from scratch; this is the beauty of a containerised approach. We don’t need to back any of that up.

Stop each of the running containers, noting the container names.

$ docker-compose ps
       Name                    Command             State            Ports
-----------------------------------------------------------------------------------
mc-1_jon_jupyter_1   jupyter notebook              Up      0.0.0.0:8888->8888/tcp
mc-1_jon_mongo_1     docker-entrypoint.sh mongod   Up      0.0.0.0:27017->27017/tcp
$
$ docker stop mc-1_jon_mongo_1
mc-1_jon_mongo_1
$ docker stop mc-1_jon_jupyter_1
mc-1_jon_jupyter_1
$ docker-compose ps
       Name                    Command             State    Ports
-----------------------------------------------------------------
mc-1_jon_jupyter_1   jupyter notebook              Exit 0
mc-1_jon_mongo_1     docker-entrypoint.sh mongod   Exit 0
$

We’re going to create a location for our backups and then run a container, whose sole purpose is to copy the contents of the volume’s mount point to that location and then exit. You can find the full mounted paths for each of these volumes in the docker-compose.yml file.

$ mkdir backup
$ docker run --rm --volumes-from mc-1_jon_mongo_1 -v $(pwd)/backup:/backup ubuntu bash -c 'tar cvf /backup/mongo-volume.tar /data/db'
tar: Removing leading `/' from member names
/data/db/
/data/db/WiredTiger.turtle
.
.
$
$ ls -l backup
total 328372
-rw-r--r-- 1 jon staff 333230080 Jun  1 11:26 mongo-volume.tar
$

Let’s also back up the Jupyter home directory:

$ docker run --rm --volumes-from mc-1_jon_jupyter_1 -v $(pwd)/backup:/backup ubuntu bash -c 'tar cvf /backup/jupyter-volume.tar /home/jovyan'
tar: Removing leading `/' from member names
/home/jovyan/
/home/jovyan/.yarn/
.
.
$
$ ls -l backup
total 328592
-rw-r--r-- 1 jon staff    225280 Jun  1 11:30 jupyter-volume.tar
-rw-r--r-- 1 jon staff 333230080 Jun  1 11:26 mongo-volume.tar
$

Now back up the code used to build the environment:

$ tar zcvf backup/build.tgz config containers docker-compose.yml
a config
a config/jupyter.env
a containers
a containers/jupyter
a containers/jupyter/Dockerfile
a docker-compose.yml
$
$ ls -l backup/
total 328596
-rw-r--r-- 1 jon staff       766 Jun  1 11:47 build.tgz
-rw-r--r-- 1 jon staff    225280 Jun  1 11:30 jupyter-volume.tar
-rw-r--r-- 1 jon staff 333230080 Jun  1 11:26 mongo-volume.tar

We now have everything we need to rebuild our environment. However, in order to do that, we need to copy those files to our cloud VM. On the cloud VM, create a directory called backup in the home directory. Now, from our local machine, copy the backup directory to that target directory:

$ scp backup/* jon@xxx.xxx.xxx.xxx:backup
Enter passphrase for key '/Users/jon/.ssh/id_rsa':
build.tgz                         100%  766     9.2KB/s   00:00
jupyter-volume.tar                100%  220KB 421.6KB/s   00:00
mongo-volume.tar                  100%  318MB 996.2KB/s   05:26
$

And we can now extract the contents to build our cloud application

$ tar zxvf backup/build.tgz
config/
config/jupyter.env
containers/
containers/jupyter/
containers/jupyter/Dockerfile
docker-compose.yml
$

For security reasons, Docker is not generally available to non-privileged users, so every call requires sudo. We can avoid this by adding our current user to the docker group and switching to that group, which should have been created automatically for you.

$ sudo gpasswd -a $USER docker
$ newgrp docker
$
$ docker-compose up -d --build
Creating network "jon_default" with the default driver
Creating volume "jon_jupyter_home" with default driver
Creating volume "jon_mongo_data" with default driver
Building jon_jupyter
.
.
.
ab4327c34933: Pull complete
80003bc32b79: Pull complete
Digest: sha256:93bd5412f16f3b9f7e12eb94813087f195dad950807a8ca74aa2db080c203990
Status: Downloaded newer image for mongo:latest
Creating jon_jon_jupyter_1 ...
Creating jon_jon_mongo_1 ...
Creating jon_jon_jupyter_1
Creating jon_jon_mongo_1 ... done
jon@jm-docker-vm:~$ docker-compose ps
      Name                  Command          State   Ports
--------------------------------------------------------------------------
jon_jon_jupyter_1 jupyter notebook             Up   0.0.0.0:8888->8888/tcp
jon_jon_mongo_1   docker-entrypoint.sh mongod  Up   0.0.0.0:27017->27017/tcp
$

Let’s test whether the notebook is accessible by going to the external IP address on port 8888.

We can reach it, but for security reasons we’ll need to know the token before we can log in. As we did before, we can find out that value, but we don’t want to have to do this every time the server comes up or we restart the notebook. Let’s find our token and then set a password. Make sure to use the change-password section further down the page, not the login section at the top. On your remote VM, do the following:

$ docker-compose ps

       Name                    Command           State   Ports
-------------------------------------------------------------------------------
jon_jon_jupyter_1  jupyter notebook             Up     0.0.0.0:8888->8888/tcp
jon_jon_mongo_1    docker-entrypoint.sh mongod  Up     0.0.0.0:27017->27017/tcp

$ docker exec jon_jon_jupyter_1 jupyter notebook list

Currently running servers:
http://0.0.0.0:8888/?token=d336b5cb3bf35476952c59ea566e74cf1b66a692e307e146 :: /home/jovyan

In future, you should be able to log in using just the password.

In order to restore the contents of our volumes, we’ll first need to know what those volumes are called in our Azure VM.

$ docker volume ls
DRIVER      VOLUME NAME
local       2e6cd60d75b1cc213ed23f9b2f46644d6215baafdca9be77f2fea8409a34ecd
local       jon_jupyter_home
local       jon_mongo_data
$

Stop the cloud VM application and we’ll write the backup contents over our pre-created volumes.

$ docker-compose down
Stopping jon_jon_mongo_1   ... done
Stopping jon_jon_jupyter_1 ... done
Removing jon_jon_mongo_1   ... done
Removing jon_jon_jupyter_1 ... done
Removing network jon_default
$

First I’ll restore the Jupyter contents. Here, we mount our home directory and a temporary location from our backup, and then copy the contents from one to the other. This is all done in a temporary container whose sole task is to do the copy.

$ docker run --rm -v jon_jupyter_home:/home/jovyan -v $(pwd)/backup:/backup ubuntu bash -c 'cd / && tar xvf /backup/jupyter-volume.tar'
home/jovyan/
home/jovyan/.yarn/
home/jovyan/.yarn/bin/
home/jovyan/.cache/
home/jovyan/.cache/matplotlib/
home/jovyan/.cache/matplotlib/tex.cac
.
.

$

Now we’ll do the same with the Mongo database. This time I’m taking the added precaution of removing the existing contents of the target volume before restoring into it.
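
The command below is a sketch of what that restore might look like. The volume name comes from docker volume ls above, but the backup filename (mongo-volume.tar) and the assumption that it was created with paths relative to /, mirroring the Jupyter backup, are mine.

$ docker run --rm -v jon_mongo_data:/data/db -v $(pwd)/backup:/backup ubuntu bash -c 'rm -rf /data/db/* && cd / && tar xvf /backup/mongo-volume.tar'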

Now start the application:

$ docker-compose up -d
Creating network "jon_default" with the default driver
Creating jon_jon_jupyter_1 ...
Creating jon_jon_mongo_1 ...
Creating jon_jon_jupyter_1
Creating jon_jon_mongo_1 ... done
$

If I now go back to my Jupyter environment, I can see that our previous Untitled.ipynb file has been restored.

Untitled.ipynb appearing in the file list on jupyter.

Clicking on this will now show the state of my work as it was in my local environment.

The project now running in jupyter.

I can also go through each of these cells and test that the Jupyter environment behaves exactly as it did locally, and that my Mongo database is working properly.
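
If you want a quick sanity check from the VM itself, without opening the notebook, you can query the database through the mongo shell inside its container. This is a sketch: the container name is the one reported by docker-compose ps, and it assumes authentication wasn’t enabled in the compose file.

$ docker exec jon_jon_mongo_1 mongo --quiet --eval "db.adminCommand('listDatabases')"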

Conclusion

This brings the third part of our series to an end. You saw how to create a multi-container application to support a data science scenario and then how to transfer the environment to the cloud. While we used a single cloud VM to host one multi-container application, that VM is now capable of hosting multiple such applications.

There are clearly more efficient ways of achieving this, but I’ve taken the approach of delving into the principles in these early stages rather than focusing on best practice.

In future parts of our series, I’ll look at using PaaS services instead of having to maintain container-based ones. I’ll also introduce cognitive services and additional security to make the data science discovery process a little faster.

About the author

Jon Machtynger is a Microsoft Cloud Solution Architect specialising in Advanced Analytics & Artificial Intelligence.

He has over 30 years of experience in understanding, translating and delivering leading technology to the market. He currently focuses on a small number of global accounts, helping align AI and Machine Learning capabilities with strategic initiatives. He moved to Microsoft from IBM, where he was Cloud & Cognitive Technical Leader and an Executive IT Specialist.

Jon has been the Royal Academy of Engineering Visiting Professor for Artificial Intelligence and Cloud Innovation at Surrey University since 2016, where he lectures on various topics, from machine learning and design thinking to architectural thinking.

The post How to use containers in data science with Docker and Azure: Part 3 appeared first on Microsoft Industry Blogs - United Kingdom.

]]>
Master data science with Docker’s storage and data persistence: Part 2 http://approjects.co.za/?big=en-gb/industry/blog/cross-industry/2019/05/31/how-to-use-containers-part-2/ http://approjects.co.za/?big=en-gb/industry/blog/cross-industry/2019/05/31/how-to-use-containers-part-2/#comments Fri, 31 May 2019 09:00:45 +0000 In the second part of this practical guide on using containers, find out how to manage data and storage, and how it can be used.

The post Master data science with Docker’s storage and data persistence: Part 2 appeared first on Microsoft Industry Blogs - United Kingdom.

]]>
An illustration depicting how containers work, with a picture of Bit the Raccoon on the right.

In part 1 of this series, I covered the basics of using Docker on a local environment and building a simple image that could then be customised to include your own requirements. Next, I showed you how to move this image to Microsoft Azure, execute it there, and pull from your cloud registry to a local execution model.

I’d like to highlight that the early articles in this series focus less on good practice than they do on exploring the nature of the capabilities. The reason for this is that when you focus purely on good practice, you risk missing many useful lessons that you can learn while exploring capabilities. I’ll discuss good practice towards the end of the series, but the key objective in the early stages is to understand more about how a container approach supports data science patterns.

A core part of the data science process is collating and cleaning data to build plausible models of the world and evaluate hypotheses. Another vital requirement is working with others across systems and platforms, often over long timescales. In this article, I’ll mostly focus on data persistence at the container file level. To keep this relatively short, I’ll cover databases and other state managers in later articles.

In the first article in this series, I outlined two core assumptions of working with containers:

  • Containers are expected to be stateless and disposable. They provide compute, and, on exit, everything inside that container disappears with it.
  • A container should focus on as few things as possible. This supports a microservices approach and allows designers to combine primitive containers in interesting combinations without worrying too much about interdependencies.

The statelessness of a container provides a basis for scalability and resilience. Multiple copies can run in parallel and if they fail or crash, they can be restarted. Having an external state also allows those microservices to share information with minimal concern about how that data is managed or needing to own its veracity or performance. This is contrary to typical application design.

So, if containers are expected to be stateless, where is that state held? There are a number of options, including:

  • In an external directory—for example, locally accessible disk holding files
  • In a Docker-defined volume—this can be shared between multiple running containers
  • In an external database or network resource—for example, IP:port, message queue

I’m going to show you the first two options by running the container (jon/alpine:1.0) that I created in the previous article. If you need to create it, follow this link to get the Dockerfile and build it locally.

Creating a volume

First, let’s create a volume in the Docker environment. A volume is a Docker-managed resource that containers can easily use as external storage.

$ docker volume create v-common
$ docker volume ls
DRIVER              VOLUME NAME
local               v-common
$

I also have some local files that I want to put into that volume. There are several automated ways of doing this, but I want the principles to make sense. I created a small directory called standard-dir containing some open data sets and the source for a script to run keyframe extraction.

Feel free to use any sample directory. I’m only simulating how you might want to share a set of common libraries, data sets and standard utilities across containers and developers.

It looks like this:

standard-dir/
├── code
│   └── extractkeyframe
│       ├── LICENSE
│       └── xkf.sh
└── data
    ├── open-data
    │   ├── LFB-Incident-data-from-Jan2009-Dec2012.zip
    │   ├── breast-cancer-wisconsin-data.zip
    │   ├── dow_jones_index.zip
    │   └── seattle-crime-stats.zip
    └── test-data

Let’s run a container that mounts both that directory and the Docker volume I just created, so that I can copy the content across. I do this interactively in the ash shell; because of the --rm flag, the stopped container is automatically removed on exit. Focus on the -v flags.

Here, I’m mounting the v-common volume to a directory in the container called /dest, and the current directory (`pwd`) to a directory in the container called /src. Then I copy everything from /src to /dest. Note that any changes that take place in mounted volumes will persist after the container exits.

$ cd standard-dir
$ docker run --rm -v v-common:/dest -v `pwd`:/src -it jon/alpine:1.0 ash
# ls -l /dest /src
/dest:
total 0

/src:
total 0
drwxr-xr-x    4 root     root           128 Apr 22 16:42 data
drwxr-xr-x    3 root     root            96 Apr 22 16:56 code
# cp -R /src/* /dest
# ls -l /dest /src
/dest:
total 8
drwxr-xr-x    4 root     root          4096 Apr 22 17:28 data
drwxr-xr-x    3 root     root          4096 Apr 22 17:28 code

/src:
total 0
drwxr-xr-x    4 root     root           128 Apr 22 16:42 data
drwxr-xr-x    3 root     root            96 Apr 22 17:27 code
#

In that container, create a new file in our /src directory called new-file. Confirm it’s there:

# touch /src/new-file
# ls -l /src
total 0
drwxr-xr-x    4 root     root           128 Apr 22 16:42 data
-rw-r--r--    1 root     root             0 Apr 22 17:45 new-file
drwxr-xr-x    3 root     root            96 Apr 22 17:27 source
#

From another local terminal, list the files in standard-dir. Note that the new file will appear in the host directory.

standard-dir/
├── code
│   └── extractkeyframe
│       ├── LICENSE
│       └── xkf.sh
├── data
│   ├── open-data
│   │   ├── LFB-Incident-data-from-Jan2009-Dec2012.zip
│   │   ├── breast-cancer-wisconsin-data.zip
│   │   ├── dow_jones_index.zip
│   │   └── seattle-crime-stats.zip
│   └── test-data
└── new-file

If I now remove the file from within the host directory (not within the container):

$ rm new-file

When I go back to the container, and list the files, this is what I see:


# ls -l /src
total 0
drwxr-xr-x    4 root     root           128 Apr 22 16:42 data
drwxr-xr-x    3 root     root            96 Apr 22 17:27 source
#

I’ve shown two ways of persisting state beyond the life of a container. The first keeps the state in a separate Docker volume (v-common, in this case) that can be remounted in another container in the future. The second changes state on the host file system, which a future container could also access.

It might seem like a recipe for disaster if containers can cause damage on a host. However, you can also mount directories into a container in read-only mode. Note the ‘ro’ suffix at the end of the /src mount specification. Here, I try to create a new empty file and it is denied.

$ docker run --rm -v `pwd`:/src:ro -it jon/alpine:1.0 ash
# cd /src
# ls -l
total 0
drwxr-xr-x    3 root     root            96 Apr 22 17:27 code
drwxr-xr-x    4 root     root           128 Apr 22 16:42 data
# touch new-file
touch: new-file: Read-only file system

From two different terminal windows, mount the v-common storage volume on two concurrent containers:

Container 1

$ docker run --rm -v v-common:/src -it jon/alpine:1.0 ash
# cd /src
# ls -l
total 8
drwxr-xr-x    3 root     root          4096 Apr 22 17:28 code
drwxr-xr-x    4 root     root          4096 Apr 22 17:28 data
# touch fred

Container 2

$ docker run --rm -v v-common:/src -it jon/alpine:1.0 ash
# cd /src
# ls -l
total 8
drwxr-xr-x    3 root     root          4096 Apr 22 17:28 code
drwxr-xr-x    4 root     root          4096 Apr 22 17:28 data
-rw-r--r--    1 root     root             0 Apr 28 12:08 fred
# touch bloggs

Now go back to container 1 and look at the content.

Container 1

# ls -l
total 8
-rw-r--r--    1 root     root             0 Apr 28 12:08 bloggs
drwxr-xr-x    3 root     root          4096 Apr 22 17:28 code
drwxr-xr-x    4 root     root          4096 Apr 22 17:28 data
-rw-r--r--    1 root     root             0 Apr 28 12:08 fred
#

Both containers have access to the same content and could trample over each other’s changes. I did say this wasn’t necessarily good practice, but how volumes work should be clearer by now.

Let’s expand on this a little and reference storage in Azure itself. Here, I create a new Azure storage account:

$ az storage account create --location eastus --name jondockerstorage --resource-group docker-rg --sku "Standard_LRS"
{
  .
 
  .
}
$

Let’s create a shared file area and get a URL for it:

$ az storage share create --account-name jondockerstorage --name jondockerfileshare
{
  "created": true
}
$ az storage share url --account-name jondockerstorage --name jondockerfileshare
"https://jondockerstorage.file.core.windows.net/jondockerfileshare"
$

Now, mount it locally. How you do this will be different depending on whether you’re using Windows, Mac, or Linux. After mounting, it has a local path:

$ mount
//jondockerstorage@jondockerstorage.file.core.windows.net/jondockerfileshare on /Volumes/jondockerfileshare (smbfs, nodev, nosuid, mounted by jon)
$
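
As an illustration, on a Linux host the same share can typically be mounted over SMB with something along these lines. This is a sketch rather than part of the original walkthrough: it needs the cifs-utils package, one of the storage account keys (retrieved a little further down), and a mount point of your choosing.

$ sudo mkdir -p /mnt/jondockerfileshare
$ sudo mount -t cifs //jondockerstorage.file.core.windows.net/jondockerfileshare /mnt/jondockerfileshare -o vers=3.0,username=jondockerstorage,password=<storage-account-key>,serverino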

Now start a container, mounting that remote share, and make a change within it:

$ docker run --rm -v /Volumes/jondockerfileshare:/remote -it jon/alpine:1.0 ash
# cd /remote/
# touch put-this-in-the-cloud.txt
# exit
$ ls -l /Volumes/jondockerfileshare/
total 0
-rwx------  1 jon  staff  0 27 Apr 17:32 put-this-in-the-cloud.txt
$

Note that the change is reflected locally, but to show that this is really in the cloud, I installed the Azure Storage Explorer so that I can interact with Azure storage natively using drag and drop. I’ve avoided it so far to show that almost everything is scriptable and something you can automate. You can get more information about the Azure Storage Explorer here.

Add your Azure account details to sign in to your Azure subscription, and it should give you similar functionality to Windows Explorer, Mac Finder, or Linux File Manager.

On my Mac, this is what I see after navigating to my shared file area:

A screengrab of a shared file area on a Mac

This clearly isn’t the most performant of solutions, but, again, you can imagine a scenario where a cloud-based SMB share needs to be available globally, perhaps read-only, to numerous processes. A container co-located with the storage would also benefit from it.
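
Combining this with the read-only mount pattern from earlier, you could expose that cloud share to a local container without any risk of writes. A sketch, using the local mount path from my Mac above: reads will work, but any write attempt will be denied just as with the local read-only bind mount.

$ docker run --rm -v /Volumes/jondockerfileshare:/remote:ro -it jon/alpine:1.0 ash
# ls -l /remote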

I’d now like to access that same storage from containers running in the cloud. This needs to be done securely. Let’s find out what the storage keys are for the storage account:

$ az storage account keys list --resource-group docker-rg --account-name jondockerstorage
[
  {
    "keyName": "key1",
    "permissions": "Full",
    "value": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  },
  {
    "keyName": "key2",
    "permissions": "Full",
    "value": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  }
]
$
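
If you’d rather not copy a key by hand, you can capture it into a shell variable and pass that as the --azure-file-volume-account-key value in the container create command below. The variable name here is just my own convention.

$ STORAGE_KEY=$(az storage account keys list --resource-group docker-rg --account-name jondockerstorage --query "[0].value" --output tsv)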

If you still have the service principal details from the previous article, you can use them. If you’ve forgotten them, you can regenerate them using the following script.

You’ll need to substitute your own container registry name and service principal name. The previous article also shows how to create your Azure container registry and push the local Docker image to that registry.

#!/bin/bash

# Modify for your environment.
# ACR_NAME: The name of your Azure Container Registry
# SERVICE_PRINCIPAL_NAME: Must be unique within your AD tenant
ACR_NAME="joncreg"
SERVICE_PRINCIPAL_NAME=jon-acr-sp

# Obtain the full registry ID for subsequent command args
ACR_REGISTRY_ID=$(az acr show --name $ACR_NAME --query id --output tsv)

# Create the service principal with rights scoped to the registry.
# Default permissions are for docker pull access. Modify the '--role'
# argument value as desired:
# acrpull:     pull only
# acrpush:     push and pull
# owner:       push, pull, and assign roles
SP_PASSWD=$(az ad sp create-for-rbac --name http://$SERVICE_PRINCIPAL_NAME --scopes $ACR_REGISTRY_ID --role acrpull --query password --output tsv)
SP_APP_ID=$(az ad sp show --id http://$SERVICE_PRINCIPAL_NAME --query appId --output tsv)

# Output the service principal's credentials; use these in your services and
# applications to authenticate to the container registry.
echo "Service principal ID: $SP_APP_ID"
echo "Service principal password: $SP_PASSWD"

This should provide you with a service principal ID and password. Now use them to start up a container, mounting the file share on /mnt/azfile:

$ az container create --registry-username <service-principal-id> --registry-password <service-principal-password> --resource-group docker-rg --name alpine-ssh --image joncreg.azurecr.io/alpine:1.0 --cpu 1 --memory 0.1 --dns-name-label jm-alpine-ssh-2 --ports 22 --azure-file-volume-share-name jondockerfileshare --azure-file-volume-account-name jondockerstorage --azure-file-volume-account-key <storage-account-key> --azure-file-volume-mount-path /mnt/azfile
{
  .
 
  .
}
$

You should see a JSON response with details confirming your deployment, but you can also find the fully qualified DNS name for your new container using the following:

$ az container show --resource-group docker-rg --name alpine-ssh --query ipAddress.fqdn
"jm-alpine-ssh-2.eastus.azurecontainer.io"
$

Now let’s sign in and interact with the shared area. I’ll create an empty file and test whether I can see it from my local Mac:

$ ssh root@jm-alpine-ssh-2.eastus.azurecontainer.io
root@jm-alpine-ssh-2.eastus.azurecontainer.io's password:
Welcome to Alpine!

The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org/>.

You can setup the system with the command: setup-alpine

You may change this message by editing /etc/motd.

# cd /mnt/azfile/
# ls -l
total 0
-rwxrwxrwx    1 root     root             0 Apr 27 16:32 put-this-in-the-cloud.txt
# touch this-was-created-in-a-container.txt
# exit
Connection to jm-alpine-ssh-2.eastus.azurecontainer.io closed.
$ ls -l /Volumes/jondockerfileshare/
total 0
-rwx------  1 jon  staff  0 27 Apr 17:32 put-this-in-the-cloud.txt
-rwx------  1 jon  staff  0 27 Apr 18:32 this-was-created-in-a-container.txt
$

These are the basics of keeping state outside your container using locally mapped directories and Docker volumes. But when would you use a Docker volume over a locally mapped directory? The Docker documentation suggests that volumes are the preferred mechanism for persisting data generated by and used by Docker containers.

That said, bind mounts (mounting local file system resources directly) depend on the directory structure of the host machine. They are, however, a convenient way of using content that would normally exist in your local environment anyway. The documentation states that volumes have some advantages over bind mounts:

  • Volumes are easier to back up or migrate than bind mounts
  • You can manage volumes using Docker CLI commands or the Docker API
  • Volumes work on both Linux and Windows containers
  • Volumes can be more safely shared among multiple containers
  • Volume drivers let you store volumes on remote hosts or cloud providers, to encrypt the contents of volumes, or to add other functionality
  • New volumes can have their content pre-populated by a container
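
The first point is worth a concrete example. Backing up a volume is just a matter of running a short-lived container that mounts the volume alongside a bind-mounted target directory. This sketch archives the v-common volume from earlier into a local backup directory; the archive filename is mine.

$ mkdir -p backup
$ docker run --rm -v v-common:/data -v $(pwd)/backup:/backup ubuntu bash -c 'cd / && tar cvf /backup/v-common-backup.tar data'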

A read-only volume is also a good way of packaging content that you want to ensure isn’t clobbered. Remember that if you mount a local directory as read-only, you can still damage the content from outside the container environment—for instance, by removing or changing files within the directory that was mounted on a container.

Conclusion

This article touched on using container storage, but we didn’t say much about the data science angle. We’ll explore that more in future articles, once we’ve covered the basics.

Let’s recap what we’ve learnt here:

  • On the assumption that containers are disposable and should deal with minimal scope, we now understand how to retain state outside the container itself.
  • We know three different ways of holding that state outside a container: local directories, Docker volumes, and remotely mounted Azure storage.
  • We can access that same storage from containers running in Azure. In short, a cloud-based resource can access content that someone could also maintain by mounting it locally.

What’s next

In the next part of this series, I’m going to extend the container, data persistence, and data science concepts, using multiple containers to create a more complex application, and then move this to the cloud.

About the author

Jon Machtynger is a Microsoft Cloud Solution Architect specialising in Advanced Analytics & Artificial Intelligence.

He has over 30 years of experience in understanding, translating and delivering leading technology to the market. He currently focuses on a small number of global accounts, helping align AI and Machine Learning capabilities with strategic initiatives. He moved to Microsoft from IBM, where he was Cloud & Cognitive Technical Leader and an Executive IT Specialist.

Jon has been the Royal Academy of Engineering Visiting Professor for Artificial Intelligence and Cloud Innovation at Surrey University since 2016, where he lectures on various topics, from machine learning and design thinking to architectural thinking.

The post Master data science with Docker’s storage and data persistence: Part 2 appeared first on Microsoft Industry Blogs - United Kingdom.

]]>
How to use containers in data science with Docker and Azure: Part 1 http://approjects.co.za/?big=en-gb/industry/blog/technetuk/2019/05/22/how-to-use-containers-part-1/ Wed, 22 May 2019 08:00:04 +0000 In the first part of this practical guide on using containers, find out what it can do for data scientists and how it can be used.

The post How to use containers in data science with Docker and Azure: Part 1 appeared first on Microsoft Industry Blogs - United Kingdom.

]]>
An illustration depicting containers, with a picture of Bit the Raccoon on the right.

In the first part of this introduction to containerisation, Jon Machtynger, Cloud Solution Architect for Advanced Analytics and AI at Microsoft, reveals what it can do for data scientists.

Data science is a broad church, but there are common themes, and most practitioners have a general interest in how to operationalise data science processes. With this in mind, we’ve created a no-nonsense guide to getting started with containerisation, which is increasingly being used by data scientists looking to standardise projects and port easily from a local to a cloud environment.

In this containerisation primer I’ll be using Docker, an application that enables you to build, test and ship containerised applications from your PC or Mac. Other apps are available, but Docker is accessible, easy to install and is currently used by millions of developers.

Docker scales and provides resilience through orchestration capabilities such as Kubernetes. It’s also relatively easy to move to the cloud, where it provides a mechanism for collaborating with many more people in a consistent fashion.

In this first instalment you will learn how to:

  • Use a simple base image, build on it, and interact with it securely.
  • Upload this to Azure and then remotely execute the same functionality there.
  • Show how others might deploy that same functionality in their local environment.

We won’t cover the basics of how to install Docker, as you can find details on how to do this online, either via the official Docker support site, or elsewhere. I also assume that you already know the basics of what containers are, and how they differ from virtual machines (VMs).

There are many articles, such as About Docker, What is a Container, Installing Docker on Linux, Installing Docker on Windows, and Dockerfile Reference, that show how to install Docker, download containers or build a container.

But with this series, we will approach using Docker from a data science angle. And I’ll focus on the following data science inhibitors:

  • Minimising conflicting library versions for different development projects.
  • Consistency across environments and developers against specific design criteria.
  • Avoiding a need to reinstall everything on new hardware after a refresh or failure.
  • Maximising collaboration clarity across groups: consistent libraries, result sets, etc.
  • Extending on-premises autonomy/agility to cloud scale and reach.

We’ll interact with a container as though it was a separate, self-contained process you have complete access to, because that is – after all – what it is. This is a key concept to appreciate.

Getting started

Let’s start with some assumptions about working with containers:

  • Containers are expected to be stateless and disposable. They provide compute and when they exit, everything inside that container could disappear with it. Because they’re stateless, they also provide a basis for scalability and resilience.
  • A container should focus on as few things as possible. This supports a micro-services approach and allows designers to combine primitive containers in interesting combinations without overly worrying about interdependencies.

Building a simple Container

For the examples in this article, you should create a working directory.  In that directory, we’re going to build a new container based on a very small image called alpine.

$ mkdir -p docker4ds
$ cd docker4ds
$ touch Dockerfile

Now edit that Dockerfile to hold the following:

FROM alpine

RUN apk --update add --no-cache openssh
RUN echo 'root:rootpwd' | chpasswd

# Modify sshd_config items to allow login
RUN sed -i 's/#PermitRootLogin.*/PermitRootLogin\ yes/' /etc/ssh/sshd_config && \
    sed -ie 's/#Port 22/Port 22/g' /etc/ssh/sshd_config && \
    sed -ri 's/#HostKey \/etc\/ssh\/ssh_host_key/HostKey \/etc\/ssh\/ssh_host_key/g' /etc/ssh/sshd_config && \
    sed -ir 's/#HostKey \/etc\/ssh\/ssh_host_rsa_key/HostKey \/etc\/ssh\/ssh_host_rsa_key/g' /etc/ssh/sshd_config && \
    sed -ir 's/#HostKey \/etc\/ssh\/ssh_host_dsa_key/HostKey \/etc\/ssh\/ssh_host_dsa_key/g' /etc/ssh/sshd_config && \
    sed -ir 's/#HostKey \/etc\/ssh\/ssh_host_ecdsa_key/HostKey \/etc\/ssh\/ssh_host_ecdsa_key/g' /etc/ssh/sshd_config && \
    sed -ir 's/#HostKey \/etc\/ssh\/ssh_host_ed25519_key/HostKey \/etc\/ssh\/ssh_host_ed25519_key/g' /etc/ssh/sshd_config

# Generate new keys
RUN /usr/bin/ssh-keygen -A && ssh-keygen -t rsa -b 4096 -f  /etc/ssh/ssh_host_key

CMD ["/usr/sbin/sshd","-D"]  # Start the ssh daemon

This starts FROM a base Linux image called alpine, and then adds some custom functionality. I’m going to create a small SSH server, but in practice, I could add anything. The base image is pulled from the public Docker registry, but later I’ll show you how you can also pull your content from a private registry in Azure.

With the first two RUN steps, I add the openssh package to the image and then assign a new password to root. Doing this in clear text isn’t secure practice, but I’m only showing the flexibility of a Dockerfile.

A Dockerfile can have many RUN steps, each of which is a build step that commits a layer of change to the eventual Docker image. I then modify the sshd_config file to allow root login, and generate new SSH host keys.

Lastly, I start the SSH daemon using CMD (CMD is the command that the container executes by default when you launch the built image, and a Dockerfile can have only one CMD).
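
Once the image is built (we’ll do that next), you can see each of those committed layers, and what each step added in size, with docker history:

$ docker history jon/alpine:1.0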

You can find some Dockerfile best practices here.

Now let’s build the image:

$ docker build -t jon/alpine:1.0 .
Sending build context to Docker daemon  109.1kB
Step 1/6 : FROM alpine
latest: Pulling from library/alpine
bdf0201b3a05: Pull complete
Digest: sha256:28ef97b8686a0b5399129e9b763d5b7e5ff03576aa5580d6f4182a49c5fe1913
Status: Downloaded newer image for alpine:latest
 ---> cdf98d1859c1
Step 2/6 : RUN apk --update add --no-cache openssh
 ---> Running in d9aa96d42532
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
(1/10) Installing openssh-keygen (7.9_p1-r4)
(2/10) Installing ncurses-terminfo-base (6.1_p20190105-r0)
(3/10) Installing ncurses-terminfo (6.1_p20190105-r0)
(4/10) Installing ncurses-libs (6.1_p20190105-r0)
(5/10) Installing libedit (20181209.3.1-r0)
(6/10) Installing openssh-client (7.9_p1-r4)
(7/10) Installing openssh-sftp-server (7.9_p1-r4)
(8/10) Installing openssh-server-common (7.9_p1-r4)
(9/10) Installing openssh-server (7.9_p1-r4)
(10/10) Installing openssh (7.9_p1-r4)
Executing busybox-1.29.3-r10.trigger
OK: 17 MiB in 24 packages
Removing intermediate container d9aa96d42532
 ---> 1acd76f36c6b
Step 3/6 : RUN echo 'root:rootpwd' | chpasswd
 ---> Running in 8e4a2d38bd60
chpasswd: password for 'root' changed
Removing intermediate container 8e4a2d38bd60
 ---> 4e26a17c921e
Step 4/6 : RUN 	sed -i 's/#PermitRootLogin.*/PermitRootLogin\ yes/' /etc/ssh/sshd_config && 	sed -ie 's/#Port 22/Port 22/g' /etc/ssh/sshd_config && 	sed -ri 's/#HostKey \/etc\/ssh\/ssh_host_key/HostKey \/etc\/ssh\/ssh_host_key/g' /etc/ssh/sshd_config && 	sed -ir 's/#HostKey \/etc\/ssh\/ssh_host_rsa_key/HostKey \/etc\/ssh\/ssh_host_rsa_key/g' /etc/ssh/sshd_config && 	sed -ir 's/#HostKey \/etc\/ssh\/ssh_host_dsa_key/HostKey \/etc\/ssh\/ssh_host_dsa_key/g' /etc/ssh/sshd_config && 	sed -ir 's/#HostKey \/etc\/ssh\/ssh_host_ecdsa_key/HostKey \/etc\/ssh\/ssh_host_ecdsa_key/g' /etc/ssh/sshd_config && 	sed -ir 's/#HostKey \/etc\/ssh\/ssh_host_ed25519_key/HostKey \/etc\/ssh\/ssh_host_ed25519_key/g' /etc/ssh/sshd_config
 ---> Running in 3c85a906e8cd
Removing intermediate container 3c85a906e8cd
 ---> 116defd2d657
Step 5/6 : RUN 	/usr/bin/ssh-keygen -A && ssh-keygen -t rsa -b 4096 -f  /etc/ssh/ssh_host_key
 ---> Running in dba2ff14a17c
ssh-keygen: generating new host keys: RSA DSA ECDSA ED25519
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase): Enter same passphrase again: Your identification has been saved in /etc/ssh/ssh_host_key.
Your public key has been saved in /etc/ssh/ssh_host_key.pub.
The key fingerprint is:
SHA256:T4/Z8FLKLdEwZsTTIhx/g0A5DDuejhcl7B9FdsjFyCc root@dba2ff14a17c
The key's randomart image is:
+---[RSA 4096]----+
|      .=+*+*o    |
|     . .**Eo+    |
|      = .oO=o    |
|     o = + = .   |
|      = S + o    |
|     o o = %     |
|    . o . O =    |
|     .     o     |
|                 |
+----[SHA256]-----+
Removing intermediate container dba2ff14a17c
 ---> 49ee4b262ae4
Step 6/6 : CMD ["/usr/sbin/sshd","-D"]  # Start the ssh daemon
 ---> Running in 2a074ec11e30
Removing intermediate container 2a074ec11e30
 ---> cf85e38faa5e
Successfully built cf85e38faa5e
Successfully tagged jon/alpine:1.0
$

If I now look at my available images, I have the core alpine image used as a basis for my image, and my custom image, which includes the SSH service. Look how tiny my image is – a working SSH server in under 13MB:

$ docker images
REPOSITORY             TAG            IMAGE ID            CREATED             SIZE
jon/alpine             1.0            cf85e38faa5e        52 seconds ago      12.5MB
alpine                 latest         cdf98d1859c1        8 days ago          5.53MB

Notice that during the build of jon/alpine:1.0, the base alpine image had an ID of cdf98d1859c1, which also shows up in the Docker image lists. Let’s use it.

The following creates and then runs a container based on the jon/alpine:1.0 image. It also maps port 2222 on my local machine to port 22 within the running container, which is the default SSH login port. Creating the container returns a long unique identifier, and if I list running containers, the container ID shown is a prefix of that identifier:

$ docker run -d -p 2222:22 jon/alpine:1.0 
db50da6f71ddeb69f1f3bdecc4b0a01c48fcda93f68ee21f2c14032e995d49ff 
$
$ docker ps -a
CONTAINER ID  IMAGE            COMMAND               CREATED         STATUS        PORTS                 NAMES
db50da6f71dd  jon/alpine:1.0   "/usr/sbin/sshd -D"   5 minutes ago   Up 5 minutes  0.0.0.0:2222->22/tcp  obj_austin

I should now also be able to SSH into that container. I log in through port 2222, which maps to the container’s SSH port 22. The hostname for that container is the container ID:

$ ssh root@localhost -p 2222
The authenticity of host '[localhost]:2222 ([::1]:2222)' can't be established.
ECDSA key fingerprint is SHA256:IxFIJ25detXF9HTc5CHffkO2DmhBzBe6EFRqFVj5H6w.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[localhost]:2222' (ECDSA) to the list of known hosts.
root@localhost's password:
Welcome to Alpine!

The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org/>.

You can setup the system with the command: setup-alpine

You may change this message by editing /etc/motd.

db50da6f71dd:~#

Multiple containers often work together owning different capabilities or scaling compute power by running in parallel. Let’s start more of these, each with a different mapped port:

$ docker run -d -p 2223:22 jon/alpine:1.0
5821d6a9e8c73ae0d64f7c59199a948cd43e87d6019b05ff54f01df83557b0f3
$ docker run -d -p 2224:22 jon/alpine:1.0
0305ea0aaf5142c6a89f8802fb67c3b1a768094a81be2bf15578b933c3385f87
$ docker run -d -p 2225:22 jon/alpine:1.0
1e0f3f2ac16f5fcd9a1bb169f07930061d42daea6aec8afeb08132ee5dd5c896
$
$ docker ps -a
CONTAINER ID    IMAGE            COMMAND              CREATED          STATUS          PORTS                  NAMES
1e0f3f2ac16f    jon/alpine:1.0   "/usr/sbin/sshd -D"  34 seconds ago   Up 33 seconds   0.0.0.0:2225->22/tcp   loving_kare
0305ea0aaf51    jon/alpine:1.0   "/usr/sbin/sshd -D"  39 seconds ago   Up 38 seconds   0.0.0.0:2224->22/tcp   nab_ritchie
5821d6a9e8c7    jon/alpine:1.0   "/usr/sbin/sshd -D"  44 seconds ago   Up 42 seconds   0.0.0.0:2223->22/tcp   det_feynman
db50da6f71dd    jon/alpine:1.0   "/usr/sbin/sshd -D"  12 minutes ago   Up 12 minutes   0.0.0.0:2222->22/tcp   obj_austin

After each container starts, it prints its container ID. You can refer to a container using any prefix of its ID, as long as that prefix uniquely identifies it.

Let’s find out more about container 1e0f3f2ac16f using docker inspect 1e0. I’m interested in some network details. I only use the first three characters because they’re enough to uniquely identify it. In fact, I could have identified it with just ‘1’ as no other container ID started with that:

$ docker inspect 1e0 | grep -i address
            "LinkLocalIPv6Address": "",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "GlobalIPv6Address": "",
            "IPAddress": "172.17.0.5",
            "MacAddress": "02:42:ac:11:00:05",
                    "IPAddress": "172.17.0.5",
                    "GlobalIPv6Address": "",
                    "MacAddress": "02:42:ac:11:00:05",
$

Now let’s interact with it directly:

$ docker exec -it 1e0 ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:05
          inet addr:172.17.0.5  Bcast:172.17.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:21 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1558 (1.5 KiB)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
$

Here, I executed the ifconfig command interactively in the container identified (uniquely) by the prefix 1e0. Note that both approaches reported the same IP and MAC address. Once the command finishes, control returns to your host. Let’s interact with another container and look at its network details using both approaches:

$ docker exec -it 030 ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:04
          inet addr:172.17.0.4  Bcast:172.17.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:21 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1558 (1.5 KiB)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

$ docker inspect 030 | grep -i address
            "LinkLocalIPv6Address": "",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "GlobalIPv6Address": "",
            "IPAddress": "172.17.0.4",
            "MacAddress": "02:42:ac:11:00:04",
                    "IPAddress": "172.17.0.4",
                    "GlobalIPv6Address": "",
                    "MacAddress": "02:42:ac:11:00:04",$

This shows that the second container is running, and shares a common Docker network.

Now let’s move this capability to the cloud. If you don’t have an Azure account, you can get a free trial here.  You’ll also need to install the Azure CLI client, which you can get here.

Now let’s create a resource group to hold (and ringfence) all our resources. This will allow us to cleanly remove it all once we’re done playing with it.

I’m going to use the resource group docker-rg, so do the same, or choose one that works for you. I’m choosing ‘East US’ as my location. Feel free to choose one close to you, but for the purposes of this tutorial, it should make no difference:

$ az group create --name docker-rg --location eastus
{
  "id": "/subscriptions//resourceGroups/docker-rg",
  "location": "eastus",
  "managedBy": null,
  "name": "docker-rg",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": null,
  "type": null
}
$

Within your resource group, you can now create an Azure container registry with the az acr create command. The container registry name must be unique within Azure. I’m using joncreg as my registry name, within my docker-rg resource group.

$ az acr create --resource-group docker-rg --name joncreg --sku Basic --admin-enabled true
{
  "adminUserEnabled": true,
  "creationDate": "2019-04-17T10:31:54.591280+00:00",
  "id": "/subscriptions//resourceGroups/docker-rg/providers/Microsoft.ContainerRegistry/registries/joncreg",
  "location": "eastus",
  "loginServer": "joncreg.azurecr.io",
  "name": "joncreg",
  "networkRuleSet": null,
  "provisioningState": "Succeeded",
  "resourceGroup": "docker-rg",
  "sku": {
    "name": "Basic",
    "tier": "Basic"
  },
  "status": null,
  "storageAccount": null,
  "tags": {},
  "type": "Microsoft.ContainerRegistry/registries"
}
$

This provides you with quite a bit of information, but you can also log in to the registry and query its login server name. We’ll use this later as we start storing and retrieving content from that registry:

$ az acr login --name joncreg
Login Succeeded
$
$ az acr show --name joncreg --query loginServer --output table
Result
------------------
joncreg.azurecr.io
$

Now let’s tag your local image against that Azure login server. This will allow us to refer to it and manage it as a cloud resource. Notice that the image IDs for both your local image and the Azure-tagged image are currently identical:

$ docker tag jon/alpine:1.0 joncreg.azurecr.io/alpine:1.0
$ docker images
REPOSITORY                            TAG              IMAGE ID            CREATED             SIZE
joncreg.azurecr.io/alpine             1.0              cf85e38faa5e        10 hours ago        12.5MB
jon/alpine                            1.0              cf85e38faa5e        10 hours ago        12.5MB
$

And now let’s push the tagged image out to the Azure Container Registry, and then confirm that the image is there:

$ docker push joncreg.azurecr.io/alpine:1.0
The push refers to repository [joncreg.azurecr.io/alpine]
06c6815029d6: Pushed
0b334f069f0f: Pushed
438f073b5999: Pushed
9d8531f069a1: Pushed
a464c54f93a9: Pushed
1.0: digest: sha256:d5080107847050caa2b21124142c217c10b38c776b3ce3a6611acc4116dcabb0 size: 1362
$
$ az acr repository list --name joncreg --output table
Result
--------
alpine
$
$ az acr repository show-tags --name joncreg --repository alpine --output table
Result
--------
1.0
$

So, what can we do with this container in the cloud?

We can execute it, share it with others, or use it as a base for building other containers. But in order to access it, you’ll need to do so securely. Let’s create a service principal, which we’ll use later. In this script, I add my container registry name (joncreg) and specify a unique service principal name (jon-acr-sp):

#!/bin/bash

# Modify for your environment.
# ACR_NAME: The name of your Azure Container Registry
# SERVICE_PRINCIPAL_NAME: Must be unique within your AD tenant
ACR_NAME="joncreg"
SERVICE_PRINCIPAL_NAME=jon-acr-sp

# Obtain the full registry ID for subsequent command args
ACR_REGISTRY_ID=$(az acr show --name $ACR_NAME --query id --output tsv)

# Create the service principal with rights scoped to the registry.
# Default permissions are for docker pull access. Modify the '--role'
# argument value as desired:
# acrpull:     pull only
# acrpush:     push and pull
# owner:       push, pull, and assign roles
SP_PASSWD=$(az ad sp create-for-rbac --name http://$SERVICE_PRINCIPAL_NAME --scopes $ACR_REGISTRY_ID --role acrpull --query password --output tsv)
SP_APP_ID=$(az ad sp show --id http://$SERVICE_PRINCIPAL_NAME --query appId --output tsv)

# Output the service principal's credentials; use these in your services and
# applications to authenticate to the container registry.
echo "Service principal ID: $SP_APP_ID"
echo "Service principal password: $SP_PASSWD"

When I run this, it returns a service principal ID and a password. I’ve obscured mine for obvious reasons, but you will need them later, so make a copy:

Service principal ID: 6xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx2
Service principal password: 6xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxf

Now let’s create and deploy a container, which I’ll call alpine-ssh. It doesn’t do much, so I’ll only allocate a single virtual CPU and 100MB of RAM to it:

$ az container create --resource-group docker-rg --name alpine-ssh --image joncreg.azurecr.io/alpine:1.0 --cpu 1 --memory 0.1 --registry-login-server joncreg.azurecr.io --registry-username 6xxxxxxx-xxxx-xxxx-xxxb-xxxxxxxxxxx2 --registry-password 6xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxf --dns-name-label jm-alpine-ssh-314159 --ports 22
{
  .
 
  .
}
$

You should see a JSON response with loads of details confirming your deployment, but you can also find the fully qualified DNS name for your new container using the following:

$ az container show --resource-group docker-rg --name alpine-ssh --query ipAddress.fqdn
"jm-alpine-ssh-314159.eastus.azurecontainer.io"
$

It’s deployed. And I can now SSH to that address to log in:

$ ssh root@jm-alpine-ssh-314159.eastus.azurecontainer.io
The authenticity of host 'jm-alpine-ssh-314159.eastus.azurecontainer.io (20.185.98.127)' can't be established.
ECDSA key fingerprint is SHA256:IxFIJ25detXF9HTc5CHffkO2DmhBzBe6EFRqFVj5H6w.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'jm-alpine-ssh-314159.eastus.azurecontainer.io,20.185.98.127' (ECDSA) to the list of known hosts.
root@jm-alpine-ssh-314159.eastus.azurecontainer.io's password:
Welcome to Alpine!

The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org/>.

You can setup the system with the command: setup-alpine

You may change this message by editing /etc/motd.

wk-caas-edf9b7736da8406395657de1be9212b0-f3c593a7045e59fd38bbf0:~#

Simple, right? But I’d like to make it easy for other people to use this image as well. Stop and remove all your running containers:

$ docker ps -q
1e0f3f2ac16f
0305ea0aaf51
5821d6a9e8c7
db50da6f71dd
$
$ RUNNING=$(docker ps -q)
$ docker stop $RUNNING
1e0f3f2ac16f
0305ea0aaf51
5821d6a9e8c7
db50da6f71dd
$ docker rm $RUNNING
1e0f3f2ac16f
0305ea0aaf51
5821d6a9e8c7
db50da6f71dd
$

Lastly, remove the images holding your original SSH server:

$ docker images
REPOSITORY                            TAG              IMAGE ID            CREATED             SIZE
joncreg.azurecr.io/alpine             1.0              cf85e38faa5e        10 hours ago        12.5MB
jon/alpine                            1.0              cf85e38faa5e        10 hours ago        12.5MB
$
$ docker rmi jon/alpine:1.0
$ docker rmi joncreg.azurecr.io/alpine:1.0

There is now no local copy of your Docker image, but if you still need to run it locally, then this isn’t a problem. You shouldn’t have to rebuild it. You – and others you allow – can simply get it directly from Azure. So, let’s run one – it will see that there isn’t one locally, download it, and then run the container locally:

$ docker run -d -p 2222:22 joncreg.azurecr.io/alpine:1.0
Unable to find image 'joncreg.azurecr.io/alpine:1.0' locally
1.0: Pulling from alpine
bdf0201b3a05: Already exists
9cb9180e5bb6: Pull complete
6425579f73e9: Pull complete
b7eda421926c: Pull complete
163c36e4f93a: Pull complete
Digest: sha256:d5080107847050caa2b21124142c217c10b38c776b3ce3a6611acc4116dcabb0
Status: Downloaded newer image for joncreg.azurecr.io/alpine:1.0
65fef6f65c6c9fc766e13766d7db4316825170e7ff7923db824127ed78ad0970

$ docker ps
CONTAINER ID        IMAGE                           COMMAND               CREATED              STATUS              PORTS                  NAMES
65fef6f65c6c        joncreg.azurecr.io/alpine:1.0   "/usr/sbin/sshd -D"   About a minute ago   Up About a minute   0.0.0.0:2222->22/tcp   gracious_khorana

We’ve pulled that image from your Azure container registry. If you get an error saying that you need to log in to your Azure Docker container environment, use the following (with your service principal username and password) and retry the previous docker run command:

$ az acr login --name joncreg --username 6xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx2 --pass 6xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxf
Login Succeeded
$

Note: Once you’ve finished, and you’ve created a specific resource group to test this, you might want to consider removing the resource group. It will also stop you being charged for things you’re no longer going to use. In my case, I did the following:

$ az group delete --name docker-rg
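
az group delete will prompt for confirmation and then wait for the deletion to finish; if you’re scripting the teardown, you can skip both:

$ az group delete --name docker-rg --yes --no-wait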

Conclusion

This brings the first phase of our project to an end. Using a standards-based cross platform environment (Docker), you created a virtual compute (Alpine Linux) resource, which was then moved to the cloud (we used Alpine, but you could do the same with other environments such as Ubuntu, CentOS, Windows, etc.).

We then made the image available to be shared as a standard environment, with the option to maintain multiple versions of it (e.g. 1.0, 1.1, 2.0 etc.). Assuming a user has permission, any Docker environment can pull and run that standard image. This means you can package a very specific combination of libraries, environments, and tools, to develop, test, and run in a consistent fashion.
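
As a sketch of what maintaining a new version might look like in practice, reusing the image and registry names from this article: you would build and tag a 1.1 image, push it, and both tags would then be available to anyone with pull access.

$ docker build -t jon/alpine:1.1 .
$ docker tag jon/alpine:1.1 joncreg.azurecr.io/alpine:1.1
$ docker push joncreg.azurecr.io/alpine:1.1
$ az acr repository show-tags --name joncreg --repository alpine --output table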

Next time

In the next part of this series, we’ll look at data persistence using containers and how this can be provided locally and in the cloud. This starts to provide a foundation for something more usable from a data science perspective such as shared code, database access, and shared development frameworks.

Find out more

Accelerate your apps with containers: https://azure.microsoft.com/en-gb/product-categories/containers/

About the author

Jon Machtynger is a Microsoft Cloud Solution Architect specialising in Advanced Analytics & Artificial Intelligence.

He has over 30 years of experience in understanding, translating and delivering leading technology to the market. He currently focuses on a small number of global accounts, helping align AI and Machine Learning capabilities with strategic initiatives. He moved to Microsoft from IBM, where he was Cloud & Cognitive Technical Leader and an Executive IT Specialist.

Jon has been the Royal Academy of Engineering Visiting Professor for Artificial Intelligence and Cloud Innovation at Surrey University since 2016, where he lectures on various topics, from machine learning and design thinking to architectural thinking.

The post How to use containers in data science with Docker and Azure: Part 1 appeared first on Microsoft Industry Blogs - United Kingdom.

]]>