Kubeflow Training Operator
Kubeflow Training Operator V1 Documentation
| WARNING |
| Old version: this page covers Kubeflow Training Operator V1. For the latest information, see the Kubeflow Trainer V2 documentation. A guide for migrating to Kubeflow Trainer V2 is also available. |
Simple Example
---
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  namespace: kubeflow-user-example-com
  name: tfjob-mnist-with-summaries
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /mnist-with-summaries-logs/test
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "64"
  trialTemplate:
    primaryContainerName: tensorflow
    # In this example we can collect metrics only from the Worker pods.
    primaryPodLabels:
      training.kubeflow.org/replica-type: worker
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: ghcr.io/kubeflow/katib/tf-mnist-with-summaries:latest
                    command:
                      - "python"
                      - "/opt/tf-mnist-with-summaries/mnist.py"
                      - "--epochs=1"
                      - "--learning-rate=${trialParameters.learningRate}"
                      - "--batch-size=${trialParameters.batchSize}"
                      - "--log-path=/mnist-with-summaries-logs"
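The link between `parameters`, `trialParameters`, and the `${trialParameters.*}` placeholders in the container command can be easier to follow as a small sketch. The Python below is only an illustration of the substitution idea, not Katib's implementation: `sample_assignment` and `render_command` are hypothetical helpers, and Python's `random` module stands in for the `random` suggestion algorithm.

```python
import random
import re

# Search space copied from the Experiment's `parameters` section.
SEARCH_SPACE = {
    "learning_rate": {"type": "double", "min": "0.01", "max": "0.05"},
    "batch_size": {"type": "int", "min": "32", "max": "64"},
}

# `trialParameters`: placeholder name -> referenced search-space parameter.
TRIAL_PARAMETERS = {"learningRate": "learning_rate", "batchSize": "batch_size"}

# Container command copied from the trialSpec.
COMMAND_TEMPLATE = [
    "python",
    "/opt/tf-mnist-with-summaries/mnist.py",
    "--epochs=1",
    "--learning-rate=${trialParameters.learningRate}",
    "--batch-size=${trialParameters.batchSize}",
    "--log-path=/mnist-with-summaries-logs",
]

PLACEHOLDER = re.compile(r"\$\{trialParameters\.(\w+)\}")


def sample_assignment(space, rng):
    """Hypothetical helper: draw one value per parameter from its feasible range."""
    assignment = {}
    for name, spec in space.items():
        if spec["type"] == "int":
            value = rng.randint(int(spec["min"]), int(spec["max"]))
        else:
            value = rng.uniform(float(spec["min"]), float(spec["max"]))
        assignment[name] = str(value)
    return assignment


def render_command(template, trial_parameters, assignment):
    """Hypothetical helper: fill each placeholder with the referenced parameter's value."""
    def substitute(match):
        return assignment[trial_parameters[match.group(1)]]

    return [PLACEHOLDER.sub(substitute, arg) for arg in template]


if __name__ == "__main__":
    rng = random.Random(0)
    assignment = sample_assignment(SEARCH_SPACE, rng)
    print(render_command(COMMAND_TEMPLATE, TRIAL_PARAMETERS, assignment))
```

Each Trial receives its own assignment, so every TFJob created from the template runs `mnist.py` with a different `--learning-rate` and `--batch-size`.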
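Once the Experiment is applied, tfjob-mnist-with-summaries-* pods are created as shown below and the Trials start running. Three Trials run in parallel (parallelTrialCount: 3), each with two Worker pods (replicas: 2).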
$ kubectl get pods -n kubeflow-user-example-com
NAME                                                 READY   STATUS    RESTARTS   AGE
ktfms-7786d5c68f-hh64k                               2/2     Running   0          2d22h
random-experiment-enas-6fc65474d4-d7f85              1/1     Running   0          2d22h
tfjob-mnist-with-summaries-7gm58mwn-worker-0         2/2     Running   0          11s
tfjob-mnist-with-summaries-7gm58mwn-worker-1         2/2     Running   0          11s
tfjob-mnist-with-summaries-dkr246f7-worker-0         2/2     Running   0          11s
tfjob-mnist-with-summaries-dkr246f7-worker-1         2/2     Running   0          10s
tfjob-mnist-with-summaries-dnh6tkpk-worker-0         2/2     Running   0          11s
tfjob-mnist-with-summaries-dnh6tkpk-worker-1         2/2     Running   0          11s
tfjob-mnist-with-summaries-random-776f9bb45d-rczlr   1/1     Running   0          4m
tutorial-jupyter-lab-01-0                            2/2     Running   0          2d23h
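To see which Worker pods belong to which Trial, the pod names in the listing above can be grouped by their common prefix. The sketch below assumes the worker pods follow the pattern `<trial-name>-worker-<index>`, which is what the output suggests; `group_workers_by_trial` is a hypothetical helper, and the pod names are copied from the listing.

```python
import re
from collections import defaultdict

# Pod names copied from the `kubectl get pods` output above.
POD_NAMES = [
    "ktfms-7786d5c68f-hh64k",
    "random-experiment-enas-6fc65474d4-d7f85",
    "tfjob-mnist-with-summaries-7gm58mwn-worker-0",
    "tfjob-mnist-with-summaries-7gm58mwn-worker-1",
    "tfjob-mnist-with-summaries-dkr246f7-worker-0",
    "tfjob-mnist-with-summaries-dkr246f7-worker-1",
    "tfjob-mnist-with-summaries-dnh6tkpk-worker-0",
    "tfjob-mnist-with-summaries-dnh6tkpk-worker-1",
    "tfjob-mnist-with-summaries-random-776f9bb45d-rczlr",
    "tutorial-jupyter-lab-01-0",
]

# Assumed naming pattern: <trial-name>-worker-<replica-index>.
WORKER_POD = re.compile(r"^(?P<trial>.+)-worker-(?P<index>\d+)$")


def group_workers_by_trial(pod_names):
    """Hypothetical helper: map each Trial name to its sorted worker indexes."""
    trials = defaultdict(list)
    for name in pod_names:
        match = WORKER_POD.match(name)
        if match:  # non-worker pods (suggestion, notebook, ...) are skipped
            trials[match.group("trial")].append(int(match.group("index")))
    return {trial: sorted(indexes) for trial, indexes in trials.items()}


if __name__ == "__main__":
    for trial, indexes in group_workers_by_trial(POD_NAMES).items():
        print(trial, indexes)
```

For this listing the grouping yields three Trials with worker indexes 0 and 1 each, which matches `parallelTrialCount: 3` and `replicas: 2` in the Experiment.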