AI Task
When you create a model in Model, you must define the standardized AI Task that the model addresses.
In a data pipeline, a model is a component designed to tackle a specific AI Task. By standardizing the data format of model outputs into AI Tasks, models become modular: you can swap different model sources in and out of a Pipeline's AI component as long as they're designed for the same AI Task. Model also adheres to the standard AI Task formats for data integration in Pipelines.
Currently, Model outlines the data interface for popular tasks:
Image Classification
: Categorizing images into predefined classes.

Object Detection
: Identifying and localizing multiple objects in images.

Keypoint Detection
: Identifying and localizing multiple keypoints of objects in images.

OCR (Optical Character Recognition)
: Identifying and recognizing text in images.

Instance Segmentation
: Identifying, localizing, and outlining multiple objects in images.

Semantic Segmentation
: Categorizing image pixels into predefined classes.

Text to Image
: Generating images from input text prompts.

Chat
: Generating text responses from multimodal conversation inputs, texts, or images.

Completion
: Generating text that completes the input texts.

Embedding
: Generating embeddings from multimodal inputs, texts, or images.

Custom
: Custom task type that allows freeform input and output data formats.

- The list is expanding ... 🌱
All supported tasks follow our Standardized AI Task Spec, which strictly defines the input and output formats.
The tasks listed above focus on analyzing and understanding the content of unstructured data in a manner similar to human cognition. The objective is to enable a computer/device to provide a description for the data that is as comprehensive and accurate as possible. These primitive tasks form the basis for building numerous real-world industrial AI applications. Each task is elaborated in the respective section below.
Image Classification
Image Classification is a vision task that assigns a single predefined category label to an entire input image. Generally, an Image Classification model takes an image as input and outputs a prediction about what category this image belongs to, along with a confidence score (usually between 0 and 1) representing the likelihood that the prediction is correct.
Image Classification task
{
"task": "TASK_CLASSIFICATION",
"taskOutputs": [
{
"data": {
"category": "golden retriever",
"score": 0.98
}
}
]
}
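As a sketch of how downstream code might consume this output, the snippet below parses a response in the format above and keeps predictions above a confidence threshold (the hard-coded response and the 0.5 threshold are illustrative, not part of the spec):

```python
import json

# A sample response in the Image Classification format shown above;
# in a real Pipeline this JSON would come from the model's task output.
raw = """
{
  "task": "TASK_CLASSIFICATION",
  "taskOutputs": [
    {"data": {"category": "golden retriever", "score": 0.98}}
  ]
}
"""
response = json.loads(raw)

# Collect predictions above an (arbitrary) confidence threshold.
predictions = [
    (out["data"]["category"], out["data"]["score"])
    for out in response["taskOutputs"]
    if out["data"]["score"] >= 0.5
]
print(predictions)
```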
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
MobileNet v2 | GitHub | ONNX | ✅ | ✅ |
Object Detection
Object Detection is a vision task to localize multiple objects of predefined categories in an input image. Generally, an Object Detection model receives an image as input and outputs bounding boxes with category labels and confidence scores for detected objects.
Object Detection task
{
"task": "TASK_DETECTION",
"taskOutputs": [
{
"data": {
"objects": [
{
"category": "dog",
"score": 0.97,
"bounding_box": {
"top": 102,
"left": 324,
"width": 208,
"height": 405
}
},
...
]
}
}
]
}
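The `bounding_box` uses top/left/width/height pixel coordinates. A minimal sketch of post-processing this output, assuming a hypothetical score threshold of 0.5 and a made-up second detection:

```python
# Detections in the format above; the low-score "cat" entry is hypothetical,
# added only to show filtering.
detections = [
    {"category": "dog", "score": 0.97,
     "bounding_box": {"top": 102, "left": 324, "width": 208, "height": 405}},
    {"category": "cat", "score": 0.12,
     "bounding_box": {"top": 10, "left": 20, "width": 50, "height": 60}},
]

def to_corners(box):
    """Convert top/left/width/height to (x1, y1, x2, y2) pixel corners."""
    return (box["left"], box["top"],
            box["left"] + box["width"], box["top"] + box["height"])

# Drop low-confidence detections (threshold is an arbitrary choice).
kept = [d for d in detections if d["score"] >= 0.5]
for d in kept:
    print(d["category"], to_corners(d["bounding_box"]))
```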
Available models
Keypoint Detection
Keypoint Detection is a vision task to localize multiple objects by identifying their predefined keypoints, for example, identifying the keypoints of the human body: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. Normally, a Keypoint Detection model takes an image as input and outputs the coordinates and visibility of keypoints, along with bounding boxes and confidence scores for detected objects.
Keypoint Detection task
{
"task": "TASK_KEYPOINT",
"taskOutputs": [
{
"data": {
"objects": [
{
"keypoints": [
{
"v": 0.53722847,
"x": 542.82764,
"y": 86.63817
},
{
"v": 0.634061,
"x": 553.0073,
"y": 79.440636
},
...
],
"score": 0.94,
"bounding_box": {
"top": 86,
"left": 185,
"width": 571,
"height": 203
}
},
...
]
}
}
]
}
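Each keypoint carries a visibility score `v` alongside its `x`/`y` coordinates. A small sketch of filtering keypoints by visibility (the third keypoint and the 0.3 threshold are assumptions for illustration):

```python
# Keypoints in the format above; the last entry is a hypothetical
# occluded keypoint added to demonstrate filtering.
keypoints = [
    {"v": 0.53722847, "x": 542.82764, "y": 86.63817},
    {"v": 0.634061, "x": 553.0073, "y": 79.440636},
    {"v": 0.02, "x": 0.0, "y": 0.0},
]

VIS_THRESHOLD = 0.3  # arbitrary cut-off; tune per model

# Keep only the keypoints the model considers visible.
visible = [(kp["x"], kp["y"]) for kp in keypoints if kp["v"] >= VIS_THRESHOLD]
print(visible)
```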
Available models
Coming soon!
Optical Character Recognition (OCR)
OCR is a vision task to localize and recognize text in an input image. The task can be done in two steps by multiple models: a text detection model to detect bounding boxes containing text, and a text recognition model to process typed or handwritten text within each bounding box into machine-readable text. Alternatively, there are deep learning models that can accomplish the task in a single step.
OCR task
{
"task": "TASK_OCR",
"taskOutputs": [
{
"data": {
"objects": [
{
"text": "ENDS",
"score": 0.99,
"bounding_box": {
"top": 298,
"left": 279,
"width": 134,
"height": 59
}
},
{
"text": "PAVEMENT",
"score": 0.99,
"bounding_box": {
"top": 228,
"left": 216,
"width": 255,
"height": 65
}
}
]
}
}
]
}
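Since each detected text region comes with a bounding box, you can recover an approximate reading order by sorting boxes top-to-bottom, then left-to-right. This is a naive heuristic, not part of the spec:

```python
# OCR objects from the example above.
objects = [
    {"text": "ENDS", "score": 0.99,
     "bounding_box": {"top": 298, "left": 279, "width": 134, "height": 59}},
    {"text": "PAVEMENT", "score": 0.99,
     "bounding_box": {"top": 228, "left": 216, "width": 255, "height": 65}},
]

# Sort by vertical position first, then horizontal, to approximate
# natural reading order.
ordered = sorted(objects, key=lambda o: (o["bounding_box"]["top"],
                                         o["bounding_box"]["left"]))
line = " ".join(o["text"] for o in ordered)
print(line)  # PAVEMENT ENDS
```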
Available models
Coming soon!
Instance Segmentation
Instance Segmentation is a vision task to detect and delineate multiple objects of predefined categories in an input image. Normally, the task takes an image as input and outputs uncompressed run-length encoding (RLE) representations (variable-length comma-delimited strings), along with bounding boxes, category labels, and confidence scores for detected objects.
Instance Segmentation task
Run-length encoding (RLE) is an efficient form to store binary masks. It is commonly used to encode the location of foreground objects in segmentation. We adopt the uncompressed RLE definition used in the COCO dataset. It divides a binary mask (must be in column-major order) into a series of piecewise constant regions, and for each piece simply stores the length of that piece.
The above image shows examples of encoding masks into RLEs and decoding RLE-encoded masks. Note that the odd-positioned counts in an RLE always count runs of zeros.
Check out functions to encode masks into RLEs and decode masks encoded via RLEs.
{
"task": "TASK_INSTANCE_SEGMENTATION",
"taskOutputs": [
{
"data": {
"objects": [
{
"rle": "2918,12,382,33,...",
"score": 0.99,
"bounding_box": {
"top": 95,
"left": 320,
"width": 215,
"height": 406
},
"category": "dog"
},
{
"rle": "34,18,230,18,...",
"score": 0.97,
"bounding_box": {
"top": 194,
"left": 130,
"width": 197,
"height": 248
},
"category": "dog"
}
]
}
}
]
}
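The encode/decode functions referenced above can be sketched as follows. This is a minimal NumPy implementation of the uncompressed, column-major RLE scheme described in this section (function names are illustrative):

```python
import numpy as np

def rle_decode(rle: str, height: int, width: int) -> np.ndarray:
    """Decode an uncompressed COCO-style RLE string into a binary mask.

    Counts alternate between runs of 0s and runs of 1s, starting with 0s,
    over the mask flattened in column-major (Fortran) order.
    """
    flat = np.zeros(height * width, dtype=np.uint8)
    pos, value = 0, 0
    for count in (int(c) for c in rle.split(",")):
        flat[pos:pos + count] = value
        pos += count
        value = 1 - value
    return flat.reshape((height, width), order="F")

def rle_encode(mask: np.ndarray) -> str:
    """Encode a binary mask into an uncompressed COCO-style RLE string.

    If the mask starts with a 1, the first count is 0, so odd-positioned
    counts always refer to zeros.
    """
    flat = mask.flatten(order="F")
    counts = []
    value, count = 0, 0
    for pixel in flat:
        if pixel == value:
            count += 1
        else:
            counts.append(count)
            value = pixel  # toggles between 0 and 1 for binary masks
            count = 1
    counts.append(count)
    return ",".join(str(c) for c in counts)
```

For example, a 2x2 mask whose right column is foreground flattens (column-major) to `[0, 0, 1, 1]` and encodes as `"2,2"`.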
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
YOLOV7 Stomata | GitHub | PyTorch | ✅ | ✅ |
Semantic Segmentation
Semantic Segmentation is a vision task of assigning a class label to every pixel in the image. Normally, the task takes an image as input and outputs an RLE segmentation mask (a variable-length comma-delimited string) and a category label for each group of pixels.
Semantic Segmentation task
{
"task": "TASK_SEMANTIC_SEGMENTATION",
"taskOutputs": [
{
"data": {
"stuffs": [
{
"rle": "2918,12,382,33,...",
"category": "person"
},
{
"rle": "34,18,230,18,...",
"category": "sky"
},
...
]
}
}
]
}
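Because each `stuffs` entry is a binary mask, you can combine them into a single per-pixel label map. A small sketch using the same uncompressed, column-major RLE scheme as the Instance Segmentation section (the tiny 2x2 masks below are made up for illustration):

```python
import numpy as np

def decode_rle(rle, h, w):
    # Uncompressed COCO-style RLE: alternating run lengths of 0s and 1s,
    # starting with 0s, over the mask in column-major order.
    flat = np.zeros(h * w, dtype=bool)
    pos, val = 0, False
    for count in (int(c) for c in rle.split(",")):
        flat[pos:pos + count] = val
        pos += count
        val = not val
    return flat.reshape((h, w), order="F")

# Hypothetical 2x2 stuff masks; real RLEs come from the model output above.
stuffs = [
    {"rle": "0,2,2", "category": "person"},  # left column
    {"rle": "2,2", "category": "sky"},       # right column
]

# Paint each category onto a per-pixel label map.
label_map = np.full((2, 2), "", dtype=object)
for stuff in stuffs:
    label_map[decode_rle(stuff["rle"], 2, 2)] = stuff["category"]
print(label_map)
```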
Available models
Coming soon!
Text to Image
Text to Image is a Generative AI Task to generate images from text inputs. Generally, the task takes descriptive text prompts as input and outputs Base64-encoded images generated from those prompts.
Text to Image
{
"task": "TASK_TEXT_TO_IMAGE",
"taskOutputs": [
{
"data": {
"choices": [
{
"finish-reason": "successful",
"image": "/9j/4AAQSkZJRgABAQAAAQABAAD/..."
}
...
]
}
}
]
}
Decode Base64 images
In the above example, the generated images are returned as a list of Base64-encoded strings. To obtain the images, decode the Base64 data as in the snippet below.
import base64
# `out` is one entry of "taskOutputs" from the response above
base64_image = out['data']['choices'][0]["image"]
image = base64.b64decode(base64_image)
# Save the decoded image
filename = 'text_to_image.jpg'
with open(filename, 'wb') as f:
f.write(image)
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
Stable Diffusion XL | GitHub | PyTorch | ✅ | ✅ |
Completion
Completion is a Generative AI Task to generate new text from text inputs. Generally, the task takes incomplete text prompts as the input, and produces new text based on the prompts. The task can fill in incomplete sentences or even generate full stories given the first words.
Completion task
{
"task": "TASK_COMPLETION",
"taskOutputs": [
{
"data": {
"choices": [
{
"created": 1728894452,
"finish-reason": "length",
"index": 0,
"content": "The winds of change a re blowing string, bring new beginnings, righting worngs. The world a round us is constantly turning, and with each sunrise, our spirits are yearning."
}
]
}
}
]
}
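The `finish-reason` field tells you why generation stopped. Assuming it follows the common LLM API convention (which the example above suggests), `"length"` means the model hit its token limit, so the text may be cut off mid-sentence:

```python
# A choice in the Completion output format above; the interpretation of
# "finish-reason" values as the common LLM API convention is an assumption.
choice = {
    "created": 1728894452,
    "finish-reason": "length",
    "index": 0,
    "content": "The winds of change are blowing strong, ...",
}

# "length" indicates the generation stopped at the token limit.
truncated = choice["finish-reason"] == "length"
if truncated:
    print("Warning: completion was truncated at the token limit.")
```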
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
Code Llama | GitHub | Transformer | ✅ | ✅ |
Depending on your internet speed, importing LLM models may take a while. Some models support only GPU deployment. By default, Model can access all your GPUs.
Chat
Chat is a Generative AI Task to generate new text from text inputs in a chat style. Generally, the task takes a conversation history with multimodal media types as input and produces a new text response. The task can hold a conversation and answer questions based on previous context.
Chat task
{
"task": "TASK_CHAT",
"taskOutputs": [
{
"data": {
"choices": [
{
"created": 1728894509,
"finish-reason": "length",
"index": 0,
"message": {
"content": "In the image, there is a person riding a sculpture of a white llama. The person is wearing a white shirt and blue jeans, and is holding onto the llama's neck. The llama sculpture is standing on a metal pole. The background shows a clear blue sky and a barren landscape with some structures in the distance.",
"role": "assistant"
}
}
]
}
}
]
}
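A minimal sketch of pulling the assistant's reply out of this structure (the hard-coded response stands in for a real model call):

```python
# A response in the Chat output format shown above.
response = {
    "task": "TASK_CHAT",
    "taskOutputs": [
        {"data": {"choices": [
            {"created": 1728894509, "finish-reason": "length", "index": 0,
             "message": {"content": "In the image, there is a person ...",
                         "role": "assistant"}}
        ]}}
    ]
}

# Take the first choice's message as the assistant's reply.
reply = response["taskOutputs"][0]["data"]["choices"][0]["message"]
print(f'{reply["role"]}: {reply["content"]}')
```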
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
Llama2 Chat | GitHub | Transformer | ✅ | ✅ |
Llama3 Instruct | GitHub | Transformer | ✅ | ✅ |
Zephyr | GitHub | Transformer | ✅ | ✅ |
TinyLlama | GitHub | Transformer | ✅ | ✅ |
Llava | GitHub | Transformer | ✅ | ✅ |
Phi3.5 Vision | GitHub | Transformer | ✅ | ✅ |
Depending on your internet speed, importing LLM models may take a while. Some models support only GPU deployment. By default, Model can access all your GPUs.
Embedding
Embedding is a task that generates embedding vectors from different input types; the vectors are optimized for computing similarity scores downstream.
Embedding task
{
"task": "TASK_EMBEDDING",
"taskOutputs": [
{
"data": {
"embeddings": [
{
"created": 1728895998,
"index": 0,
"vector": [
-0.07706715911626816, -0.006987401749938726, 0.010063154622912407,
...
]
},
{
"created": 1728896002,
"index": 1,
"vector": [
-0.07486195862293243, 0.026177942752838135, 0.014957821927964687,
...
]
}
]
}
}
]
}
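Since these vectors are meant for similarity scoring, a typical next step is cosine similarity between two embeddings. A small sketch (the truncated three-element vectors are illustrative; real embeddings are much longer):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Truncated vectors based on the example output above (illustrative only).
v0 = [-0.07706715911626816, -0.006987401749938726, 0.010063154622912407]
v1 = [-0.07486195862293243, 0.026177942752838135, 0.014957821927964687]

score = cosine_similarity(v0, v1)
print(score)
```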
Available models
Model | Sources | Framework | CPU | GPU |
---|---|---|---|---|
gte Qwen2 | GitHub | Transformer | ✅ | ✅ |
Jina Clip V1 | GitHub | Transformer | ✅ | ✅ |
Stella EN V5 | GitHub | Transformer | ✅ | ✅ |
Depending on your internet speed, importing LLM models will take a while. Some models support only GPU deployment. By default, Model can access all your GPUs.
Custom Task
Model is very flexible and accepts models even if the task is not yet standardized or the model's output can't be converted to the format of a supported AI Task. Such a model is classified as a Custom task.
Custom task
{
"task": "TASK_CUSTOM",
"taskOutputs": [
{
"data": {
"key1": ...,
"key2": ...
}
}
]
}
Suggest a New Task
Currently, the model input and output are converted to a standard format based on the Standardized AI Task Spec.
If you'd like support for a new task, you can create an issue or request it in the #give-feedback channel on Discord.