8 changes: 6 additions & 2 deletions .gitignore
@@ -3,7 +3,8 @@ output
instant_test_output
inference_test_output


other
.havenignore
*.png
*.json
*.diff
@@ -50,4 +51,7 @@ _darcs
!/datasets/*.*
/projects/*/datasets
/models
/snippet
/snippet
wandb
.DS_Store
test.log
78 changes: 2 additions & 76 deletions README.md
@@ -1,78 +1,4 @@
# Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation (CVPR 2022)
# Mask2Former

[Bowen Cheng](https://bowenc0221.github.io/), [Ishan Misra](https://imisra.github.io/), [Alexander G. Schwing](https://alexander-schwing.de/), [Alexander Kirillov](https://alexander-kirillov.github.io/), [Rohit Girdhar](https://rohitgirdhar.github.io/)
### Datasets

[[`arXiv`](https://arxiv.org/abs/2112.01527)] [[`Project`](https://bowenc0221.github.io/mask2former)] [[`BibTeX`](#CitingMask2Former)]

<div align="center">
<img src="https://bowenc0221.github.io/images/maskformerv2_teaser.png" width="100%" height="100%"/>
</div><br/>

### Features
* A single architecture for panoptic, instance and semantic segmentation.
* Support major segmentation datasets: ADE20K, Cityscapes, COCO, Mapillary Vistas.

## Updates
* Add Google Colab demo.
* Video instance segmentation is now supported! Please check our [tech report](https://arxiv.org/abs/2112.10764) for more details.

## Installation

See [installation instructions](INSTALL.md).

## Getting Started

See [Preparing Datasets for Mask2Former](datasets/README.md).

See [Getting Started with Mask2Former](GETTING_STARTED.md).

Run our demo using Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1uIWE5KbGFSjrxey2aRd5pWkKNY1_SaNq)

Integrated into [Huggingface Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the Web Demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/akhaliq/Mask2Former)

Replicate web demo and docker image are available here: [![Replicate](https://replicate.com/facebookresearch/mask2former/badge)](https://replicate.com/facebookresearch/mask2former)

## Advanced usage

See [Advanced Usage of Mask2Former](ADVANCED_USAGE.md).

## Model Zoo and Baselines

We provide a large set of baseline results and trained models available for download in the [Mask2Former Model Zoo](MODEL_ZOO.md).

## License

Shield: [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

The majority of Mask2Former is licensed under a [MIT License](LICENSE).


However, portions of the project are available under separate license terms: Swin-Transformer-Semantic-Segmentation is licensed under the [MIT license](https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation/blob/main/LICENSE), and Deformable-DETR is licensed under the [Apache-2.0 License](https://github.com/fundamentalvision/Deformable-DETR/blob/main/LICENSE).

## <a name="CitingMask2Former"></a>Citing Mask2Former

If you use Mask2Former in your research or wish to refer to the baseline results published in the [Model Zoo](MODEL_ZOO.md), please use the following BibTeX entry.

```BibTeX
@inproceedings{cheng2021mask2former,
title={Masked-attention Mask Transformer for Universal Image Segmentation},
author={Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander Kirillov and Rohit Girdhar},
journal={CVPR},
year={2022}
}
```

If you find the code useful, please also consider the following BibTeX entry.

```BibTeX
@inproceedings{cheng2021maskformer,
title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
journal={NeurIPS},
year={2021}
}
```

## Acknowledgement

Code is largely based on MaskFormer (https://github.com/facebookresearch/MaskFormer).
56 changes: 56 additions & 0 deletions configs/ril/panoptic-segmentation/Base-COCO-PanopticSegmentation.yaml
@@ -0,0 +1,56 @@
MODEL:
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic"  # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    # NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1]  # not used
DATASETS:
  TRAIN: ("rilv7",)
  # TRAIN: ("rilv7-shapenetv1",)
  TEST: ("rilv7-test",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  STEPS: (327778, 355092)  # TODO: adjust; these are the iterations at which the LR is decayed (inherited from the COCO 50-epoch schedule); both lie beyond MAX_ITER, so no decay happens in the current run
  # MAX_ITER: 368750
  # MAX_ITER: 500  # 1000 iterations work well for 10 images
  MAX_ITER: 80000  # with 4k images, training is still not converged after 20k iterations
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 10
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
  AMP:
    ENABLED: True
INPUT:
  IMAGE_SIZE: 512
  MIN_SCALE: 0.1
  MAX_SCALE: 2.0
  FORMAT: "RGB"
  DATASET_MAPPER_NAME: "mask_former_panoptic"
  PATCHES:
    ENABLED: True  # FIXME patch params can go here instead of augmentations.py
TEST:
  EVAL_PERIOD: 500
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 4  # parallel
  # NUM_WORKERS: 0  # non-parallel
VERSION: 2
WANDB:
  GROUP: "rilv7"
  NAME: "default"
53 changes: 53 additions & 0 deletions configs/ril/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml
@@ -0,0 +1,53 @@
_BASE_: Base-COCO-PanopticSegmentation.yaml
MODEL:
  META_ARCHITECTURE: "MaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    IGNORE_VALUE: 255
    NUM_CLASSES: 18
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: False
      INSTANCE_ON: False
      PANOPTIC_ON: True
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8


# log into wandb on darjeeling
# install new detectron on darjeeling, replacing old one
# include the code for wandb visualizer in train_net
# run another training, see how it goes
# figure out where the eval error comes from
# test current model
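
A rough sketch of how a config like this is typically loaded, assuming the fork follows the upstream Mask2Former `train_net.py` setup; `add_ril_config` (for the fork-specific `WANDB` and `INPUT.PATCHES` keys) is hypothetical:

```python
# Sketch of config loading, mirroring upstream Mask2Former's setup().
from detectron2.config import get_cfg
from detectron2.projects.deeplab import add_deeplab_config
from mask2former import add_maskformer2_config

def load_cfg(config_file: str):
    cfg = get_cfg()                   # detectron2 defaults
    add_deeplab_config(cfg)           # DeepLab keys used by the semantic head
    add_maskformer2_config(cfg)       # MASK_FORMER and pixel-decoder keys
    # add_ril_config(cfg)             # hypothetical: registers WANDB / INPUT.PATCHES keys
    cfg.merge_from_file(config_file)  # e.g. maskformer2_R50_bs16_50ep.yaml
    cfg.freeze()
    return cfg
```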
8 changes: 8 additions & 0 deletions configs/ril/panoptic-segmentation/ril1-shnv1.yaml
@@ -0,0 +1,8 @@
_BASE_: maskformer2_R50_bs16_50ep.yaml
DATASETS:
  # TRAIN: ("rilv7",)
  TRAIN: ("rilv7-shapenetv1",)
SOLVER:
  # MAX_ITER: 80000  # 20h
  MAX_ITER: 160000
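
The same two overrides could also be applied at runtime instead of via a derived YAML file; a sketch using standard yacs-style list overrides (`load_cfg` is the hypothetical helper sketched above):

```python
# Sketch: applying the ril1-shnv1 overrides programmatically.
cfg = load_cfg("configs/ril/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml")
cfg.defrost()
cfg.merge_from_list([
    "DATASETS.TRAIN", '("rilv7-shapenetv1",)',  # yacs literal-evals the string into a tuple
    "SOLVER.MAX_ITER", 160000,
])
cfg.freeze()
```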

56 changes: 56 additions & 0 deletions configs/rilv9/panoptic-segmentation/Base-COCO-PanopticSegmentation.yaml
@@ -0,0 +1,56 @@
MODEL:
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic"  # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    # NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1]  # not used
DATASETS:
  TRAIN: ("rilv9",)
  # TRAIN: ("rilv7-shapenetv1",)
  TEST: ("rilv9-test",)
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.0001
  STEPS: (327778, 355092)  # TODO: adjust; these are the iterations at which the LR is decayed (inherited from the COCO 50-epoch schedule); both lie beyond MAX_ITER, so no decay happens in the current run
  # MAX_ITER: 368750
  # MAX_ITER: 500  # 1000 iterations work well for 10 images
  MAX_ITER: 80000  # with 4k images, training is still not converged after 20k iterations
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 10
  WEIGHT_DECAY: 0.05
  OPTIMIZER: "ADAMW"
  BACKBONE_MULTIPLIER: 0.1
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 0.01
    NORM_TYPE: 2.0
  AMP:
    ENABLED: True
INPUT:
  IMAGE_SIZE: 512
  MIN_SCALE: 0.1
  MAX_SCALE: 2.0
  FORMAT: "RGB"
  DATASET_MAPPER_NAME: "mask_former_panoptic"
  PATCHES:
    ENABLED: True  # FIXME patch params can go here instead of augmentations.py
TEST:
  EVAL_PERIOD: 500
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 4  # parallel
  # NUM_WORKERS: 0  # non-parallel
VERSION: 2
WANDB:
  GROUP: "rilv9"
  NAME: "default"
53 changes: 53 additions & 0 deletions configs/rilv9/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml
@@ -0,0 +1,53 @@
_BASE_: Base-COCO-PanopticSegmentation.yaml
MODEL:
  META_ARCHITECTURE: "MaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    IGNORE_VALUE: 255
    NUM_CLASSES: 19
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: False
      INSTANCE_ON: False
      PANOPTIC_ON: True
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8


# log into wandb on darjeeling
# install new detectron on darjeeling, replacing old one
# include the code for wandb visualizer in train_net
# run another training, see how it goes
# figure out where the eval error comes from
# test current model
8 changes: 8 additions & 0 deletions configs/rilv9/panoptic-segmentation/ril1-shnv1.yaml
@@ -0,0 +1,8 @@
_BASE_: maskformer2_R50_bs16_50ep.yaml
DATASETS:
  # TRAIN: ("rilv7",)
  TRAIN: ("rilv9-shapenetv1",)
SOLVER:
  # MAX_ITER: 80000  # 20h
  MAX_ITER: 160000
