Dataset Roundup
Computer Vision
Image Classification
MNIST
MNIST database - Wikipedia
MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges
MNIST Dataset | DeepAI
Fashion-MNIST
GitHub - zalandoresearch/fashion-mnist: A MNIST-like fashion product database. Benchmark
fashion_mnist | TensorFlow Datasets
5.3 Fashion MNIST - PyTorch Chinese Handbook (Pytorch中文手册)
fashion_mnist | TensorFlow Datasets (google.cn)
Fashion MNIST dataset, an alternative to MNIST (keras.io)
CIFAR-10
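CIFAR-10 is a color image dataset that is closer to general real-world objects. It is a small dataset for recognizing general objects, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It contains RGB color images in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
Each image is 32 × 32 pixels and each class has 6,000 images; in total the dataset has 50,000 training images and 10,000 test images.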
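As a quick check of the numbers above, here is a minimal sketch in plain Python. The flat layout it decodes (3,072 values per image: 1,024 red, then 1,024 green, then 1,024 blue, each plane in row-major order) is an assumption taken from the description of the Python version on the CIFAR-10 dataset page, not from these notes. The function names are illustrative only.

```python
# Class names and counts as listed in the notes above.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]
IMAGE_SIDE = 32
CHANNELS = 3
IMAGES_PER_CLASS = 6000
TRAIN_IMAGES = 50_000
TEST_IMAGES = 10_000


def total_images():
    """Total image count implied by classes x images per class."""
    return len(CIFAR10_CLASSES) * IMAGES_PER_CLASS


def row_to_pixels(row):
    """Turn one flat row into a 32x32 grid of (r, g, b) tuples.

    Assumed layout: 1024 red values, then 1024 green, then 1024 blue,
    each plane stored row by row.
    """
    plane = IMAGE_SIDE * IMAGE_SIDE
    if len(row) != CHANNELS * plane:
        raise ValueError(f"expected {CHANNELS * plane} values, got {len(row)}")
    return [
        [
            (row[y * IMAGE_SIDE + x],
             row[plane + y * IMAGE_SIDE + x],
             row[2 * plane + y * IMAGE_SIDE + x])
            for x in range(IMAGE_SIDE)
        ]
        for y in range(IMAGE_SIDE)
    ]
```

The per-class count and the train/test split are consistent: 10 × 6,000 = 60,000 = 50,000 + 10,000.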
CIFAR-10 and CIFAR-100 datasets (toronto.edu)
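Dataset: CIFAR-10 - A detailed guide to the CIFAR-10 dataset: introduction, download, and usage - blog of 一个处女座的程序猿 - CSDN blog
Manually downloading and importing the CIFAR10 dataset - Jianshu (jianshu.com)
Downloading and using the CIFAR10 dataset - Zhihu (zhihu.com)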
ImageNet
ImageNet - Wikipedia
ImageNet (image-net.org)
ImageNet - Wikipedia, the free encyclopedia (Chinese edition) (wikipedia.org)
ImageNet's eight years: Fei-Fei Li and the AI world she changed - Zhihu (zhihu.com)
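In the early computer vision community, the PASCAL Visual Object Classes (VOC) challenges (held from 2005 to 2012) were among the most important competitions. PASCAL VOC is multi-task, covering image classification, object detection, semantic segmentation, and action detection.
The VOC dataset is widely used for object detection. The competition was held annually starting in 2005; it began with only 4 classes and was expanded to 20 classes in 2007. Two versions are in common use: 2007 and 2012. Results are usually reported under two protocols: (1) train on the 5k train/val 2007 images plus the 16k train/val 2012 images and test on test 2007; (2) train on the 10k train/val 2007 + test 2007 images plus the 16k train/val 2012 images and test on test 2012.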
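The two VOC reporting protocols described above can be written down as a small configuration. This is only a sketch: the split keys and protocol names are hypothetical labels chosen here, and the sizes are the rounded figures quoted in the text (in thousands of images).

```python
# Approximate split sizes in thousands of images, as quoted in the notes.
SPLIT_SIZES_K = {
    "voc2007_trainval": 5,
    "voc2007_trainval_plus_test": 10,
    "voc2012_trainval": 16,
}

# Hypothetical protocol names; only the split combinations come from the notes.
PROTOCOLS = {
    "test_on_voc2007": {
        "train": ["voc2007_trainval", "voc2012_trainval"],
        "test": "voc2007_test",
    },
    "test_on_voc2012": {
        "train": ["voc2007_trainval_plus_test", "voc2012_trainval"],
        "test": "voc2012_test",
    },
}


def train_size_k(protocol_name):
    """Approximate training-set size (thousands of images) for a protocol."""
    return sum(SPLIT_SIZES_K[split] for split in PROTOCOLS[protocol_name]["train"])
```

With these rounded figures, the first protocol trains on roughly 21k images and the second on roughly 26k.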
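The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) pushed generic object detection a big step forward. ILSVRC was held annually from 2010 to 2017 and included a detection task based on ImageNet images. The ILSVRC detection data covers 200 classes of visual objects, and its number of images and object instances is two orders of magnitude larger than VOC's. For example, ILSVRC-14 contains 517k images and 534k annotated objects.
MS-COCO is currently the most challenging object detection dataset. A competition based on MS-COCO has been held every year since 2015. It has fewer object categories than ILSVRC but more object instances. For example, MS-COCO-17 contains 164k images and 897k annotated objects from 80 categories. Compared with VOC and ILSVRC, the biggest advance of MS-COCO is that, in addition to bounding-box annotations, each object also has a per-instance segmentation annotation, which helps localize it more precisely. MS-COCO also contains more small objects (with an area less than 1% of the image) and more densely located objects than VOC and ILSVRC. These properties bring its object distribution closer to the real world, and MS-COCO has become the de facto benchmark for the object detection community.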
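The small-object criterion quoted above (area below 1% of the image) can be sketched as follows. It follows the definition in these notes; the official COCO evaluation instead buckets objects by absolute pixel area, so treat the 1% threshold as this document's convention. The function names are illustrative.

```python
def area_fraction(box_w, box_h, image_w, image_h):
    """Fraction of the image area covered by a box of the given size."""
    if box_w < 0 or box_h < 0 or image_w <= 0 or image_h <= 0:
        raise ValueError("box sizes must be >= 0 and image sizes must be > 0")
    return (box_w * box_h) / (image_w * image_h)


def is_small_object(box_w, box_h, image_w, image_h, threshold=0.01):
    """True if the box covers less than `threshold` of the image area."""
    return area_fraction(box_w, box_h, image_w, image_h) < threshold
```

For example, a 30 × 30 box in a 640 × 480 image covers about 0.3% of the image and counts as small, while a 100 × 100 box (about 3.3%) does not.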
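DOTA is a commonly used dataset for detection in aerial remote-sensing images. It contains 2,806 aerial images of roughly 4k × 4k pixels, with 188,282 instances across 15 categories; there are 14 main categories, because small vehicle and large vehicle are both subcategories of vehicle. Each object is annotated as a quadrilateral of arbitrary shape and orientation defined by four points. Aerial images differ from conventional datasets in several ways: larger scale variation, dense small objects, and uncertainty about the targets to be detected. The data is split into 1/2 training, 1/3 test, and 1/6 validation. At the time of writing, the training and validation sets have been released; image sizes range from 800 × 800 to 4000 × 4000.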
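To make the four-point quadrilateral annotation concrete, here is a minimal sketch that takes eight coordinates (x1, y1, ..., x4, y4), computes the quadrilateral's area with the shoelace formula, and derives the enclosing axis-aligned box. The flat eight-number input is an assumption for illustration, not a statement of DOTA's exact file format, and the area calculation assumes the four points are listed in order around a non-self-intersecting shape.

```python
def parse_quad(coords):
    """Group eight numbers (x1, y1, ..., x4, y4) into four (x, y) points."""
    if len(coords) != 8:
        raise ValueError(f"expected 8 coordinates, got {len(coords)}")
    return [(float(coords[i]), float(coords[i + 1])) for i in range(0, 8, 2)]


def quad_area(points):
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def enclosing_box(points):
    """Axis-aligned (xmin, ymin, xmax, ymax) box around the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

For a rotated object the enclosing box can be much larger than the quadrilateral itself, which is one reason oriented annotations suit aerial imagery.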
Object Detection
COCO
COCO - Common Objects in Context (cocodataset.org)
Semantic Segmentation
VOC2012
The PASCAL Visual Object Classes Challenge 2012 (VOC2012) (ox.ac.uk)
Cityscapes
Cityscapes Dataset – Semantic Understanding of Urban Street Scenes (cityscapes-dataset.com)
Mapillary
KITTI
The KITTI Vision Benchmark Suite (cvlibs.net)
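Author: Zhihu user
Link: https://www.zhihu.com/question/30626971/answer/1996387512
Source: Zhihu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.
Reference site: https://awesomeopensource.com/project/jsbroks/awesome-dataset-tools
I think it is already quite comprehensive.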
Awesome Dataset Tools
A curated list of awesome dataset tools
Labeling Tools
Images
- CVAT - Online, interactive video and image annotation tool for computer vision
- COCO Annotator - Web-based image segmentation tool for object detection, localization and keypoints
- VoTT - Visual Object Tagging Tool: An electron app for building end to end object detection models from images and videos.
- Scalabel - Versatile and scalable tool that supports various kinds of annotations
- EVA - Web-based tool for efficient annotation of videos and image sequences, with additional tracking capabilities
- LOST - Design your own smart Image Annotation process in a web-based environment
- Boobs - Fast and efficient BBox annotation for your images in YOLO, VOC/COCO formats
- MuViLab - Tool to help you label videos for computer vision
- Turkey - Web UI on Amazon Mechanical Turk to crowd-source image segmentation
- React Image Annotation - An infinitely customizable image tool built on React
- Point Cloud Annotation Tool - Annotate 3D boxes in point cloud
- ImageTagger - Open source online platform for collaborative image labeling
- DeepLabel - A cross-platform image annotation tool for machine learning
- Visual Object Tagging Tool - An electron app for building end to end Object Detection Models
- VGG Image Annotator - Standalone image annotator application packaged as a single HTML file
- SMART - Efficiently build labeled training datasets for supervised machine learning tasks
- Pixel Annotation Tool - Uses OpenCV's marker-based watershed algorithm to annotate images in directories
- Pixie - GUI annotation tool which provides the bounding box, polygon, and semantic segmentation
- Turktool - Modern React app for scalable bounding box annotation of images
- LabelD - Simple image annotation tool to streamline the overall process
- Comma Coloring - Adult coloring book for image segmentation
- LabelImg - Graphical image annotation tool for labeling object bounding boxes in images
- LCs Finder - Image annotation and object detection tool written in C
- js-segment-annotator - Javascript image annotation tool based on image segmentation
- Cytomine - Analysis of multi-gigapixel images
- labelme - Image Polygonal Annotation with Python (polygon, rectangle, circle, line, point and image-level flag annotation)
- SimpleAnnotate - Open source video and image annotation software, currently only for OSX
- Sloth - Labeling image and video data for computer vision research
- Fast Annotation Tool - Online platform for collaborative image annotation
- Anno-Mage - Helps you in annotating images by suggesting you annotations for 80 object classes
- MedTagger - Collaborative framework for annotating medical datasets using crowdsourcing
- OpenLabeling - Labeling in multiple annotation formats
- Alturos.ImageAnnotation - Collaborative tool for labeling image data for YOLO
- Yolo_mark - GUI for marking bounded boxes of objects in images
- imglab - Speed up and simplify the image labeling/annotation process with multiple supported formats
- OpenLabeler - Open source desktop application for annotating objects
- UltimateLabeling - A multi-purpose Video Labeling GUI with integrated SOTA detector and tracker
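Several of the tools above mention exporting boxes in YOLO, VOC, or COCO formats. The sketch below converts one box between the commonly used conventions, which are assumptions here rather than anything stated in the list: VOC as pixel corners (xmin, ymin, xmax, ymax), COCO as pixel (x, y, width, height) from the top-left corner, and YOLO as a normalized (center x, center y, width, height). It ignores any off-by-one pixel-indexing differences between tools.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, image_w, image_h):
    """Pixel corners -> normalized (cx, cy, w, h)."""
    cx = (xmin + xmax) / 2 / image_w
    cy = (ymin + ymax) / 2 / image_h
    w = (xmax - xmin) / image_w
    h = (ymax - ymin) / image_h
    return (cx, cy, w, h)


def yolo_to_voc(cx, cy, w, h, image_w, image_h):
    """Normalized (cx, cy, w, h) -> pixel corners."""
    xmin = (cx - w / 2) * image_w
    ymin = (cy - h / 2) * image_h
    xmax = (cx + w / 2) * image_w
    ymax = (cy + h / 2) * image_h
    return (xmin, ymin, xmax, ymax)


def voc_to_coco(xmin, ymin, xmax, ymax):
    """Pixel corners -> pixel (x, y, width, height)."""
    return (xmin, ymin, xmax - xmin, ymax - ymin)
```

When moving annotations between tools, it is worth round-tripping a few boxes like this and checking them visually, since a swapped axis or a missed normalization is easy to overlook.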
Closed Source
- DataTorch - Platform for creating and sharing datasets.
- Labelbox - Platform for data labeling, data management, and data science. Its features include image annotation, bounding boxes, text classification, and more
- Supervise.ly - Image annotation and data management tool that you can use to create image and video datasets
- Prodigy - Various machine learning models such as image classification, entity recognition and intent detection
- RectLabel - Label images for bounding box object detection and segmentation
- Lionbridge AI - Quickly annotate thousands of images and videos with relevant tags
- TrainingData.io - Medical image annotation tool for data labeling. Supports DICOM image format for radiology AI
- Spare5 - Crowdsourcing service for tasks such as data and image annotation, language assessment, and more
- Hive - Text and image annotation service that helps you create training datasets
- Figure Eight - Supports audio, computer vision, natural language processing, and other data tasks
- Dataturks - Image segmentation, named entity recognition (NER) tagging in documents, and POS tagging
- Playment - Services offered include bounding boxes, points and lines, polygons, semantic segmentation, and more
- Cogito Tech - Image annotation, content moderation, sentiment analysis, chatbot training
- OCLAVI - Annotate Bounding Box, Polygon, Circle, Point and Cuboidal annotations with precision
- Humans in the Loop - Use cases include face recognition, autonomous vehicles, and figure detection
- WorkAround - Host and annotate data, manage projects, and build datasets alongside top companies
- TaQadam - On-demand annotation with agents-in-the-loop
- Zillin - Image annotation service for classification, object detection and segmentation with API access and georeferenced images support.
- IBM Cloud Annotations - Simple and collaborative image annotation tool for teams and individuals inside the IBM Cloud environment.
- MedSeg - Free online medical annotation (segmentation) with AI models.
- MVTec Deep Learning Tool - Provides labeling functionalities for HALCON's deep-learning-based object detection and classification.
Audio
- Audio Annotator - JavaScript interface for annotating and labeling audio files
- Dynitag - Web-based collaborative audio annotator tool
- EchoML - Play, visualize, and annotate your audio files for machine learning
Closed Source
- Figure Eight - Supports audio, computer vision, natural language processing, and other data tasks
Time Series
- Curve - An integrated experimental platform for time series data anomaly detection
- TagAnomaly - Anomaly detection analysis and labeling tool, specifically for multiple time series
- time-series-annotator - Implements classification tasks for time series.
- WDK - Tools to facilitate the development of activity recognition applications with wearable devices
Text
- brat - For all your textual annotation needs
- doccano - Open source text annotation tool for machine learning practitioners.
- Inception - A semantic annotation platform offering intelligent annotation assistance
- NeuroNER - Named-entity recognition using neural networks
- YEDDA - For annotating chunk/entity/event on text, symbol and even emoji
- TALEN - Web-based tool for annotating word sequences
- WebAnno - Web-based annotation tool for a wide range of linguistic annotations
- MAE - Lightweight, general-purpose natural language annotation tool
- Anafora - Web-based raw text annotation tool
- TagEditor - Label dependencies, parts of speech, named entities, and text categories
- ML-Annotate - Supports binary, multi-label and multi-class labeling of text
Closed Source
- Hive - Text and image annotation service that helps you create training datasets
- Figure Eight - Supports audio, computer vision, natural language processing, and other data tasks
- LightTag - Text annotation tool for teams.
Libraries
Audio
- Muda - Python library for augmenting annotated audio data