large-Scale TV Dataset
Multimodal Named Entity Recognition
(STVD-MNER)

The STVD-MNER dataset is designed for the Multimodal Named Entity Recognition (MNER) task. MNER is a well-established research problem spanning Computer Vision (CV), Audio Signal Processing (ASP), and Natural Language Processing (NLP). It leverages visual, audio, and textual information to detect Multimodal Named Entities (MNEs) and determine their types (e.g., location, person, organization). The following figure illustrates examples of MNEs with various representations and entity types. As part of the STVD collection, the STVD-MNER dataset originates from television content, including information from Electronic Program Guides (EPGs) and TV audio-visual (A/V) streams.

Multimodal Named Entity Recognition (MNER) in video

The STVD-MNER dataset is currently released in a β version, which includes a Hello World test set. This test set contains approximately 820 hours of audio-visual content linked to around 9,000 textual Named Entities (NEs) extracted from web sources. In this setting, the MNER task must be performed from scratch, relying only on the provided lists of NEs, the transcripts extracted from the audio streams, and the associated audio-visual data. A Full test set is planned for a future release and is expected to cover several thousand hours of audio-visual material.
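
As a purely textual starting point, the provided NEs can be looked up in a transcript. The sketch below is a minimal baseline and not the method of the dataset authors: it assumes the NEs have already been loaded as plain strings (the JSON schema of the NE lists is not described on this page), the helper name find_ne_mentions is hypothetical, and the audio and visual modalities are ignored entirely.

```python
import re


def find_ne_mentions(transcript: str, entities: list[str]) -> dict[str, list[int]]:
    """Return, for each entity that occurs, the character offsets of its mentions.

    Matching is case-insensitive and rejects matches glued to other word
    characters, so "Gibbs" does not match inside "Gibbsian".
    """
    mentions: dict[str, list[int]] = {}
    for entity in entities:
        # re.escape keeps hyphens and other punctuation in the NE literal.
        pattern = re.compile(r"(?<!\w)" + re.escape(entity) + r"(?!\w)", re.IGNORECASE)
        offsets = [match.start() for match in pattern.finditer(transcript)]
        if offsets:
            mentions[entity] = offsets
    return mentions
```

Exact string matching of this kind misses transcription errors and spelling variants, which is one reason the task also relies on the audio-visual data.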

The dataset is structured according to the following naming convention:

CX / NameCode / NEs_list_imdb.json
                NEs_list_stvdkgstr.json
                NEs_list_stvdkgall.json
                day / ts_epg.csv
                      ts_video.mp4
                      ts_audio.mp4
                      ts_transcript_wt.srt
                      ts_transcript_wt.txt
                      ts_transcript_wb.srt
                      ts_transcript_wb.txt
                      ts_transcript_ws.srt
                      ts_transcript_ws.txt
                      ts_transcript_wm.srt
                      ts_transcript_wm.txt
                      ts_transcript_wl.srt
                      ts_transcript_wl.txt
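
The naming convention can be turned into a small path builder, for instance to check that an extracted archive is complete. The sketch below assumes that CX, NameCode, day and ts are placeholders replaced by concrete values, with ts used as a file-name prefix; this reading is not spelled out on this page, and the function names are hypothetical.

```python
from pathlib import PurePosixPath

# Suffixes and extensions exactly as they appear in the listing.
NE_LIST_SOURCES = ("imdb", "stvdkgstr", "stvdkgall")
TRANSCRIPT_SUFFIXES = ("wt", "wb", "ws", "wm", "wl")
TRANSCRIPT_FORMATS = ("srt", "txt")


def ne_list_files(channel: str, name_code: str) -> list[PurePosixPath]:
    """NE lists stored at the CX / NameCode level (three files)."""
    base = PurePosixPath(channel) / name_code
    return [base / f"NEs_list_{source}.json" for source in NE_LIST_SOURCES]


def recording_files(channel: str, name_code: str, day: str, ts: str) -> list[PurePosixPath]:
    """Files expected for one recording under CX / NameCode / day."""
    day_dir = PurePosixPath(channel) / name_code / day
    files = [
        day_dir / f"{ts}_epg.csv",
        day_dir / f"{ts}_video.mp4",
        day_dir / f"{ts}_audio.mp4",
    ]
    # One transcript per suffix and per format: 5 x 2 = 10 files.
    files += [
        day_dir / f"{ts}_transcript_{suffix}.{ext}"
        for suffix in TRANSCRIPT_SUFFIXES
        for ext in TRANSCRIPT_FORMATS
    ]
    return files
```

Under these assumptions each recording maps to 13 files: one EPG file, two A/V files and ten transcripts.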

For understanding and testing purposes, a selection of samples (NEs and types with their associated audio, textual and video segments) is provided in the following table.

          NE           Type  Audio  Text  Video
sample 1  Gibbs        PER   1_a    1_t   1_v
sample 2  Saint-Denis  LOC   2_a    2_t   2_v

The dataset is available for non-commercial research purposes only. To access it, users must first download the agreement (available in English or French), complete and sign it, and then send a scanned copy by email to Mathieu Delalandre. After reviewing and validating the request, we will provide the password required to extract the dataset.

The dataset can be downloaded from the table below, which also presents general statistics. For easier distribution and access, the dataset is split into several archive packages of 16 GB each. The storage service at UT typically offers download speeds between 3 MB/s and 16 MB/s, depending on connection speed and concurrent access. As a result, downloading each package usually takes between 30 and 45 minutes. All packages must be downloaded to extract the complete dataset.
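
To plan a download, the figures above can be combined in a short calculation. This is only an estimate, assuming decimal units (1 GB = 1000 MB), a constant transfer speed and sequential downloads; the function name is hypothetical.

```python
PACKAGE_SIZE_GB = 16   # size of one archive package, from the text above
PACKAGE_COUNT = 18     # number of packages, from the statistics table


def download_minutes(size_gb: float, speed_mb_s: float) -> float:
    """Minutes needed to transfer size_gb at speed_mb_s (1 GB = 1000 MB)."""
    return size_gb * 1000 / speed_mb_s / 60


if __name__ == "__main__":
    for speed in (3, 16):  # bounds of the stated speed range, in MB/s
        per_package = download_minutes(PACKAGE_SIZE_GB, speed)
        total_hours = per_package * PACKAGE_COUNT / 60
        print(f"{speed} MB/s: {per_package:.0f} min per package, "
              f"{total_hours:.1f} h for all {PACKAGE_COUNT} packages")
```

Under these assumptions a package takes about 17 minutes at 16 MB/s and about 89 minutes at 3 MB/s. The typical 30 to 45 minutes corresponds to a sustained speed of roughly 6 to 9 MB/s, which puts the full set of 18 packages at about 9 to 13.5 hours when downloaded one after another.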

Duration (h)  Channels  Collections  NEs files  NEs    EPG files  A/V files  Transcripts¹  Size (GB)  Packages  Link
819           7         284          284 x3     9,256  843 x1     876 x2     843 x10       280        18        download

110 additional inconsistent files appear. They are listed here and will be removed in a future release.

For clarity, we detail here technical and scientific aspects of the STVD-MNER dataset. Further information can be found in the research papers [1,2].

      tiny    base    small
Th    42.5    30.9    11.9
Dmax  19.3 h  26.5 h  68.8 h
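
Th and Dmax are not defined on this page. The values are nevertheless consistent with Dmax ≈ 819 / Th in every column, where 819 is the dataset duration in hours. Reading Th as a processing speed relative to real time and Dmax as the time needed to process the whole dataset is therefore a plausible interpretation, but it is an assumption; the short check below only verifies the arithmetic.

```python
TOTAL_HOURS = 819  # dataset duration, from the statistics table

# Values copied from the table; their interpretation is an assumption.
TH = {"tiny": 42.5, "base": 30.9, "small": 11.9}
DMAX_HOURS = {"tiny": 19.3, "base": 26.5, "small": 68.8}


def predicted_dmax(th: float, total_hours: float = TOTAL_HOURS) -> float:
    """Hours needed to process total_hours of content at a speed factor th."""
    return round(total_hours / th, 1)


if __name__ == "__main__":
    for column, th in TH.items():
        print(column, predicted_dmax(th), DMAX_HOURS[column])
```
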
  1. H.G. Vu, N. Friburger, A. Soulet and M. Delalandre. stvd-kg: A Knowledge Graph for French Electronical Program Guides. International Conference on Web Information Systems Engineering (WISE), 2025.
  2. F. Rayar, M. Delalandre and V.H. Le. A large-scale TV video and metadata database for French political content analysis and fact-checking. Conference on Content-Based Multimedia Indexing (CBMI), pp. 181-185, 2022.