large-Scale Tv Dataset
Multimodal Named Entity Recognition
(STVD-MNER)

The STVD-MNER dataset addresses the Multimodal Named Entity Recognition (MNER) task in video. MNER in video is an established task in the Computer Vision (CV), Audio Signal Processing (ASP) and Natural Language Processing (NLP) fields, which uses visual, audio and textual information to detect Multimodal Named Entities (MNEs) as well as to identify their types (e.g. location, person, organization). The next figure gives examples of MNEs with different representations and types. As STVD-MNER is part of the STVD collection, the multimodality is derived from television content appearing in Electronic Programming Guides (EPG) and TV Audio/Video (A/V) streams.

Multimodal Named Entity Recognition (MNER) in video

The STVD-MNER dataset is provided in a β version with a Hello World test set. This Hello World test set contains about 820 hours of audio/video data with ≈9K textual Named Entities (NEs). It is divided into two distinct subsets of collections called top and last. The top subset contains ≈85% of the NEs in ≈⅓ of the dataset. The last subset covers the remaining ≈⅔ and has few (or no) NEs per collection. For this subset, the MNER task has to be driven from scratch with the transcriptions of the audio streams. A pending Full test set would cover several thousands of A/V data.

The dataset is composed of different parts provided as:

a root directory for each TV channel,
for each TV channel, collection subdirectories,
for each collection subdirectory, a set of JSON files containing the lists and types of NEs,
for each collection subdirectory, daily subdirectories containing the files of broadcast events (EPG, audio, transcript, video). The EPG data are provided as CSV files summarizing XMLTV content. The audio/video files are encoded with the MPEG-4 format. The transcripts are provided as SRT and TXT files with and without the timestamps.

The following naming convention is applied within the dataset:

`CX`	`/`	`NameCode`	`/`	`NEs_list_imdb.json`
				`NEs_list_stvdkgstr.json`
				`NEs_list_stvdkgall.json`
				`day`	`/`	`t_s_epg.csv`
						`t_s_video.mp4`
						`t_s_audio.mp4`
						`t_s_transcript_wt.srt`
						`t_s_transcript_wt.txt`
						`t_s_transcript_wb.srt`
						`t_s_transcript_wb.txt`
						`t_s_transcript_ws.srt`
						`t_s_transcript_ws.txt`

CX is a directory name for a TV channel, where CX takes a label in {France2, France3, France5, C8, ARTE, W9, TF1},
NameCode is the name of a subdirectory containing a collection; it is a normalized ASCII code using a-z, 0-9 and the symbol _ in place of blanks (e.g. le_grand_betisier). The NameCode has a normalized length of 80 characters,
NEs_list_X are the labels used for naming the lists of NEs, with three templates X = {imdb, stvdkgall, stvdkgstr}. IMDb and STVD-KG are the two web sources used for extracting the NEs. {stvdkgall, stvdkgstr} result from the application of the two parameter settings Δ_KG and Δ_A/V, as detailed below,
day and t_s are two timestamps for naming the subdirectories and files of broadcast events. They have the templates | YEAR | MONTH | DAY | for day (e.g. 20250529) and | YEAR | MONTH | DAY | _ | HOURS | _ | MINUTES | for t_s (e.g. 20250529_09_55), where t_s is set to the starting time of a broadcast event,
the subdirectories of broadcast events have the label day,
the EPG files have the label t_s_epg,
the audio/video files have the labels t_s_audio and t_s_video, respectively,
the transcript files have the labels t_s_transcript_X, where X = {wt, wb, ws}, resulting from the Speech-To-Text (STT) suite Whisper using the 3 transcription models {tiny, base, small}. The transcriptions are provided with and without timestamps using the two formats SRT and TXT, respectively (resulting in 6 transcription files per audio file).
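
The naming convention above can be decoded programmatically. The following is a minimal sketch, not part of the dataset tooling: the function name `parse_event_path`, the returned field names and the example path are illustrative assumptions, and the path only combines examples quoted on this page (it does not claim that this collection belongs to this channel).

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Channel labels and Whisper model codes as listed in the naming convention.
CHANNELS = {"France2", "France3", "France5", "C8", "ARTE", "W9", "TF1"}
WHISPER_MODELS = {"wt": "tiny", "wb": "base", "ws": "small"}

# t_s = YEARMONTHDAY_HOURS_MINUTES, followed by the file label and extension.
EVENT_FILE = re.compile(
    r"^(?P<t_s>\d{8}_\d{2}_\d{2})_"
    r"(?P<label>epg|audio|video|transcript_(?P<model>wt|wb|ws))"
    r"\.(?P<ext>csv|mp4|srt|txt)$"
)


def parse_event_path(path):
    """Split 'CX/NameCode/day/t_s_label.ext' into its naming fields."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"expected CX/NameCode/day/file, got {path!r}")
    channel, name_code, day, filename = parts
    match = EVENT_FILE.match(filename)
    if channel not in CHANNELS or match is None:
        raise ValueError(f"path does not follow the convention: {path!r}")
    model = match["model"]  # None for EPG, audio and video files
    return {
        "channel": channel,
        "collection": name_code,
        "day": day,
        "start": datetime.strptime(match["t_s"], "%Y%m%d_%H_%M"),
        "kind": "transcript" if model else match["label"],
        "model": WHISPER_MODELS.get(model),
        "ext": match["ext"],
    }


# Hypothetical path assembled from the examples given above.
info = parse_event_path(
    "France2/le_grand_betisier/20250529/20250529_09_55_transcript_ws.srt"
)
print(info["kind"], info["model"], info["start"])
```

Using a single regular expression keeps the three file families (EPG, A/V, transcripts) in one place, so a malformed name is rejected rather than silently misread.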

For visualization and testing purposes, some samples (NEs and types with their related audio and video segments) are given in the next table.

	NE	Type	Audio	Text	Video
sample 1	Gibbs	PER	1_a	1_t	1_v
sample 2	Saint-Denis	LOC	2_a	2_t	2_v

The dataset is available for non-commercial research purposes. Before downloading the dataset, get the agreement (in its English or French version) and sign it. Then, send the scanned version to Mathieu Delalandre. After verifying your request, we will contact you with the password to unzip the dataset.

The files constituting the dataset are given here with general statistics. For better accessibility, the dataset is provided as archive packages of 16 GB each. Our storage service at the UT delivers 3-16 MB/s for downloading (from a low- or high-speed connection, respectively) with concurrent access. Downloading requires about 30 to 45 minutes per package. All the packages are needed to uncompress the dataset.

Duration (h)	Channels	Collections	NEs files	NEs	EPG files	A/V files	Transcripts	Size (GB)	Packages	Link
819 h	7	284	439	9,256	843 x1	843 x2	843 x6	281	18	download

For clarification, we introduce here technical and scientific aspects of the STVD-MNER dataset. Further information can be found in the research papers [1,2].

NEs and types: the NEs have been extracted with a robust approach [1] that applies Named Entity Recognition (NER) and Named Entity Linking (NEL) to build up the Knowledge Graph (KG) STVD-KG from EPG (e.g. xmltvfr.fr). The KG STVD-KG is linked to the Web database IMDb. Within STVD-KG and IMDb, the NEs are provided in 3 categories {PER, LOC, ORG} (i.e., person, location and organization). The two time intervals Δ_KG and Δ_A/V covered by a collection in the KG and in the A/V data differ, such that Δ_KG ≫ Δ_A/V. Requests within the KG have been done using the two intervals, for an overall and a strict extraction of NEs, respectively. The next table gives the statistics of the extracted NEs for the two sources (STVD-KG and IMDb) and extraction methods (overall and strict). This results in 3 lists of NEs {imdb, stvdkgall, stvdkgstr} distributed over the 3 categories {PER, LOC, ORG}. For clarification, the Union = imdb ∪ stvdkgall is given (considering stvdkgstr ⊆ stvdkgall). Note that some PER NEs may appear in both sources {imdb, stvdkgall}, which is why the process is referred to as a union.

	PER	LOC	ORG	Total
`imdb`	2,929	0	0	2,929
`stvdkgstr`	326	206	7	539
`stvdkgall`	3,732	2,496	99	6,327
`Union`	≤ 6,661	2,496	99	≤ 9,256
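
The ≤ bounds of the Union row can be reproduced from the per-source counts. The snippet below is a small sketch using the figures of the table; the toy name sets at the end are hypothetical and only illustrate why a duplicate PER entry makes the true union smaller than the sum.

```python
# Per-type NE counts copied from the table above.
imdb = {"PER": 2929, "LOC": 0, "ORG": 0}
stvdkgall = {"PER": 3732, "LOC": 2496, "ORG": 99}

# stvdkgstr is contained in stvdkgall, so it adds nothing to the union.
# Without deduplication, the sum per type is an upper bound of the union size.
upper_bound = {t: imdb[t] + stvdkgall[t] for t in imdb}
total_upper_bound = sum(upper_bound.values())
print(upper_bound, total_upper_bound)

# Toy illustration with hypothetical name lists: one shared PER entry.
imdb_names = {"Gibbs", "Person A"}
kg_names = {"Gibbs", "Person B"}
union_names = imdb_names | kg_names
print(len(union_names), "<=", len(imdb_names) + len(kg_names))
```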

The NEs appear at different levels in the A/V collections, following a near-exponential distribution. The smallest subset of collections, called top, covers the main part of the distribution of NEs. The next table provides its statistics. The top subset contains ≈85% of the NEs in ≈⅓ of the dataset. The largest subset of collections, called last, covers the remaining ≈⅔ and has too few (or no) NEs per collection. For this subset, the MNER task has to be driven from scratch with the transcriptions obtained from the audio streams.

	Collections	Duration	NEs	Min	Mean	Max
`top`	70 (24.6%)	251.7h (30.7%)	7,977 (86.2%)	40	114	893
`last`	214 (75.4%)	567.4h (69.3%)	1,279 (13.8%)	0	6.14	39
`all`	284 (100%)	819.1h (100%)	9,256 (100%)	0	32.6	893
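
As an illustration, the top / last split can be sketched as a threshold on the number of NEs per collection. The threshold of 40 is an assumption inferred from the Min and Max columns of the table (top starts at 40, last ends at 39); the function name and the toy collection counts are hypothetical.

```python
def split_top_last(ne_counts, threshold=40):
    """Split {collection: NE count} into (top, last) at an assumed threshold."""
    top = {c: n for c, n in ne_counts.items() if n >= threshold}
    last = {c: n for c, n in ne_counts.items() if n < threshold}
    return top, last


def percent(part, whole):
    """Share of `part` in `whole`, in percent, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Hypothetical NE counts per collection, for illustration only.
toy_counts = {"collection_a": 893, "collection_b": 40,
              "collection_c": 39, "collection_d": 0}
top, last = split_top_last(toy_counts)
print(sorted(top), sorted(last))

# Shares recomputed from the totals of the table above.
print(percent(70, 284), percent(7977, 9256))
```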

A/V data: the A/V data have been captured as detailed in [2], with adaptations. For quality improvement, the audio has been encoded at 256 kbps, supported by the channels of the French DTT. The video data have been encoded at 1.6 Mbps and an SD resolution (720×576, 30 FPS) for the best trade-off between quality and memory cost. Similar to [2], parameters have been applied to map the captured A/V data to collections, where the start t0 and end t1 times of a broadcast event have been corrected into t0⁻ and t1⁺, respectively. As the A/V capture is asynchronous on the TV Workstation [2], the capture sessions have been bounded to 5×4h a day to minimize the latency between the audio and video. This latency is expressed as L = d_A - d_V with d_A and d_V the audio and video durations, where d_V, obtained with hardware encoding, is the real time. That is, the latency is negative with L ∈ ]-19, -13.9[ seconds (see latency). This latency can be expressed as a linear function L(t) with L(t=4h) = L_min = -19 seconds. The L(t) distribution is then uniform, with an average L_min/2 = -9.5 seconds and an error gap ∓ε with ε = |L_min|/2 = 9.5 seconds. Considering t0 and t1 the timestamps of an audio segment, the mapping to the video segment is obtained as t0 + |L_min|/2 - ε = t0 and t1 + |L_min|/2 + ε = t1 + |L_min|.
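
The audio-to-video mapping described above can be written as a short sketch. The function names are illustrative, and `latency_at` additionally assumes that the linear latency starts at L(0) = 0, which is consistent with the stated average of L_min/2 but is not given explicitly.

```python
L_MIN = -19.0          # seconds, latency at the end of a 4h capture session
SESSION = 4 * 3600.0   # seconds, duration of a capture session


def latency_at(t, l_min=L_MIN, session=SESSION):
    """Linear latency model L(t), assuming L(0) = 0 and L(session) = l_min."""
    return l_min * t / session


def audio_to_video_window(t0, t1, l_min=L_MIN):
    """Map an audio segment [t0, t1] (seconds) to the video search window.

    The segment is shifted by the mean latency |L_min|/2 and widened by
    the error gap eps = |L_min|/2 on both sides.
    """
    shift = abs(l_min) / 2
    eps = abs(l_min) / 2
    return t0 + shift - eps, t1 + shift + eps


print(latency_at(2 * 3600.0))             # mid-session latency
print(audio_to_video_window(100.0, 130.0))
```

With ε equal to the mean shift, the window start stays at t0 and the end is extended by |L_min|, so the video window always covers the audio segment regardless of where it falls in the session.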
STT: the transcript files have been generated using the Whisper STT suite with 3 models {tiny, base, small}. They are provided as SRT and TXT files, with and without the timestamps, respectively. STT requires a huge computation time. To process the n=843 audio files of the dataset (having a total duration of 819h), parallel processing has been deployed. This uses a high-performance computer (DELL 5820, CPU Intel Xeon W-2295, k_max=36 threads, 256 GB of RAM, 36 TB of disk capacity) and a multithreading implementation (a threading parameter k is fixed for each model {tiny, base, small}, where each thread in [1, k] processes a batch of ≈ n/k files). The optimum k parameters have been fixed for every model {tiny, base, small} from the interpolation of Throughput (Th) curves (x=k, y=Th), preventing CPU/memory thrashing. The overall Th is derived from the multithreading processing and computed as $Th = \sum_{i=1}^{k} \frac{D_i}{RT_i}$ with D_i the duration of the audio data to process by the thread i and RT_i its response (or execution) time. For time synchronization, it is important to have an equal duration of audio data to process per thread, such that D_i ≈ 819h/k. This is known as the problem of partitioning into k equal-sum subsets, which is NP-hard with a complexity O(kⁿ) (e.g. with k=36 and n=850 there are more than $10^{10^3}$ candidate solutions). It can be solved with different algorithms like Greedy or Backtracking, resulting in an RT variation of ≈5% between the k threads. Based on these different mechanisms, the obtained optimum Th for the models {tiny, base, small} are given here, together with the overall times to process the dataset.

	`tiny`	`base`	`small`
`Th`	42.5	30.9	11.9
`D_max`	19.3h	26.5h	68.8h
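
The load balancing and throughput computation can be sketched as follows. This is an illustration under assumptions, not the implementation used for the dataset: the greedy variant shown is the classical longest-processing-time heuristic, the function names are hypothetical, and the toy durations are invented. The D_max row above appears consistent with the total duration (819h) divided by Th, which the last function reproduces.

```python
import heapq


def greedy_partition(durations, k):
    """Assign each file, longest first, to the currently least-loaded thread."""
    heap = [(0.0, i) for i in range(k)]      # (load, thread index)
    batches = [[] for _ in range(k)]
    for d in sorted(durations, reverse=True):
        load, i = heapq.heappop(heap)
        batches[i].append(d)
        heapq.heappush(heap, (load + d, i))
    return batches


def throughput(loads, response_times):
    """Th = sum over threads of D_i / RT_i (same time unit for both)."""
    return sum(d / rt for d, rt in zip(loads, response_times))


def overall_time(total_duration_h, th):
    """Hours needed to process the whole dataset at throughput th."""
    return total_duration_h / th


# Toy audio durations (hours), hypothetical values for illustration.
batches = greedy_partition([5, 4, 3, 3, 2, 1], k=3)
print([sum(b) for b in batches])

# Overall processing times from the Th values of the table.
for model, th in {"tiny": 42.5, "base": 30.9, "small": 11.9}.items():
    print(model, round(overall_time(819, th), 1))
```

The heap keeps the selection of the least-loaded thread at O(log k) per file, so the heuristic runs in O(n log n) overall instead of exploring the O(kⁿ) search space.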

[1] H.G. Vu, N. Friburger, A. Soulet and M. Delalandre. stvd-kg: A Knowledge Graph for French Electronical Program Guides. International Conference on Web Information Systems Engineering (WISE), 2025.
[2] F. Rayar, M. Delalandre and V.H. Le. A large-scale TV video and metadata database for French political content analysis and fact-checking. Conference on Content-Based Multimedia Indexing (CBMI), pp. 181-185, 2022.
