Introduction

What is contrastive learning?

  • The key concept of time series contrastive learning is to capture the meaningful underlying information shared between views (usually generated by random data augmentation), such that the learned representations are discriminative among time series.
  • To accomplish this objective, existing contrastive methods rely on the assumption that the augmented views from the same instance (i.e., a positive pair) share meaningful semantic information.

Problems with contrastive learning

  • It has been discovered that the effectiveness of contrastive learning models indeed suffers when two views generated from the same image do not actually share meaningful information (e.g., an unrelated cat-dog pair cropped from the same image).
  • In other words, not all positive pairs are beneficial to contrastive learning.

Noisy Positive pairs

  • This results in noisy contrasting views, where the shared information between views is predominantly noise.
  • As a result, a noisy alignment occurs, in which the model simply learns the pattern of noise rather than the true signal.

Faulty positive pairs

  • Due to the sensitivity of time series, data augmentation may inadvertently impair sophisticated temporal patterns contained in the original signal and thus produce faulty views.
  • Consequently, the augmented view no longer has the same semantic meaning as the original ECG, leading to a faulty alignment where the model learns to align non-representative patterns shared between views.

An intuitive solution is to suppress bad positive pairs during contrastive training; however, directly identifying bad positive pairs in real-world datasets is challenging for two reasons.

  • First, for noisy positive pairs, it is often infeasible to measure the noise level given a signal from a real-world dataset.
  • Second, for faulty positive pairs, one cannot tell whether the semantic meaning of an augmented view has changed without access to the ground truth.

We find noisy positive pairs exhibit relatively small losses throughout the training process. In contrast, faulty positive pairs are often associated with large losses during training.

Dynamic Bad Pair Mining (DBPM) algorithm

  • We design a dynamic bad pair mining (DBPM) algorithm.
  • The proposed DBPM algorithm addresses the bad positive pair problem with a simple concept: identify and suppress.
  • Specifically, DBPM aims for reliable bad positive pair mining and down-weighting in the contrastive training.
  • To this end, DBPM first utilizes a memory module to track the training behavior of each pair along the training process (see the sketch below).
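To make the memory-module idea concrete, here is a minimal sketch (an illustration under my own assumptions, not the authors' implementation): a per-instance history of positive-pair losses keyed by dataset index, appended every epoch and later summarized by its mean.

import numpy as np
from collections import defaultdict

# Hypothetical minimal memory module: for each training instance (keyed by its
# dataset index), store the positive-pair contrastive loss observed at every epoch.
memory_module = defaultdict(list)

def update_memory(memory_module, indices, pair_losses):
    # Append the current epoch's per-pair losses to each instance's history.
    for idx, loss in zip(indices, pair_losses):
        memory_module[int(idx)].append(float(loss))

def mean_historical_loss(memory_module):
    # Summarize each instance's training behavior by its mean historical loss.
    return {idx: float(np.mean(losses)) for idx, losses in memory_module.items()}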

Contribution

First, to the best of our knowledge, this is the first study to investigate the bad positive pair problem in time series contrastive learning. Our study contributes to a deeper understanding of an important issue that has not been thoroughly explored. Second, we propose DBPM, a simple yet effective algorithm designed as a lightweight plug-in that dynamically deals with potential bad positive pairs along the contrastive learning process, thereby improving the quality of learned time series representations. Third, extensive experiments on four real-world large-scale time series datasets demonstrate the efficacy of DBPM in enhancing the performance of existing state-of-the-art methods.

Related works

Our DBPM differs from image-based solutions in the following aspects.

  • First, DBPM is designed for time series contrastive learning, in which not only faulty positive pairs arise but also noisy positive pairs exist. The particular challenge of addressing two different types of bad positive pairs is unique to time series and distinguishes this setting from image data.
  • Furthermore, DBPM identifies bad positive pairs dynamically based on their historical training behaviors, which allows more reliable identification of potential bad positive pairs.

Methods

Problem definition

  • Contrastive learning performance is evaluated via linear probing, i.e., training a linear classifier on top of the frozen pre-trained encoder (see the sketch below).
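As a concrete illustration of the linear-probing protocol (a sketch under assumed placeholder names such as encoder, train_loader, and test_loader; not the paper's exact evaluation pipeline):

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    # Encode all time series with the frozen, pre-trained encoder.
    encoder.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(encoder(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe(encoder, train_loader, test_loader, device="cpu"):
    # Fit a linear classifier on frozen representations; test accuracy serves as a
    # proxy for representation quality.
    x_tr, y_tr = extract_features(encoder, train_loader, device)
    x_te, y_te = extract_features(encoder, test_loader, device)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)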

ANALYSIS: TIME SERIES CONTRASTIVE LEARNING

Symbols

  • $x_i$: A training time series instance in the dataset.
  • $\mathbb{R}^{C \times T}$: The real coordinate space of the time series data, with $C$ channels and $T$ time points.
  • $\mathcal{X}$: The training set containing all time series instances.
  • $N$: The total number of instances in the training set.
  • $t(\cdot)$: A random data augmentation function applied to the time series data.
  • $(x_i^1, x_i^2)$: A pair of different views generated from the original data $x_i$ by the augmentation function $t(\cdot)$, representing a positive pair.
  • $f_\theta$: The time series encoder function parameterized by $\theta$.
  • $(z_i^1, z_i^2)$: Representations of the views $x_i^1$ and $x_i^2$ in the representation space, as encoded by $f_\theta$.
  • $\ell_i$: The contrastive loss for a pair of views $x_i^1$ and $x_i^2$.
  • $s(\cdot, \cdot)$: A score function, specifically the dot product between the normalized representations $z_i^1$ and $z_i^2$.
  • $\tau$: A temperature parameter that scales the contrastive loss.
  • $e_i$: The true signal in the input data that contains the desired features for the encoder to learn.
  • $\epsilon_i$: Spurious dense noise present in the input data.
  • $\mathcal{D}_e$, $\mathcal{D}_\epsilon$: Distributions representing the true signal and the noise, respectively.
  • $\arg\max_\theta$: An operator that finds the values of $\theta$ that maximize the given function.
  • $I(z_i^1; z_i^2)$: Mutual information between the representations of the two views.
  • $\mathcal{D}_{e'}$: A distribution representing an unknown or faulty signal not aligned with the true signal $e_i$.
  • $\partial \ell_i / \partial z_i^1$: The partial derivative of the contrastive loss with respect to the representation $z_i^1$.

Preliminaries.

Consider a training time series $x_i$ in a training set $\mathcal{X}$ with $N$ instances. The time series contrastive learning framework first applies a random data augmentation function $t(\cdot)$ to generate two different views $(x_i^1, x_i^2)$ from the original data (i.e., a positive pair). These two views are then projected to the representation space as $(z_i^1, z_i^2)$ by the time series encoder $f_\theta$. The encoder is learned to maximize the agreement between the positive pair by minimizing the contrastive loss (i.e., InfoNCE), defined as

$$\ell_i = -\log \frac{\exp\big(s(z_i^1, z_i^2)/\tau\big)}{\exp\big(s(z_i^1, z_i^2)/\tau\big) + \sum_{j \neq i} \exp\big(s(z_i^1, z_j)/\tau\big)}, \tag{1}$$

where $s(\cdot, \cdot)$ is a score function that calculates the dot product between the normalized representations $z_i^1$ and $z_i^2$, $z_j$ denotes representations from different time series (i.e., negative pairs), and $\tau$ is a temperature parameter.
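As a concrete illustration, here is a minimal PyTorch sketch of an InfoNCE / NT-Xent loss with in-batch negatives that returns one loss value per positive pair (my own simplified version, not the exact loss_ntxent used in the code excerpt later in these notes):

import torch
import torch.nn.functional as F

def info_nce_per_pair(z1, z2, tau=0.5):
    # z1, z2: (B, D) representations of the two augmented views of B instances.
    # Returns a tensor of B losses, one per positive pair.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                       # (B, B): diagonal entries are the positives
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets, reduction="none")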

Noisy Alignment

Let us consider the noisy positive pair problem. According to [Wen & Li, 2021], each input can be represented in the form $x_i = e_i + \epsilon_i$, where $e_i$ denotes the true signal that contains the desired features we want the encoder to learn, and $\epsilon_i$ is the spurious dense noise. We hypothesize that the noisy positive pair problem is likely to occur on $x_i$ when $\epsilon_i$ dominates $e_i$, because noisy raw signals will produce noisy views after data augmentation. Formally, we define $(x_i^1, x_i^2)$ as a noisy positive pair when $e_i \sim \mathcal{D}_e$ and $\epsilon_i \sim \mathcal{D}_\epsilon$ with $\|\epsilon_i\| \gg \|e_i\|$. Further, we know that minimizing the contrastive loss is equivalent to maximizing the mutual information $I(z_i^1; z_i^2)$ between the representations of the two views [Tian et al., 2020]. Therefore, when noisy positive pairs are present, the information maximized by $\arg\max_\theta I(z_i^1; z_i^2)$ is approximately the information shared between the noise components of the two views. This results in noisy alignment, where the model predominantly learns patterns from noise.

Faulty Alignment

Now, we consider the faulty positive pair problem. In view of the sensitivity of time series, it is possible that random data augmentations (e.g., permutation, cropping [Um et al., 2017]) alter or impair the semantic information contained in the original time series, thus producing faulty views. Formally, we define $(x_i^1, x_i^2)$ as a faulty positive pair when the signal underlying one of its views is drawn from $\mathcal{D}_{e'}$, where $\mathcal{D}_{e'} \neq \mathcal{D}_e$, i.e., the augmentation has changed that view's semantics. Taking the partial derivative of $\ell_i$ w.r.t. $z_i^1$ (full derivation in Appendix A.4), we have

$$\frac{\partial \ell_i}{\partial z_i^1} = \frac{1}{\tau}\Big[-(1 - p_{+})\, z_i^2 + \sum_{j \neq i} p_{j}\, z_j\Big], \tag{2}$$

where $p_+$ and $p_j$ are the softmax probabilities that Eq. 1 assigns to the positive $z_i^2$ and the negative $z_j$, respectively.

Eq. 2 reveals that the representation $z_i^1$ of augmented view $x_i^1$ depends on the representation $z_i^2$ of augmented view $x_i^2$, and vice versa. This is how alignment of positive pairs happens in contrastive learning: the two augmented views $x_i^1$ and $x_i^2$ provide a supervision signal to each other. For example, when $x_i^2$ is a faulty view (i.e., its underlying signal is drawn from $\mathcal{D}_{e'}$ with $\mathcal{D}_{e'} \neq \mathcal{D}_e$), $z_i^2$ provides a false supervision signal that pushes $z_i^1$ to encode the faulty signal. In such a case, maximizing $I(z_i^1; z_i^2)$ means the encoder $f_\theta$ extracts faulty or irrelevant information shared between $x_i^1$ and $x_i^2$, leading to a faulty alignment. This analysis also applies when $x_i^1$ is a faulty view, or when both $x_i^1$ and $x_i^2$ are faulty views. Moreover, we hypothesize that representations from faulty positive pairs often exhibit low similarity (i.e., small $s(z_i^1, z_i^2)$), as their temporal patterns are different. This means the encoder will place a larger gradient on faulty positive pairs, which exacerbates the faulty alignment.
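For reference, a short derivation sketch of Eq. 2 from Eq. 1 (my own reconstruction; the paper's Appendix A.4 gives the full derivation, and the gradient here is taken with respect to the normalized representation):

% Rewrite Eq. 1 as a log-sum-exp over the positive z_i^2 and the negatives z_j,
% then differentiate with respect to z_i^1.
\begin{align*}
\ell_i &= -\tfrac{1}{\tau}\, s(z_i^1, z_i^2)
          + \log \sum_{k} \exp\!\big(\tfrac{1}{\tau}\, s(z_i^1, z_k)\big),
          \qquad z_k \in \{z_i^2\} \cup \{z_j\}_{j \neq i}, \\
\frac{\partial \ell_i}{\partial z_i^1}
       &= -\tfrac{1}{\tau}\, z_i^2 + \tfrac{1}{\tau} \sum_{k} p_k\, z_k
        \;=\; \tfrac{1}{\tau}\Big[ -(1 - p_{+})\, z_i^2 + \sum_{j \neq i} p_j\, z_j \Big],
          \qquad p_k = \frac{\exp\!\big(s(z_i^1, z_k)/\tau\big)}{\sum_{k'} \exp\!\big(s(z_i^1, z_{k'})/\tau\big)}.
\end{align*}
% The first term shows that z_i^2 directly supervises z_i^1; when the pair is hard to
% align (small s(z_i^1, z_i^2), hence small p_+), this term, and thus the gradient, is larger.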

Empirical evidence

  • We pre-define three types of positive pair in our simulated dataset: normal positive pair, noisy positive pair, and faulty positive pair.
  • In order to simulate noisy positive pairs, we assume that the noise level of a signal depends on its Signal-to-Noise Ratio (SNR). By adjusting the SNR, we can generate time series with varying levels of noise, from clean (i.e., SNR > 1) to extremely noisy (i.e., SNR < 1); a generation sketch follows after this list.
  • To simulate faulty positive pairs, we randomly select a portion of data and alter their temporal features to produce faulty views.
  • We find that noisy positive pairs exhibit a relatively small loss and variance throughout the training process (the green cluster).
  • In contrast, faulty positive pairs are often associated with large contrastive loss and small variance during training (the orange cluster), which implies that the faulty positive pair is difficult to align.
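A minimal sketch of how noisy instances at a chosen SNR could be generated for such a simulation (illustrative only; the paper's simulated dataset and augmentations are not reproduced here):

import numpy as np

def add_noise_at_snr(signal, snr):
    # Add white Gaussian noise so that signal_power / noise_power equals snr.
    # snr > 1 keeps the series relatively clean; snr < 1 lets noise dominate,
    # mimicking candidates for noisy positive pairs.
    signal_power = np.mean(signal ** 2)
    noise = np.random.randn(*signal.shape) * np.sqrt(signal_power / snr)
    return signal + noise

t = np.linspace(0, 4 * np.pi, 500)
clean = np.sin(t)                                   # toy "true signal"
clean_instance = add_noise_at_snr(clean, snr=10.0)  # SNR > 1: clean
noisy_instance = add_noise_at_snr(clean, snr=0.2)   # SNR < 1: extremely noisy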

Dynamic Bad Pair Mining

Graphical Description

Code

 
import numpy as np
from scipy.stats import norm

#* excerpt from the DBPM training step; self.config, memory_module, index,
#* x1/x2, and loss_ntxent are defined elsewhere in the trainer

#* calculate the weighted average training loss for each instance
#* (later epochs receive larger weights, so recent behavior counts more)
epoch_weight = [(1 + i) / self.config['trainer']['max_epochs'] for i in range(epoch)]
instance_mean = {k: np.mean(np.array(v) * epoch_weight)
                 for k, v in sorted(memory_module.items(), key=lambda item: item[1])}

mu = np.mean(list(instance_mean.values()))
sd = np.std(list(instance_mean.values()))

#* global statistic: a Gaussian fitted to the per-instance mean losses
gaussian_norm = norm(mu, sd)

#* thresholds for noisy positive pairs (small loss) and faulty positive pairs (large loss)
np_bound = mu - self.config['correction']['cut_np'] * sd
fp_bound = mu + self.config['correction']['cut_fp'] * sd

#* identify potential bad positive pairs from their historical mean loss
np_index = [k for k in instance_mean.keys() if instance_mean[k] <= np_bound]
fp_index = [k for k in instance_mean.keys() if instance_mean[k] >= fp_bound]

#* encode both augmented views and compute the per-pair contrastive loss
z1 = self.head(self.encoder(x1))
z2 = self.head(self.encoder(x2))

pos_loss = loss_ntxent([z1, z2], self.device)

#* update the memory module with the current per-pair losses
for i in range(len(index)):
    memory_module[index[i].cpu().item()].append(pos_loss[i].cpu().item())

#* DBPM re-weighting process: after the warm-up epochs, down-weight identified
#* bad pairs by the Gaussian density of their current loss value
if epoch >= self.config['trainer']['warm_epochs']:
    l = pos_loss.detach().cpu()
    w = gaussian_norm.pdf(l)
    for i in range(len(index)):
        _id = index[i].cpu().item()
        if _id in np_index:
            #* penalize potential noisy positive pair
            pos_loss[i] *= w[i]
        elif _id in fp_index:
            #* penalize potential faulty positive pair
            pos_loss[i] *= w[i]
 

Experiments

  • RINCE penalizes only pairs with a large contrastive loss; under this paper's assumptions, this can be seen as applying a penalty only to faulty positive pairs.

Conclusion

  • In this paper, we investigated the bad positive pair problem in time series contrastive learning.
  • We theoretically and empirically analyzed how noisy positive pairs and faulty positive pairs impair the quality of time series representation learned from contrastive learning.
  • We further proposed a Dynamic Bad Pair Mining algorithm (DBPM) to address this problem and verified its effectiveness on four real-world datasets.