
suai.ru/our-contacts

quantum machine learning

based on quantum probability theory [19] [25]. It provides a promising new solution to the data fusion problem by constructing a model that has (a) a single finite state vector that lies within a low-dimensional vector space, and (b) a set of measurement operators that represent the p measurements. In this way, we can achieve a compressed, coherent, and interpretable representation of the p variables that generate the complex collection of K tables, even when no standard p-way joint distribution exists. In a Hilbert space model, the state vector represents respondents' initial tendencies to select responses to each of the p measurements; the measurement operators describe the inter-relations between the p measurements (independent of the initial state of the respondents).1
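As a toy numerical illustration of these two ingredients (our own sketch, not the model fitted later in this article), assume a 4-dimensional real Hilbert space, a unit-length state vector, and a projector for the "yes" response of one binary measurement; the response probability is the squared length of the projected state:

```python
import numpy as np

# Toy sketch (illustrative only): a state vector psi in a 4-dimensional
# Hilbert space and a projector M_yes for the "yes" answer to one
# binary measurement.
psi = np.array([0.5, 0.5, 0.5, 0.5])        # unit-length initial state
M_yes = np.diag([1.0, 1.0, 0.0, 0.0])       # projector onto the "yes" subspace

p_yes = np.linalg.norm(M_yes @ psi) ** 2    # probability = squared projection
psi_after = (M_yes @ psi) / np.sqrt(p_yes)  # revised state after answering "yes"

print(round(p_yes, 6))  # 0.5
```

Sequential measurements are modeled by applying the next measurement's projector to the revised state, which is how non-commuting operators can produce order effects.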

HSM models are similar to traditional multi-dimensional scaling (MDS) models [32], but also different from them in important respects. Like traditional MDS models, HSM models are based on similarity relations between entities located within a vector space. However, traditional MDS models define the similarity relations by inner products between vectors in a real vector space, whereas HSM models define similarity relations by projections onto subspaces of a complex vector space. Also, MDS models are designed to account for a single 2-way symmetric similarity matrix, whereas HSM models can be applied to multiple similarity matrices (e.g., when the similarity relation is asymmetric; see [27]).

The article is organized as follows. First, we provide an artificial data example that illustrates how the consistency requirements of a single p-way joint distribution can be violated. Second, we describe the general procedures for building HSM models. Third, we illustrate the general method by applying the principles to the artificial data set. Finally, we apply the theory to construct an HSM model for a real data set.

2. An Artificial Example

In this section we present an artificial example that serves to illustrate several ways that a joint probability model can fail. (Later we present a real data example, but the artificial example provides a clearer illustration

1Technically, a Hilbert space is a complete inner product vector space defined on a complex field (although it can be restricted to the real field). The norm for the vector space is determined by the inner product. Our vector spaces are of finite dimension, and so they are always complete.


of all of the problems). Suppose that the relations among three variables are investigated, labeled A, B, C. For example, these variables could represent ratings about the Adeptness (yes, no), Brilliance (low, medium, high), and Confidence (1, 2, 3, 4) of a political candidate reported on various large social media sources. Suppose that six contingency tables, shown together in Table 1, are collected from various sources. The table labeled p(C = ck) is a 1-way table containing the relative frequency of ratings for 4 increasing levels of Confidence obtained from one source. For example, the relative frequency of the second level of confidence equals p(C = c2) = .2788. Table p(A = ai, B = bj) is a 2 × 3 contingency table containing the relative frequencies of responses to Adeptness (yes, no) and then Brilliance (low, medium, high) obtained from another source. Table p(A = ai, C = ck) is a 2 × 4 contingency table containing the relative frequencies of responses to Adeptness and then Confidence presented in the AC order. For example, the relative frequency of saying yes to Adeptness and then choosing Confidence level 2 equals p(A = a1, C = c2) = .0312. Table p(C = ck, A = ai) is a table produced when the attributes (Confidence, Adeptness) were asked in the CA order. For example, the relative frequency of choosing Confidence level 2 and then saying yes to Adeptness equals p(C = c2, A = a1) = .0297. (The CA table is arranged in the same format as the AC table so that they can be directly compared.) The table p(B = bj, C = ck) is a 3 × 4 contingency table containing the relative frequencies of responses to Brilliance and then Confidence in the BC order; and the table p(C = ck, B = bj) is a table produced by the opposite CB order. (Again it is displayed in the same format to facilitate comparison.) Each of the six tables forms a context for judgments.

2.1. Does a joint distribution exist?

The following question can be asked about Table 1: does a single 3-way joint probability distribution of the observed variables exist that can reproduce Table 1? The 3-way joint probability distribution is defined by 3 discrete random variables, A with 2 values, B with 3 values, and C with 4 values, which generate 2 · 3 · 4 = 24 latent joint probabilities that sum to one: π(A = ai ∩ B = bj ∩ C = ck), where, for example, A is a random variable with values a1 for yes and a2 for no, and similar definitions hold for the other two random variables. For example, the relative frequency of (A = a2, C = c4) in the table p(A = ai, C = ck) is predicted by the marginal p(A = a2, C = c4) = Σj π(A = a2 ∩ B = bj ∩ C = c4), and the relative frequency of (C = c4) is predicted by the marginal p(C = c4) = Σi,j π(A = ai ∩ B = bj ∩ C = c4).
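If a genuine 3-way joint distribution did exist, every observed table would be a marginal of a single 2 × 3 × 4 array. A short sketch with a hypothetical (randomly generated) joint π illustrates the law-of-total-probability predictions; the array names are our own:

```python
import numpy as np

# Hypothetical 3-way joint pi(A = ai, B = bj, C = ck): a 2 x 3 x 4 array
# of latent probabilities that sum to one (randomly generated here).
rng = np.random.default_rng(0)
pi = rng.random((2, 3, 4))
pi /= pi.sum()

# Marginals predicted by the law of total probability:
p_AC = pi.sum(axis=1)        # p(A = ai, C = ck) = sum_j  pi(ai, bj, ck)
p_C = pi.sum(axis=(0, 1))    # p(C = ck)         = sum_ij pi(ai, bj, ck)

# e.g. the predicted p(A = a2, C = c4) and p(C = c4):
print(p_AC[1, 3], p_C[3])
```

Because both marginals come from the same array π, they are automatically consistent with each other; it is exactly this consistency that the observed tables violate.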


Table 1: Six different contingency tables produced by answers to attributes A, B, C.

p(C = ck):
          1        2        3        4
          0.2186   0.2788   0.2551   0.2475

p(A = ai, C = ck):
          1        2        3        4
a1        0.0388   0.0312   0.2675   0.3201
a2        0.1554   0.1506   0.0182   0.0183

p(B = bj, C = ck):
          1        2        3        4
b1        0.1266   0.0476   0.0049   0.0165
b2        0.0915   0.0911   0.2158   0.2167
b3        0.1086   0.0320   0.0133   0.0354

p(A = ai, B = bj):
          b1       b2       b3
a1        0.0721   0.5777   0.0078
a2        0.1235   0.0374   0.1815

p(C = ck, A = ai) (arranged in the same format as the AC table for direct comparison):
          1        2        3        4
a1        0.0233   0.0297   0.2279   0.2212
a2        0.1953   0.2491   0.0272   0.0264

p(C = ck, B = bj) (arranged in the same format as the BC table):
          1        2        3        4
b1        0.0680   0.0762   0.0581   0.0713
b2        0.1089   0.1391   0.1273   0.0924
b3        0.0416   0.0635   0.0697   0.0838

Note that this 3-way joint distribution is completely general, without any type of independence restrictions.
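The rounded entries of Table 1 can be entered directly (the array names are our own) and sanity-checked: each table is a probability distribution over its own context, so each should sum to one up to the rounding of the printed values:

```python
import numpy as np

# Table 1 entries, rounded to four decimals as printed.
p_C = np.array([0.2186, 0.2788, 0.2551, 0.2475])
p_AC = np.array([[0.0388, 0.0312, 0.2675, 0.3201],
                 [0.1554, 0.1506, 0.0182, 0.0183]])
p_CA = np.array([[0.0233, 0.0297, 0.2279, 0.2212],
                 [0.1953, 0.2491, 0.0272, 0.0264]])
p_BC = np.array([[0.1266, 0.0476, 0.0049, 0.0165],
                 [0.0915, 0.0911, 0.2158, 0.2167],
                 [0.1086, 0.0320, 0.0133, 0.0354]])
p_CB = np.array([[0.0680, 0.0762, 0.0581, 0.0713],
                 [0.1089, 0.1391, 0.1273, 0.0924],
                 [0.0416, 0.0635, 0.0697, 0.0838]])
p_AB = np.array([[0.0721, 0.5777, 0.0078],
                 [0.1235, 0.0374, 0.1815]])

# Each context is a probability distribution in its own right.
for table in (p_C, p_AC, p_CA, p_BC, p_CB, p_AB):
    assert abs(table.sum() - 1.0) < 1e-3   # within rounding error
```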

The answer to our question is that no single 3-way joint distribution of the three observed variables can reproduce Table 1. First of all, the 3-way distribution requires the marginal distribution of a single random variable to be invariant across contexts. Marginal invariance is based on the law of total probability. This requirement fails (e.g., p(C = c2) = .2788 from the 1-way table, which is larger than the marginal computed from the 2-way table, p(A = a1, C = c2) + p(A = a2, C = c2) = .0312 + .1506 = .1818). A second problem is that the order in which questions are asked changes the 2-way distributions for some pairs (e.g., the distribution for the context BC is not the same as the distribution for the context CB). Order effects violate the commutative property required by the joint probability model.
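The marginal invariance failure for p(C = c2) can be reproduced directly from the rounded Table 1 entries:

```python
# p(C = c2) measured in two different contexts (rounded Table 1 entries).
p_c2_alone = 0.2788              # from the 1-way table p(C = ck)
p_c2_from_AC = 0.0312 + 0.1506   # p(A=a1, C=c2) + p(A=a2, C=c2)

print(round(p_c2_from_AC, 4))    # 0.1818: the two contexts disagree
```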

The tables also violate another consistency requirement, called the Leggett-Garg inequality [21], which concerns the correlations between pairs of variables required by a single 3-way joint distribution. To illustrate this in a simple manner, consider the three tables shown in Table 2. These tables were formed by defining new random variables X, Y, Z as follows: X = A; (Y = y1) = (B = b1 ∪ B = b2) and (Y = y2) = (B = b3); (Z = z1) = (C = c1 ∪ C = c2) and (Z = z2) = (C = c3 ∪ C = c4). The Leggett-Garg inequality implies the following restriction on the 2 × 2 joint probabilities required by the 3-way joint probability model (see Appendix for a simple proof of this inequality):
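Table 2 can be derived mechanically from Table 1 by collapsing categories with 0/1 aggregation matrices; the following sketch (array names are our own) reproduces its entries:

```python
import numpy as np

# Rounded Table 1 entries needed for the coarsening.
p_AB = np.array([[0.0721, 0.5777, 0.0078],
                 [0.1235, 0.0374, 0.1815]])
p_BC = np.array([[0.1266, 0.0476, 0.0049, 0.0165],
                 [0.0915, 0.0911, 0.2158, 0.2167],
                 [0.1086, 0.0320, 0.0133, 0.0354]])
p_AC = np.array([[0.0388, 0.0312, 0.2675, 0.3201],
                 [0.1554, 0.1506, 0.0182, 0.0183]])

# X = A; Y collapses {b1, b2} vs {b3}; Z collapses {c1, c2} vs {c3, c4}.
Y_map = np.array([[1.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])        # rows: y1, y2
Z_map = np.array([[1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.0]])   # rows: z1, z2

p_XY = p_AB @ Y_map.T
p_YZ = Y_map @ p_BC @ Z_map.T
p_XZ = p_AC @ Z_map.T

print(np.round(p_XY, 4))   # rows x1, x2: [0.6498, 0.0078] and [0.1609, 0.1815]
```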

5

suai.ru/our-contacts

quantum machine learning

Table 2: Three 2 × 2 tables produced from Table 1. The XY table is formed from p(A = ai, B = bj), the YZ table from p(B = bj, C = ck), and the XZ table from p(A = ai, C = ck).

p(X = xi, Y = yj):
        y1       y2
x1      0.6498   0.0078
x2      0.1609   0.1815

p(Y = yj, Z = zk):
        z1       z2
y1      0.3568   0.4539
y2      0.1406   0.0487

p(X = xi, Z = zk):
        z1       z2
x1      0.0700   0.5876
x2      0.3060   0.0365

p(X ≠ Y) + p(Y ≠ Z) − p(X ≠ Z) ≥ 0,    (1)

where, for example, p(X ≠ Y) = p(X = x1 ∩ Y = y2) + p(X = x2 ∩ Y = y1), which is simply the sum of the probabilities that the attributes produce different yes-no answers. Using the data in Table 2, the Leggett-Garg value equals −.1303, which is below the zero bound required by the 3-way joint probability model.
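Inequality (1) can be checked numerically from Table 2; the off-diagonal mass of each 2 × 2 table is the probability of discordant answers. With the rounded entries the value comes out as −.1304, matching the reported −.1303 up to rounding:

```python
import numpy as np

# The three 2 x 2 tables of Table 2.
p_XY = np.array([[0.6498, 0.0078],
                 [0.1609, 0.1815]])
p_YZ = np.array([[0.3568, 0.4539],
                 [0.1406, 0.0487]])
p_XZ = np.array([[0.0700, 0.5876],
                 [0.3060, 0.0365]])

def p_diff(table):
    # Probability that the two attributes produce different answers:
    # the off-diagonal mass of the 2 x 2 table.
    return table[0, 1] + table[1, 0]

lg = p_diff(p_XY) + p_diff(p_YZ) - p_diff(p_XZ)
print(round(lg, 4))   # -0.1304, below the zero bound required by (1)
```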

Other consistency requirements can be shown to apply for larger sets of variables. For example, four 2 × 2 tables (AC, AD, BC, BD) generated from 4 binary variables (A, B, C, D) must satisfy the famous Bell inequalities in order to be reproduced by a 4-way joint distribution of the four observed variables. More general formulations of the linear inequalities required for a p-way joint distribution to reproduce a collection of K tables produced by subsets of the p variables are provided in [15] [16] [12].

2.2. A general test of the joint distribution model

More generally, we can test whether or not a 3-way joint probability distribution can fit the data in Table 1 by estimating all of its parameters from the data to produce a closest fit. The 3-way joint probability model has 24 joint probability parameters, π(A = ai ∩ B = bj ∩ C = ck). Using the law of total probability, these 24 model parameters can be used to predict the 50 cell probabilities in Table 1. Parameter estimation programs can be used to search for the parameters that produce the closest fit according to the Kullback-Leibler (KL) divergence. Denote πn as the predicted cell probability and pn as the observed cell probability. Then the KL divergence is defined as

D = Σn pn · ln(pn/πn).


For sample data, with N observations per table, the KL divergence can be converted into a chi-square statistic G2 = 2 · N · D. The null hypothesis states that the data in Table 1 were generated by a 3-way joint probability model. If the null hypothesis is correct, then G2 has a chi-square distribution with degrees of freedom df = (50 − 6) − (24 − 1) = 21 (the probabilities in each of the 6 tables sum to one, so 6 of the 50 probabilities are linearly constrained; the 24 joint probabilities sum to one, so one of them is linearly constrained). Using G2, we can compute a p-value, which equals the probability of obtaining the observed G2 or greater under the null hypothesis. If it is below the significance level α = .05, then we reject the null hypothesis.
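The D, G2, and df computations can be sketched as follows; the observed and predicted cell probabilities below are hypothetical placeholders, since the actual πn come from the fitting program:

```python
import math

# Hypothetical observed (p) and model-predicted (pi) cell probabilities.
p = [0.30, 0.25, 0.25, 0.20]
pi = [0.25, 0.25, 0.25, 0.25]

# KL divergence D = sum_n p_n * ln(p_n / pi_n).
D = sum(p_n * math.log(p_n / pi_n) for p_n, pi_n in zip(p, pi))

N = 100                     # observations per table, as assumed in the text
G2 = 2 * N * D              # chi-square statistic
df = (50 - 6) - (24 - 1)    # degrees of freedom for the test above

print(df)   # 21
```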

Following these procedures, we fit the 3-way joint probability model to the data in Table 1; the KL divergence equals D = 0.5431. If we assume that each table is based on 100 observations, then G2 = 54.31 and p = .0001, which is a statistically significant difference.

The above test of the joint distribution based on the KL divergence is not limited to this example. The same principles can be extended to p variables that produce K tables of various sizes collected under different contexts.2

The proposed test of a p-way joint distribution to account for a collection of K contingency tables formed by subsets of the p variables does not rule out all Kolmogorov models. A more general (p + q)-way joint distribution can be chosen to reproduce the data tables by using an additional q random variables (see, e.g., [16]). The proposed non-parametric method only tests a p-way joint distribution based on the observed p variables.

3. Empirical evidence from research on document retrieval

Violations of a single joint probability model for the observed variables have been observed in several empirical studies investigating document retrieval.

Question order effects were observed by Bruza and Chang [9]. They collected a large sample of participants from the Amazon Mechanical Turk platform. Five query terms about topics were used to examine question order effects. Participants were told to read a brief description of each topic (e.g., research on emerging branding trends), and then they were asked to rate an article (presumably returned from a web search engine) on two different dimensions (e.g., how relevant is this article for the topic, how interesting is

2See https://arxiv.org/abs/1704.04623 for a different example using 4 variables.
