Data Quality Improvement – Conditional Functional Dependency (CFD)

To fulfil the promise I made before, I dedicate this blog post to the topic of Conditional Functional Dependency (CFD). The reason for devoting a whole post to it is that CFD is one of the most promising types of constraint for detecting and repairing inconsistencies in a dataset. CFDs allow context-dependent quality rules to be identified automatically, which makes them a promising tool for automatic data quality rule discovery. CFD was originally developed by Professor Wenfei Fan from the University of Edinburgh. I have listed some of his papers on CFD [1][4][5] in the reference section at the end of this post.

Conditional Functional Dependency (CFD)

CFD is an extension of traditional functional dependencies (FDs). An FD is a class of constraint in which the values of a set of columns X determine the values of another set of columns Y, written as X → Y. FDs were developed mainly for schema design [1] and are required to hold on an entire relation, which makes them less useful for detecting data inconsistencies in real-world datasets. CFD extends the FD by incorporating bindings of semantically related values, which makes it capable of expressing the semantics of data that are fundamental to data cleaning [1].

Here I will borrow the classic ‘cust’ relation example from [1], which I find is possibly the simplest way to explain the CFD concept. Figure 1 shows an instance r0 of the ‘cust’ relation that specifies a customer in terms of the customer’s phone (country code (CC), area code (AC), phone number (PN)), name (NM), and address (street (STR), city (CT), zip code (ZIP)).

Figure 1. An instance of the cust relation

Traditional Functional Dependency (FD)

From the customer records shown in Figure 1, we can see that all the customers with the same country code (CC) and area code (AC) are also located in the same city (CT). For example, all customers with CC ‘01’ and AC ‘908’ are located in the CT ‘MH’ (t1, t2, and t4). Here we have the functional dependency f1: [CC, AC] → [CT], meaning that the CC and the AC determine the value of CT. As this dependency holds on the entire relation, it is a traditional functional dependency. In addition, we can find another traditional functional dependency in the relation, f2: [CC, AC, PN] → [STR, CT, ZIP], which states that the phone country code, area code, and number determine the street name, city, and zip code.
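To make the FD check concrete, here is a minimal sketch in Python using pandas. The cust sample below is hypothetical and only loosely based on Figure 1 (the names, phone numbers, and some of the codes are invented for illustration), and fd_holds is a helper of my own: an FD X → Y holds if every group of tuples sharing the same X values contains exactly one distinct value for each attribute of Y.

```python
import pandas as pd

# A small hypothetical sample loosely based on the cust relation in Figure 1.
# (Names, phone numbers, and some codes are invented for illustration.)
cust = pd.DataFrame([
    {"CC": "01", "AC": "908", "PN": "1111111", "NM": "Mike", "STR": "Tree Ave.", "CT": "MH",  "ZIP": "07974"},
    {"CC": "01", "AC": "908", "PN": "1111111", "NM": "Rick", "STR": "Tree Ave.", "CT": "MH",  "ZIP": "07974"},
    {"CC": "01", "AC": "212", "PN": "2222222", "NM": "Joe",  "STR": "5th Ave.",  "CT": "NYC", "ZIP": "01202"},
    {"CC": "01", "AC": "908", "PN": "3333333", "NM": "Jim",  "STR": "Elm Str.",  "CT": "MH",  "ZIP": "07974"},
    {"CC": "44", "AC": "131", "PN": "4444444", "NM": "Ben",  "STR": "High St.",  "CT": "EDI", "ZIP": "EH4 1DT"},
    {"CC": "44", "AC": "131", "PN": "5555555", "NM": "Ian",  "STR": "High St.",  "CT": "EDI", "ZIP": "EH4 1DT"},
])

def fd_holds(df, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on every tuple of df."""
    # The FD holds iff every LHS group contains exactly one distinct value
    # for each RHS attribute.
    return bool((df.groupby(lhs)[rhs].nunique() <= 1).all(axis=None))

print(fd_holds(cust, ["CC", "AC"], ["CT"]))                       # True: f1 holds on the sample
print(fd_holds(cust, ["CC", "AC", "PN"], ["STR", "CT", "ZIP"]))   # True: f2 holds as well
```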

Conditional Functional Dependency (CFD)

Let’s take a look at another example, f: [ZIP] → [STR]. The customers (t5 and t6) with the same ZIP code ‘EH4 1DT’ do have the same STR value ‘High St.’. Now let’s check the dependency on the entire relation for all customers. The dependency applies to t1 and t2, which have ZIP ‘07974’ and STR ‘Tree Ave.’. However, it does not apply to t4, which has ZIP ‘07974’ but STR ‘Elm Str.’. In this case, the ZIP code ‘07974’ cannot determine a unique STR value. Therefore, the constraint [ZIP] → [STR] does not hold on the entire relation, so it is not a functional dependency. However, the constraint [ZIP] → [STR] does hold for the customers with country code (CC) ‘44’, i.e. those with a UK address. In other words, the postcode can determine the street in the UK address system but not in the US address system, and the constraint [ZIP] → [STR] only holds when the country code is 44. This type of constraint is a conditional functional dependency (CFD), which can be notated as ([CC, ZIP] → STR, (44, _ || _ )), where [CC, ZIP] → STR refers to the functional dependency and (44, _ || _ ) specifies the condition, i.e. when CC is 44, ZIP uniquely determines STR. The notation of a CFD can be generalised as (X → A, tp), where X → A is an FD and tp is a tuple pattern over the attributes in X and A that defines the subset of tuples on which the FD holds.
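A minimal way to check such a rule programmatically is to restrict the relation to the tuples that match the constants in the pattern tuple and then test the embedded FD on that subset only. The sketch below reuses the hypothetical cust sample and the fd_holds helper from the earlier snippet; it only handles LHS constants and ‘_’ wildcards, so it is an illustrative sketch rather than a full CFD checker.

```python
def cfd_holds(df, lhs, rhs, pattern):
    """Check a CFD (lhs -> rhs, pattern), where pattern maps each LHS
    attribute to either a constant or '_' (any value)."""
    subset = df
    for attr, value in pattern.items():
        if value != "_":                      # keep only tuples matching the constant
            subset = subset[subset[attr] == value]
    # The embedded FD only needs to hold on the matching subset.
    return fd_holds(subset, lhs, rhs)

# ([CC, ZIP] -> STR, (44, _ || _)): ZIP determines STR, but only where CC = 44.
print(cfd_holds(cust, ["CC", "ZIP"], ["STR"], {"CC": "44", "ZIP": "_"}))  # True
print(fd_holds(cust, ["ZIP"], ["STR"]))  # False: the plain FD does not hold overall
```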

Constant Conditional Functional Dependency (CCFD)

As mentioned above, a CFD can be expressed as (X → A, tp), where tp is a tuple pattern, such as the earlier example (44, _ || _ ), in which the symbol ‘||’ separates the left-hand side (LHS) from the right-hand side (RHS) of the dependency and the symbol ‘_’ represents any possible value, i.e. ‘_’ plays the role of a variable. A special case of CFD arises when all the attributes defining the tuple pattern are constants; such a rule is called a constant conditional functional dependency (CCFD). For example, for the CFD ([CC, AC] → CT, (01, 908 || MH)), all the attributes defining the tuple pattern are constants: CC = 01, AC = 908, CT = MH. In plain English, a customer’s phone country code 01 and phone area code 908 determine that the city where the customer is located is MH. Compared to general CFDs, CCFDs attract more attention from the data quality community, as a CCFD expresses the semantics at the most detailed level and is easy to use for checking data consistency.
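For the constant case the check is even simpler: every tuple matching the LHS constants must carry the RHS constant. A small sketch, again reusing the hypothetical cust sample from above:

```python
def ccfd_holds(df, lhs_constants, rhs_attr, rhs_constant):
    """Check a constant CFD: all tuples matching the LHS constants
    must have the given constant value on the RHS attribute."""
    subset = df
    for attr, value in lhs_constants.items():
        subset = subset[subset[attr] == value]
    return bool((subset[rhs_attr] == rhs_constant).all())

# ([CC, AC] -> CT, (01, 908 || MH)): customers with CC 01 and AC 908 are in city MH.
print(ccfd_holds(cust, {"CC": "01", "AC": "908"}, "CT", "MH"))  # True
```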

CFD Discovery

Compared to FDs, CFDs are more effective in detecting and repairing inconsistencies in data [2][3]. CFD has demonstrated its potential for commercial adoption in the real world. However, to use it commercially, effective and efficient methods for discovering the CFDs in a dataset are required. Even for a relatively small dataset, the workload of manually identifying and validating all the CFDs is overwhelming, so the CFD discovery process has to be automated. A number of CFD discovery algorithms [2][5][6] have been developed since the concept of CFD was introduced.

In this section, I will introduce the CFD discovery algorithm developed by Chiang & Miller [2]. Chiang & Miller [2] have shown this method to be effective at capturing semantic data quality rules that enforce a broad range of domain constraints. In addition, as redundant candidates are pruned as early as possible, the search space for the set of minimal CFDs is reduced, and so is the computation time for discovering the CFDs.

Chiang & Miller used 12 double-column pages in their paper [2] to elaborate their algorithm. That paper is informative and worth reading in detail if you are interested in data quality rule discovery. In the rest of this blog post, I will try my best to use simple language to explain how this algorithm works.

Firstly, let’s assume we have a relation with four attributes, A, B, C, D, and we want to find the CFDs in the relation. The result set of CFDs needs to be minimal, with redundant CFDs disregarded. In addition, we expect the algorithm to take as little computation time as possible. From the four attributes A, B, C, D in the relation, the search space for the CFDs can be illustrated in an attribute search lattice (Figure 2). Each node in the lattice represents an attribute set; for example, ‘AC’ represents the attribute set consisting of attribute ‘A’ and attribute ‘C’. A directed edge between two nodes represents a candidate CFD; for example, the edge (AC, ABC) represents ([A, C] → B, (x, x || x)), where the pattern tuples are unknown and need to be mined by the algorithm.

Figure 2: Attribute search lattice

The candidate CFDs are evaluated by traversing the lattice using a breadth-first search (BFS), i.e. starting at the root and visiting all nodes at the present depth level before moving to the next depth level. In our example, we first traverse the first level (k = 1), which consists of the single-attribute sets ([A], [B], [C], [D]), followed by the second level (k = 2), which consists of the 2-attribute sets ([A, B], [A, C], [A, D], [B, C], [B, D], [C, D]), and so on, until level k = total levels − 1 is reached or all CFDs in a minimal set have been found.
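To show what this search space looks like, the sketch below (my own simplification, not Chiang & Miller’s implementation) enumerates the lattice level by level and prints the candidate embedded FDs X → A, i.e. the edges (X, X ∪ {A}); the pattern tuples for each candidate would still have to be mined afterwards.

```python
from itertools import combinations

attributes = ["A", "B", "C", "D"]

def lattice_candidates(attrs):
    """Enumerate candidate embedded FDs level by level (breadth-first):
    at level k, each k-attribute set X is paired with every attribute A
    outside X, giving the candidate X -> A (the edge (X, X + {A}) in the lattice)."""
    for k in range(1, len(attrs)):              # levels 1 .. total levels - 1
        for lhs in combinations(attrs, k):      # all k-attribute sets at this level
            for a in attrs:
                if a not in lhs:
                    yield list(lhs), a

for lhs, rhs in lattice_candidates(attributes):
    print(f"[{', '.join(lhs)}] -> {rhs}")
```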

To optimise efficiency, not all nodes in the lattice are visited. Instead, the algorithm prunes nodes that are supersets of already discovered rules, as well as nodes that do not satisfy the support threshold. In this way, the search space is reduced by skipping the evaluation of descendant nodes. For example, when the candidate ([A, C] → B, (A=’x’, _ || _)) is found to be a CFD, there is no point in evaluating the candidate ([A, C, D] → B, (A=’x’, _, _ || _)), as the CFD ([A, C] → B, (A=’x’, _ || _)) already covers its semantics.
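The pruning idea can be sketched as a simple redundancy check (again a simplification of the bookkeeping in [2]): once a rule has been discovered, any candidate with the same RHS whose LHS is a superset of the discovered rule’s LHS can be skipped.

```python
# A rule already discovered, e.g. the embedded FD of ([A, C] -> B, (A='x', _ || _)).
discovered = [(frozenset({"A", "C"}), "B")]

def is_redundant(lhs, rhs, discovered_rules):
    """A candidate is redundant if an already discovered rule with the same
    RHS has an LHS that is a subset of the candidate's LHS."""
    return any(rhs == r and l <= set(lhs) for l, r in discovered_rules)

print(is_redundant(["A", "C", "D"], "B", discovered))  # True: prune this candidate
print(is_redundant(["A", "D"], "B", discovered))       # False: still worth evaluating
```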

To evaluate whether or not a candidate in the lattice is a valid CFD, we can identify the values of the left-hand side (LHS) of the candidate and check whether the same LHS values always map to the same right-hand side (RHS) value. In programming terms, we group the tuples by their LHS attribute values: for a valid CFD, the tuples in an LHS group do not split into two or more RHS groups. Generally speaking, the more tuples fall into a group that represents a valid CFD, the more preferable this CFD is. A support threshold can be imposed on the number of tuples a CFD needs to hold on before it qualifies as a valid CFD.
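Here is a minimal sketch of this grouping test with a support threshold, applied to the hypothetical cust sample from earlier; it mines only constant patterns for a single candidate and is far simpler than the full algorithm in [2].

```python
def mine_constant_patterns(df, lhs, rhs, support):
    """For a candidate lhs -> rhs, return the LHS value combinations whose
    tuples all agree on one RHS value and that cover at least `support` tuples."""
    rules = []
    for lhs_values, group in df.groupby(lhs):
        if not isinstance(lhs_values, tuple):   # single-attribute LHS
            lhs_values = (lhs_values,)
        # Valid pattern: the LHS group does not split into two or more RHS values,
        # and it covers enough tuples to meet the support threshold.
        if group[rhs].nunique() == 1 and len(group) >= support:
            rules.append((dict(zip(lhs, lhs_values)), group[rhs].iloc[0]))
    return rules

# Constant patterns for [CC, AC] -> CT covering at least 2 tuples,
# e.g. ([CC, AC] -> CT, (01, 908 || MH)) on the sample data.
print(mine_constant_patterns(cust, ["CC", "AC"], "CT", support=2))
```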

The tuple groups that fail the CFD validity test can still be refined by considering additional variable and conditional attributes.

Figure 3: A portion of the attribute search lattice

For example, as shown in Figure 3, assume that no CFD can be materialised for the candidate [A, C] → B of edge (AC, ABC). We can generate a new candidate at the next level of the lattice by considering an additional attribute such as D, giving the new candidate ([A, C, D] → B), shown as the edge (X’, Y’) in Figure 3. If the new candidate still cannot be materialised into a valid CFD after adding a variable attribute, we can condition on an additional attribute; for example, we can condition on attribute D on top of an existing conditional attribute such as A. In Chiang & Miller’s algorithm, these new candidates are added to a global candidate list and the corresponding lattice edges are marked. The algorithm only searches the nodes with marked edges, to ensure that only minimal rules are returned.
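The two refinement moves described above can be sketched as follows; this is only an illustration of candidate generation and ignores the global candidate list and edge marking used in [2].

```python
def refine_candidate(lhs, rhs, conditional, all_attrs):
    """Follow-up candidates for a failed candidate lhs -> rhs, where the
    attributes in `conditional` are already bound to constants: either widen
    the LHS with a new variable attribute, or turn a variable LHS attribute
    into an additional conditional one."""
    widened = [(lhs + [a], rhs, conditional) for a in all_attrs
               if a not in lhs and a != rhs]
    conditioned = [(lhs, rhs, conditional + [a]) for a in lhs
                   if a not in conditional]
    return widened + conditioned

# Failed candidate [A, C] -> B, currently conditioned on A:
for candidate in refine_candidate(["A", "C"], "B", ["A"], ["A", "B", "C", "D"]):
    print(candidate)
```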

References

[1] P. Bohannon, W. Fan, F. Geerts, X. Jia and A. Kementsietsidis, “Conditional Functional Dependencies for Data Cleaning,” 2007 IEEE 23rd International Conference on Data Engineering, 2007, pp. 746-755, doi: 10.1109/ICDE.2007.367920.

[2] F. Chiang and R. J. Miller, “Discovering data quality rules,” Proc. VLDB Endow., vol. 1, pp. 1166-1177, 2008.

[3] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Consistency and accuracy,” in VLDB, 2007.

[4] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for capturing data inconsistencies,” TODS, vol. 33, no. 2, June, 2008.

[5] W. Fan, F. Geerts, J. Li and M. Xiong, “Discovering Conditional Functional Dependencies” in IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 5, pp. 683-698, May 2011, doi: 10.1109/TKDE.2010.154.

[6] M. Li, H. Wang and J. Li, “Mining conditional functional dependency rules on big data,” in Big Data Mining and Analytics, vol. 3, no. 1, pp. 68-84, March 2020, doi: 10.26599/BDMA.2019.9020019.

[7] V.S. Santana, F.S. Lopes, (2019) Method for the Assessment of Semantic Accuracy Using Rules Identified by Conditional Functional Dependencies. In: Garoufallou E., Fallucchi F., William De Luca E. (eds) Metadata and Semantic Research. MTSR 2019. Communications in Computer and Information Science, vol 1057. Springer, Cham. https://doi.org/10.1007/978-3-030-36599-8_25
