Challenges and Techniques for Mining Real Clinical Data

Regression analysis and statistical hypothesis testing are commonly used for association and classification of clinical data sets in medical studies. Although such traditional techniques are widely used, they have several shortcomings. For example, when analyzing data sets with a large number of temporal attributes, domain experts often miss important associative attributes in regression analysis because so many of the attributes are correlated. In addition, for rarely occurring diseases or operations, the number of documented cases is usually small, and hypothesis testing becomes ineffective because the sample is too small to yield statistically significant results. We present two case studies to show how data mining techniques [1-7] can be used to remedy these shortcomings.

The first case studied the side effects of drug intake by pregnant women. This study was carried out jointly with physicians from the UCLA Epidemiology Department [8]. A portion of the real clinical data set from the Danish National Birth Cohort (4500 pregnant women with 20 attributes) records the specific drugs taken during pregnancy and their side effects (e.g., preterm birth, malformation, and prenatal complication). The analysis is complicated by the sequence of different types of drugs taken at various periods of the pregnancy (known as trimesters). Guided by the domain experts and using data mining classification and association techniques, we derived goal-oriented rules (the goals being, e.g., types of side effects) with high confidence to locate the set of highly correlated seed attributes. In particular, we studied the side effects of taking antidepressant drugs, as well as the confounders (drinking, smoking, socioeconomic status, psychosocial stress, etc.). Rule hierarchies were formed to explore rules involving these correlated seed attributes. Because data mining and statistical analysis take different approaches, the data mining techniques [8] discovered new rules that traditional statistical analysis had missed. For example, the antidepressant citalopram, taken with alcohol during pregnancy, is associated with a high probability of preterm birth. Further, this combination, taken in a late trimester, carries an even higher risk of preterm birth. These rules were then verified by the epidemiologists via regression analysis on the original data set.
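The idea of a goal-oriented rule with high confidence can be illustrated with a minimal sketch. The records, item names, and thresholds below are invented for illustration and are not taken from the cohort data. The function simply enumerates small antecedent sets and scores them against a fixed outcome, which is far simpler than the maximal-frequent-itemset mining used in [7, 8].

```python
from itertools import combinations

def goal_rules(records, goal, min_support=0.2, min_confidence=0.5, max_len=2):
    """Enumerate rules 'antecedent -> goal' that pass both thresholds.

    records: list of sets of items (a hypothetical encoding of one case each).
    goal: the outcome item every rule must predict.
    Returns (antecedent, support, confidence) tuples, most confident first.
    """
    n = len(records)
    items = sorted({item for r in records for item in r if item != goal})
    rules = []
    for k in range(1, max_len + 1):
        for antecedent in combinations(items, k):
            a = set(antecedent)
            covered = [r for r in records if a <= r]  # cases matching the antecedent
            if not covered:
                continue
            both = sum(1 for r in covered if goal in r)
            support = both / n                 # fraction of all cases with antecedent and goal
            confidence = both / len(covered)   # fraction of matching cases that reach the goal
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, support, confidence))
    return sorted(rules, key=lambda r: (-r[2], -r[1], r[0]))

# Toy records, invented for illustration only.
records = [
    {"citalopram", "alcohol", "preterm"},
    {"citalopram", "alcohol", "preterm"},
    {"citalopram"},
    {"alcohol"},
    {"citalopram", "smoking", "preterm"},
    {"smoking"},
    {"alcohol", "smoking"},
    {"citalopram", "alcohol"},
]
rules = goal_rules(records, "preterm")
for antecedent, support, confidence in rules:
    print(antecedent, round(support, 3), round(confidence, 3))
```

On this toy input the two-item antecedent ranks above the single-item one, which mirrors how a combination of attributes can be more predictive of an outcome than either attribute alone.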

The second case developed reference guidelines for surgeons to select an operation based on a patient's preoperative condition. This project was carried out jointly with the Pediatric Urology Clinic at the UCLA Medical School [7]. More specifically, we studied urinary incontinence in children caused by birth defects. Such problems can usually be relieved by one of four types of surgery: bladder neck repair with or without augmentation, and bladder neck closure with or without augmentation. Surgeons usually base the operation decision on the patient's preoperative conditions, e.g., demographic data (age, gender, etc.), ambulatory status, amount of creatinine in the blood, leak point pressure, and urodynamics (such as the minimum volume of saline infused into the bladder when its pressure reaches 20 cm of water). Decisions are also based on each patient's catheterizing skill, his or her experience of postoperative complications, and an estimate of the final outcome in urinary continence (dry or wet). However, such systematically documented data are not usually readily available. A comprehensive data set was collected by the UCLA Pediatric Urology Clinic from 1995 to 2002, covering 130 patients with 28 attributes. Because of the relatively small sample size (although this is considered a large data set among those on record) and the large number of attributes, traditional hypothesis testing methods did not yield statistically significant results. Therefore, we proposed formulating the problem as one of association (of relevant attributes) and classification (of operation type and outcome) and applying data mining techniques as an alternative approach.

A critical problem here was the partitioning of some of the continuous-valued attributes (e.g., bladder pressure, amount of creatinine in the blood, etc.) into discrete cells for analysis. Partitions chosen by a domain expert can be biased and inconsistent. Yet statistical clustering techniques failed to cluster the values of these variables with statistical significance, given such a small set of samples and such a large number of attributes. Therefore, jointly with a biostatistician, we developed a new hybrid approach for clustering. First, starting from the domain expert's initial partition, we used data mining techniques to derive a small set of highly correlated (key) attributes from rules with high confidence and support. Second, based on this small set of key attributes, a statistical classification technique, CART, was used to determine the optimal partitions for the variables (i.e., the number of cells and their corresponding cell sizes). Data mining based on these attributes with the improved partitions yielded more accurate rules. This methodology enabled us to generate high-quality rules from a small sample, which could not be accomplished by using data mining or statistical classification methods alone.
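The second step can be illustrated with a minimal sketch of how a CART-style criterion picks a cut point for one continuous attribute. It assumes the Gini impurity criterion and finds only a single binary split; applying it recursively to each side would produce further cells. The attribute values and outcomes are invented, and this is not the procedure or data used in the study.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure cell)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best single binary split.

    Candidate thresholds are midpoints between consecutive distinct values.
    The threshold is None if no split improves on the unsplit impurity.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2, score)
    return best

# Hypothetical pressure readings and outcomes, invented for illustration.
pressure = [12, 15, 18, 22, 25, 30, 34, 40]
outcome = ["wet", "wet", "wet", "wet", "dry", "dry", "dry", "dry"]
threshold, impurity = best_split(pressure, outcome)
print(threshold, impurity)
```

Restricting the search to the few key attributes found in the first step keeps the number of candidate splits small relative to the sample size, which is the motivation for the hybrid approach described above.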

Based on the operation type and its outcome, the data mining algorithm [7] can derive the rules by associating the preoperative conditions (attribute values) with the corresponding operation and outcome. Based on the patient's preoperative conditions, these rules can be used by surgeons as a guide in the treatment of their patients.
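As a rough sketch of how such rules might be consulted, the snippet below matches a patient's discretized preoperative conditions against a small rule set and lists the matching rules by confidence. The rule format, attribute names, cell labels, operation placeholders, and confidence values are all hypothetical; none of them comes from the study.

```python
def matching_rules(patient, rules):
    """Return the rules whose conditions all hold for the patient, most confident first."""
    hits = [
        r for r in rules
        if all(patient.get(attr) == value for attr, value in r["if"].items())
    ]
    return sorted(hits, key=lambda r: r["confidence"], reverse=True)

# Hypothetical rules: conditions -> (operation, outcome), with invented confidences.
rules = [
    {"if": {"ambulatory": "yes", "leak_point_pressure": "low"},
     "then": ("operation_A", "dry"), "confidence": 0.85},
    {"if": {"ambulatory": "no"},
     "then": ("operation_B", "dry"), "confidence": 0.70},
    {"if": {"leak_point_pressure": "low"},
     "then": ("operation_C", "wet"), "confidence": 0.60},
]

# Hypothetical patient record after the continuous attributes have been discretized.
patient = {"ambulatory": "yes", "leak_point_pressure": "low", "age_group": "5-10"}

suggestions = matching_rules(patient, rules)
for r in suggestions:
    print(r["then"], r["confidence"])
```

The output is a ranked list for the surgeon to weigh, not a decision; the rules serve as a guide alongside clinical judgment.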

The practice of data mining is very much problem-dependent. Available data mining algorithms provide a starting point, but real-world case studies reveal many unforeseen problems that prevent us from applying a known algorithm directly. Therefore, it is important to work closely with domain experts, as well as statisticians, to find novel techniques for resolving large-scale, complex data mining problems.

Additional information: please refer to the attached slides (ppt)


The author wishes to thank Drs. Lars Pedersen and Jorn Olsen for providing the clinical data from the Danish National Birth Cohort and guiding the focus of the study; Dr. B. Churchill and Dr. Andy Chen for providing the urology clinical data and collaborating in formulating the data mining model; and Professor James W. Sayre for his stimulating discussions in developing the statistical and data mining approach to determining the cell size. This research was supported by the NIH PPG Grant #4442511-33780 and the NSF IIS Grant #97438.


  1. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
  2. B. Liu, M. Hu, and W. Hsu, "Multi-level organization and summarization of the discovered rules," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 2000, Boston, USA.
  3. B. Liu, M. Hu, and W. Hsu, "Intuitive representation of decision trees using general rules and exceptions," Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), July 30 - August 3, 2000, Austin, Texas, USA.
  4. D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: a maximal frequent itemset algorithm for transactional databases," Intl. Conf. on Data Engineering, April 2001.
  5. K. Gouda and M. J. Zaki, "Efficiently mining maximal frequent itemsets," Proc. of the IEEE Intl. Conf. on Data Mining, San Jose, 2001.
  6. Q. Zou, W. W. Chu, and B. Lu, "SmartMiner: A depth-first search algorithm guided by tail information for mining maximal frequent itemsets," Proc. of the IEEE Intl. Conf. on Data Mining, 2002.
  7. Q. Zou, Y. Chen, W. W. Chu, and X. Lu, "Mining association rules from tabular data guided by maximal frequent itemsets," book chapter in "Foundations and Advances in Data Mining," edited by W. W. Chu and T. Y. Lin, Springer, 2005.
  8. Y. Chen, L. H. Pedersen, W. W. Chu, and J. Olsen, "Drug exposure side effects from mining pregnancy data," SIGKDD Explorations, Volume 9, Issue 1, June 2007, Special Issue on Data Mining for Health Informatics, Guest Editors: Raymond Ng and Jian Pei.
  9. Frequent Itemset Mining Implementations Repository,