Drug Exposure Side Effects from Mining Pregnancy Data


More than half of all pregnant women use some sort of medication during pregnancy. The exposure varies from accidental use of known teratogens, such as retinoic acid, to planned use of drugs assumed to be safe, e.g. acetaminophen. However, the fetus is potentially more vulnerable than an adult or child, partly because the metabolic pathways that normally detoxify drugs and other chemicals are incomplete. Furthermore, the complexity of fetal biology makes it very difficult to predict what side effects a given drug might have in the developing organism.

Knowledge about the safety of the various types of drugs comes from different sources, initially from animal research, as all drugs released to the market have to be tested on experimental animals. However, the results cannot be easily extrapolated from animals to humans, and the main information on safety in humans must be derived from observational studies, including post-marketing surveillance and register-based studies. The major problem in observational studies is recognizing the causal relationship between drug exposure and its effect in these data sources. As an example, thalidomide, one of the best-known human teratogens, may cause specific malformations in one third of infants exposed in the first trimester. Despite that, thousands of malformed babies were born before the relationship was discovered, which illustrates the problem of leaving available information unused until someone comes up with a specific hypothesis. In the case of thalidomide, the signal was strong enough to alert a clinician. Most associations are expected to be much weaker and to occur over a longer time span, and will therefore be even harder to identify.


In this proposed study, our goal is to use the mining results to provide an early warning for drugs that are harmful to the unborn baby. We plan to extend our SmartRule data mining tool to study the Danish National Birth Cohort (DNBC) data set and derive specific rules that contain combinations of drug exposures in multiple time slots. Such temporal and sequence-dependent drug exposures may provide insights that are currently unknown. We should be able to identify the most vulnerable period for patients who are exposed to a combination of specific drugs during pregnancy.

Project Description

This is an interdisciplinary research project involving faculty and students from the Epidemiology Department and the Computer Science Department. More specifically, we plan to use novel data mining techniques to analyze real-life clinical data and reveal the side effects of drug exposure during pregnancy. Almost half of all pregnant women use medication during pregnancy. However, research indicates that a fetus is very vulnerable because it lacks, or has only incomplete, metabolic pathways to detoxify drugs and other chemicals that may cross the placental barrier. Furthermore, the complexity of fetal biology makes it very difficult to predict what side effects a given drug might have. Even in observational studies, including post-marketing surveillance and register-based studies, the major problem is recognizing the causal relationship between drug exposure and its effect in these data sources. As an example, thalidomide, one of the best-known human teratogens, may cause malformations in one third of infants exposed in the first trimester. Despite that, thousands of malformed babies were born before the relationship was discovered, which illustrates the difficulty of interpreting the data.

The traditional approach in epidemiology has been deductive data analysis, and as a result large data files remain unanalyzed for years. Setting up a deductive hypothesis usually means waiting for published reports on side effects to prompt research. Testing of specific associations then follows, using existing data or newly generated data sources; this usually takes months or years. We need a screening tool that can be applied to available data sources to identify drugs or combinations of drugs that deserve further scrutiny. We need inductive methods because, in principle, only a very small number of cause-effect mechanisms can be ruled out using biological reasoning alone. There are several technical reasons why such data are difficult for traditional data analysis methods. For example:

  1. Subtle side effect: In pregnancy data, only a very small number of cases may reflect the subtle influences of drug exposure. For example, among a large number of patients who took a particular type of drug, only 1% may be susceptible to preterm birth. However, this 1% is still a significant side effect of the drug to discover.
  2. Temporal sensitivity: Timing is an important aspect, as the susceptibility of the fetus varies during development. Thalidomide was linked to serious malformations of the limbs; however, these specific malformations only resulted from exposure in the first eight weeks after conception. With a different timing, other malformations may have resulted.
  3. Data sequences: The sequence of taking different types of drugs is related to the timing, but also involves the possible interaction between the different drugs. Almost nothing is known about the potential time-dependent interaction between drugs in pregnancy, partly due to the lack of analytical screening tools.


Data mining may be an alternative that can discover more patterns of drugs and health effects than can be scrutinized using more traditional techniques. The SmartRule mining technique that we developed for medical data classification may be extended to handle the problems mentioned above.

SmartRule can discover the small number of cases that reflect the subtle influence of drug exposure. Unlike traditional analysis methods, this mining approach can generate all possible rules, even at very low support and confidence levels. Such low-support or low-confidence rules can still be significant because of their contrast with normal pregnant women. With the hierarchical presentation of rules, it is easier for domain experts to discover these hidden rules in a rule tree. We can also partition the large dataset to extract the cases that were exposed to a drug of interest during a certain time period. Rules generated from this sub-population also provide more specific insights into the effect of composite events such as exposure to multiple drugs and/or confounders.
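To make the point about low-support rules concrete, the following Python sketch computes support and confidence for a single rule and compares the confidence with the baseline outcome rate. The ratio used here (commonly called lift) is only one possible way to express "contrast"; the source does not say which measure SmartRule uses. The function name `rule_stats`, the record layout, and the toy cohort are all invented for illustration.

```python
from typing import Dict, List

Record = Dict[str, bool]  # attribute name -> present/absent (hypothetical layout)


def rule_stats(records: List[Record], antecedent: List[str], outcome: str) -> Dict[str, float]:
    """Support, confidence, baseline rate and lift of the rule antecedent -> outcome."""
    n = len(records)
    # Records that satisfy every attribute on the left-hand side of the rule.
    matching = [r for r in records if all(r.get(a, False) for a in antecedent)]
    hits = sum(1 for r in matching if r.get(outcome, False))
    outcome_total = sum(1 for r in records if r.get(outcome, False))

    support = hits / n if n else 0.0
    confidence = hits / len(matching) if matching else 0.0
    baseline = outcome_total / n if n else 0.0
    # Lift is an assumed contrast measure, not necessarily the one SmartRule uses.
    lift = confidence / baseline if baseline else 0.0
    return {"support": support, "confidence": confidence,
            "baseline": baseline, "lift": lift}


# Invented toy cohort: 20 exposed women (4 preterm), 980 unexposed (49 preterm).
cohort = (
    [{"drug_x": True, "preterm": True}] * 4
    + [{"drug_x": True, "preterm": False}] * 16
    + [{"drug_x": False, "preterm": True}] * 49
    + [{"drug_x": False, "preterm": False}] * 931
)
stats = rule_stats(cohort, ["drug_x"], "preterm")
print(stats)
```

In this toy cohort the rule covers only 0.4% of all records and holds for only 20% of the exposed women, so a conventional support or confidence threshold would likely discard it, yet the exposed group has a preterm rate several times the overall rate.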

The derived rules can include the relationship between the timing of drug exposure and the safety outcome. Our proposed approach bypasses the problem of a large number of possible drug exposure sequences by dividing the pregnancy period into small time slots. The drug exposure information is represented for each time slot based on the patient's pharmacy record. By treating each drug exposure in a certain time slot as a single independent attribute, the rules generated contain both drug type and timing information. This method is very flexible in how the time slots are divided. The user can control the granularity of the time sequences to study specific effects: for example, drug exposure per trimester for a big picture, or drug exposure per week for finer-grained rules.
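The time-slot representation described above might be sketched as follows in Python. This is an illustration under stated assumptions, not the SmartRule implementation: the attribute naming scheme (`drug@slotN`), the function names, the trimester boundaries (weeks 13, 27 and 42) and the example exposure history are all hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List, Set, Tuple

# Hypothetical slot definitions: each list holds the last gestational week of each slot.
TRIMESTER_ENDS: List[int] = [13, 27, 42]          # coarse view, three slots
WEEKLY_ENDS: List[int] = list(range(1, 43))       # fine view, one slot per week


def slot_index(week: int, slot_ends: List[int]) -> int:
    """Return the 1-based slot that contains the given gestational week."""
    if week < 1 or week > slot_ends[-1]:
        raise ValueError(f"week {week} is outside the pregnancy period")
    return bisect_left(slot_ends, week) + 1


def encode_exposures(exposures: Iterable[Tuple[str, int]], slot_ends: List[int]) -> Set[str]:
    """Turn (drug, week) pairs into independent attributes such as 'cita@slot1'."""
    return {f"{drug}@slot{slot_index(week, slot_ends)}" for drug, week in exposures}


# Invented pharmacy history for one pregnancy: (drug, gestational week of exposure).
history = [("cita", 6), ("benzo", 15), ("cita", 30)]

by_trimester = encode_exposures(history, TRIMESTER_ENDS)
by_week = encode_exposures(history, WEEKLY_ENDS)
print(sorted(by_trimester))
print(sorted(by_week))
```

Because each (drug, slot) pair becomes an ordinary attribute, changing the granularity only means passing a different list of slot boundaries; the rule mining step itself does not need to know about time.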

To study the dependency between multiple pregnancies of the same patient, we extract the patient data and form a subpopulation of patients with multiple pregnancies. For each patient in the subpopulation, we list her drug exposures and other confounders in each pregnancy, together with the corresponding birth outcomes. The derived association rules reveal the patterns of the attributes and the corresponding birth outcomes. Such rules provide domain experts with information for analyzing how different outcomes relate to drug exposure.
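The extraction step for this subpopulation could look roughly like the Python sketch below. The field names (`patient_id`, `pregnancy_no`, `exposures`, `outcome`) and the sample records are assumptions made for illustration; the actual DNBC schema is not described in this text.

```python
from collections import defaultdict
from typing import Any, Dict, List

Pregnancy = Dict[str, Any]  # one row per pregnancy (hypothetical layout)


def multi_pregnancy_subpopulation(pregnancies: List[Pregnancy]) -> Dict[str, List[Pregnancy]]:
    """Group pregnancies by patient and keep patients with two or more pregnancies."""
    by_patient: Dict[str, List[Pregnancy]] = defaultdict(list)
    for p in pregnancies:
        by_patient[p["patient_id"]].append(p)
    # Order each patient's pregnancies so exposures and outcomes line up over time.
    return {
        pid: sorted(recs, key=lambda r: r["pregnancy_no"])
        for pid, recs in by_patient.items()
        if len(recs) >= 2
    }


# Invented sample records.
records = [
    {"patient_id": "A", "pregnancy_no": 2, "exposures": ["cita"], "outcome": "preterm"},
    {"patient_id": "A", "pregnancy_no": 1, "exposures": [], "outcome": "term"},
    {"patient_id": "B", "pregnancy_no": 1, "exposures": ["benzo"], "outcome": "term"},
]
subpop = multi_pregnancy_subpopulation(records)
print(subpop)
```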

Our SmartRule MFI mining algorithm does not require superset checking and reduces the computation for counting support; thus, it is very efficient in mining rules. To further improve the performance of mining MFI, we use a technique to gather past tail information to determine the next node to explore during the mining process. Our experimental results reveal that it is an order of magnitude faster than other MFI mining methods.
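For readers unfamiliar with the term, a maximal frequent itemset (MFI) is a frequent itemset that has no frequent proper superset. The Python sketch below only illustrates that definition by brute-force enumeration. It is deliberately not the SmartRule algorithm: it performs exactly the explicit superset checking and repeated support counting that the paragraph above says SmartRule avoids, and it is practical only for tiny inputs. The function name and the toy transactions are invented.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def maximal_frequent_itemsets(transactions: List[Set[str]], min_count: int) -> List[FrozenSet[str]]:
    """Naive MFI search: enumerate frequent itemsets level by level, then filter."""
    items = sorted(set().union(*transactions)) if transactions else []
    frequent: List[FrozenSet[str]] = []
    for size in range(1, len(items) + 1):
        level = [
            frozenset(combo)
            for combo in combinations(items, size)
            if sum(1 for t in transactions if set(combo) <= t) >= min_count
        ]
        if not level:
            break  # no frequent set of this size, so none of a larger size either
        frequent.extend(level)
    # Explicit superset check (the step an efficient MFI miner tries to avoid).
    return [s for s in frequent if not any(s < other for other in frequent)]


# Invented toy transactions: attributes observed in four pregnancies.
toy = [
    {"cita@slot1", "alcohol", "preterm"},
    {"cita@slot1", "alcohol", "preterm"},
    {"cita@slot1", "smoking"},
    {"alcohol"},
]
mfi = maximal_frequent_itemsets(toy, min_count=2)
print(mfi)
```

With a minimum count of 2, every subset of the three-item set is also frequent, but only the three-item set itself is maximal, which is why MFIs give a compact summary from which rules can be derived.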

Danish National Birth Cohort (DNBC) Dataset

DNBC is a large-scale cohort study focusing on many aspects of pregnancy; it follows approximately 100,000 pregnant women, recording their drug exposure and later developmental information. We have done a preliminary study to investigate the safety of different treatments of depression in pregnancy, which include antidepressants and other central nervous system (CNS) active drugs, e.g., benzodiazepines and antipsychotic medications. There are about 4454 patients in this sub-dataset, among whom approximately 1000 women exposed to various CNS active drugs have been identified, with high variation in timing and sequence. Using our proposed temporal data mining method, we are able to derive associations between antidepressant exposure in different periods and the preterm outcome. For example, we first confirmed that exposure to citalopram ("cita") significantly increases the risk of preterm labor. We then found that, for cita exposure in each of the three trimesters, "alcohol" is the most important factor that combines with the drug exposure to cause preterm birth. Further, the combination of cita exposure in different temporal periods with alcohol is relevant to causing preterm birth. Our data mining study has already produced findings that traditional statistical regression analysis was unable to discover, because regression analysis cannot deal with the timing and sequence of events.

Research Team

Computer Science Department: Laura Yu Chen, Jianming He, Wesley W. Chu
Epidemiology Department: Lars Henning Pedersen, Jorn Olsen


  • Yu Chen, Lars Henning Pedersen, Wesley W. Chu and Jorn Olsen. "Drug Exposure Side Effects from Mining Pregnancy Data." In SIGKDD Explorations (Volume 9, Issue 1), June 2007, Special Issue on Data Mining for Health Informatics, Guest Editors: Raymond Ng and Jian Pei.