Creative Commons License
Except where otherwise noted, this work is licensed under Creative Commons Attribution-NonCommercial 4.0 International License.

A Pilot Study on Prediction of Pouchitis in Ulcerative Colitis Patients by Decision Tree Method Versus Logistic Regression Analysis


1 Department of Biostatistics, School of Medicine, Shiraz University of Medical Sciences, Shiraz, IR Iran
2 Gastroenterohepatology Research Center, Shiraz University of Medical Sciences, Shiraz, IR Iran
3 Colorectal Research Center, Shiraz University of Medical Sciences, Shiraz, IR Iran
*Corresponding author: Ali Reza Safarpour, Gastroenterohepatology Research Center, Shiraz University of Medical Sciences, Shiraz, IR Iran, Tel.: +98-7112357282, Fax: +98-7112307594, E-mail: asafarpour@sums.ac.ir.
Annals of Colorectal Research. 2013 September; 1(2): 67-70. DOI: 10.17795/acr-12649
Article Type: Research Article; Received: Jun 2, 2013; Revised: Jul 7, 2013; Accepted: Jul 20, 2013; epub: Jul 30, 2013; ppub: Sep 29, 2013

Abstract


Background: Pouchitis is a non-specific inflammation of the ileal reservoir and the most frequent long-term complication that patients experience. The diagnosis should be based on clinical, endoscopic, and histological findings. Predicting pouchitis is an important issue for physicians.

Objectives: This study aimed to identify the predictive factors of pouchitis and their relative importance.

Patients and Methods: Two classification techniques, the decision tree method and logistic regression analysis, were used to help physicians predict pouchitis in ulcerative colitis (UC) patients who had undergone a specific surgery. The two methods were compared on the accuracy of their predictions (the minimum error rate) and on the interpretability and simplicity of their results for clinical experts.

Results: The prediction accuracy rate was 0.6 for logistic regression and 0.45 for the decision tree algorithm. The mean squared error was also lower for logistic regression (0.41 versus 0.48). However, the area under the ROC curve was higher for the decision tree than for logistic regression (0.52 and 0.45, respectively).

Conclusions: The results do not favor either of the two methods. However, the simplicity of the decision tree for clinical experts and the theoretical assumptions required by logistic regression make the choice clear. A larger sample size may be needed to choose the best model with more confidence.

Keywords: Pouchitis; Ulcerative Colitis; Decision Trees; Logistic Regression

1. Background


Up to 30% of patients suffering from ulcerative colitis (UC) ultimately need a total colectomy (1). The most frequent indications for colectomy include intractable disease and the occurrence of dysplasia or cancer in longstanding colitis. A total proctocolectomy with ileal pouch-anal anastomosis (IPAA) is the surgery of choice for the “definitive” management of UC, since it avoids a permanent stoma while removing all diseased colonic mucosa (1). The most frequent long-term complication is pouchitis, with cumulative incidence rates varying widely between studies (7% to 59%).

The most frequent symptoms that characterize pouchitis include increased stool frequency and fluidity, rectal bleeding, abdominal cramping, urgency, malaise, tenesmus, and, in the most severe cases, incontinence and fever (2).

A clinical diagnosis should be confirmed by endoscopy and histology. Predicting pouchitis in patients with ulcerative colitis is a challenging issue for physicians, and one that calls for the classification methods of data mining.

There are different classification methods in data mining. Some are parametric (depending on underlying theoretical assumptions), such as the logistic regression model, and others are nonparametric (assumption free), such as artificial neural networks, decision trees, and K-nearest neighbors. Logistic regression is a predictive model in which the output variable is binary, such as healthy or unhealthy, dead or alive, or win or loss, and it is used to predict the probability of the desired event (3). Logistic regression is widely applied in the medical sciences. The binary output variable takes one of two possible values, denoted by 1 and 0 (for example, Y = 1 if a disease is present; Y = 0 otherwise). The input variables, denoted by X = (x1, x2, …, xn), are the attributes used to predict the probability of the desired event (Y = 1). Logistic regression models the relation between these variables through the following formula:

$$\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

where P stands for probability, b0 is called the “intercept”, and b1, b2, … are called the “regression coefficients” of x1, x2, …, respectively. Each regression coefficient describes the importance of the corresponding input attribute for the output.
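As a rough illustration of how a fitted logistic regression model turns input attributes into a probability, the following Python sketch applies the inverse of the logit transformation, P(Y = 1) = 1 / (1 + exp(-(b0 + b1 x1 + … + bn xn))). The function name and the coefficient values are hypothetical and are not estimates from this study.

```python
import math

def logistic_probability(intercept, coefficients, x):
    """Return P(Y = 1) for input attributes x under a logistic regression model."""
    # Linear predictor: b0 + b1*x1 + ... + bn*xn
    linear = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    # The inverse logit maps the linear predictor to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients for two input attributes (not from the study)
b0 = -1.0
b = [0.8, -0.5]
p = logistic_probability(b0, b, [1.0, 2.0])
print(f"P(Y = 1) = {p:.3f}")
```

Under this form, exp(bi) is the multiplicative change in the odds of the event for a one-unit increase in xi, which is one common way to read the "importance" of an attribute.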

As mentioned earlier, logistic regression depends heavily on its theoretical assumptions, whereas real datasets, clinical datasets in particular, rarely satisfy the assumptions of parametric modeling methods. The variable nature of biological data and the vague relations among them are not consistent with these ideal assumptions. In comparison, nonparametric data mining methods, which learn from a set of existing examples, have attracted the attention of researchers in different fields. In these methods, without any specific underlying assumption, the relations within a large part of the dataset (the training set) are discovered and the model parameters are estimated in such a way that the prediction error is minimized. The predictive power of the model is then evaluated on the other part of the dataset (the testing set). The decision tree is one of these methods for classifying objects into decision classes (4). A decision tree classifier is a function defined as follows:

dt:dom(X1 )×dom(X2 )×…×dom(Xn)→dom(Y)

Here X1, X2, …, Xn are the input attributes and Y is the output, where Xi has domain dom(Xi) and Y has domain dom(Y). A decision tree is a directed, acyclic graph T in the form of a tree. Each node in the tree has zero or more outgoing edges. If a node has no outgoing edge, it is called a decision node (a leaf node); otherwise, it is called a test node (or an attribute node). Each decision node N is labeled with one of the possible decision classes, Y ∈ {Y1, Y2}. Each test node is labeled with one input attribute, Xi ∈ {X1, X2, …, Xn}, called the splitting attribute. Each splitting attribute Xi has an associated splitting function fi, which determines the outgoing edge from the test node based on the Xi value of the object O in question. It has the form Xi ∈ Yi, where Yi ⊂ dom(Xi). If the Xi value of O is within Yi, then the corresponding outgoing edge from the test node is chosen (5).

All of this leads to practical rules that help a person make a decision. Decision trees play a vital role in medical diagnosis (6). The rules derived from this method help physicians make decisions about patients based on their own clinical observations. Among the classifiers in data mining, the decision tree is preferred in medical research because it provides readable classification rules, is easy to interpret, has better accuracy, and can be constructed quickly (6). These advantages led us to apply this method to the prediction of pouchitis in patients with ulcerative colitis.
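The structure described above can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions: each outgoing edge corresponds to a single attribute value (a one-element subset of dom(Xi)), and the class names `Leaf` and `AttributeNode`, the attributes "A" and "B", and the class labels are all hypothetical and unrelated to the study data.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    """Decision node: no outgoing edges, labeled with a decision class."""
    decision: str

@dataclass
class AttributeNode:
    """Test node: labeled with a splitting attribute, one outgoing edge per value."""
    attribute: str
    edges: Dict[str, Union["AttributeNode", Leaf]] = field(default_factory=dict)

def classify(node, obj):
    """Follow outgoing edges from the root until a decision node is reached."""
    while isinstance(node, AttributeNode):
        # Splitting function: pick the edge that matches the object's attribute value
        node = node.edges[obj[node.attribute]]
    return node.decision

# Hypothetical two-level tree over attributes "A" and "B"
tree = AttributeNode("A", {
    "yes": Leaf("class_1"),
    "no": AttributeNode("B", {"low": Leaf("class_2"), "high": Leaf("class_1")}),
})
result = classify(tree, {"A": "no", "B": "low"})
print(result)
```

Each root-to-leaf path in such a tree reads as one "if ... then ..." rule, which is what makes the method easy for clinical experts to interpret.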

2. Objectives


In the present study, the performance of the decision tree method is compared with that of logistic regression analysis on a real clinical dataset. Since only a small number of patients and clinical observations were available, this study was conducted as a pilot. A larger sample of patients is required to derive more reliable rules.

3. Patients and Methods


All 43 patients who underwent a proctocolectomy with IPAA for ulcerative colitis (UC) at the Nemazi and Faghihi Hospitals of Shiraz University of Medical Sciences (tertiary referral centers) between 2001 and 2012 were identified from the pre- and postoperative data.

Clinical charts of all patients were reviewed to trace the clinical, endoscopic, and histologic characteristics. Occurrence of pouchitis was considered as the binary output and 85 related attributes were used as the inputs or risk factors.

Logistic regression analysis and the decision tree method were applied to the data (7) using Weka software. The C4.5 algorithm was chosen for constructing the decision tree. This algorithm is called J48 in Weka and was chosen for its ability to handle binary outputs, both nominal and numeric input attributes, and missing values. The dataset was split into 70% for the training set (constructing the tree) and the remaining 30% for testing. To evaluate the two modeling methods, a 10-fold cross-validation procedure for accuracy rate estimation and the area under the receiver operating characteristic (ROC) curve were used (8).
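To make the cross-validation step concrete, the sketch below partitions 43 observations into 10 folds so that each observation is used for testing exactly once. It is a plain, unstratified illustration written for this text; it is not Weka's implementation, and the function name and seed are assumptions.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the partition is reproducible
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 43 patients and 10 folds, as in the study design
splits = list(k_fold_indices(43, 10))
for train, test in splits:
    print(len(train), len(test))
```

With 43 observations, each test fold holds only 4 or 5 patients, which illustrates why performance estimates from such a small sample are unstable.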

4. Results


Some descriptive attributes and the frequency of pouchitis in the patient sample are summarized in Table 1. In addition, factors such as the onset of disease symptoms, extraintestinal manifestations, complications, autoimmune diseases, microscopic findings, laboratory values at the time of surgery (WBC, HB, Na, K, etc.), instruments used during surgery, and some postoperative factors were considered as input risk factors. Logistic regression analysis and the J48 decision tree algorithm were used to predict the occurrence of pouchitis in the patients. The results for both methods are shown in Table 2. However, the large number of categorical variables and their parameters caused problems in the estimation process of the logistic regression method; for example, the number of parameters was equal to or even greater than the number of observations in this pilot study. Therefore, it was not possible to enter all the variables into the model. Consequently, variables significantly associated with the occurrence of pouchitis in univariate analysis (chi-square test) were selected and then entered into the model simultaneously. Unfortunately, the clinically important variables were not entered into the model; thus, the fitted model is not shown here.
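The univariate screening step can be illustrated with a Pearson chi-square test on a 2×2 table of a binary risk factor against pouchitis. The sketch below is an assumption-laden example: it omits the continuity correction, obtains the p-value for one degree of freedom from the complementary error function, and uses made-up counts that do not come from the study.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    All four marginal totals must be nonzero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shortcut formula for the 2x2 case
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: rows = factor present/absent, columns = pouchitis yes/no
stat, p_value = chi_square_2x2(12, 8, 8, 15)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
```

With samples this small, expected cell counts are often below 5, in which case an exact test is usually considered more appropriate than the chi-square approximation.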


An area under the ROC curve close to 1 and far from 0.5, together with a mean squared error close to 0, is desired. The farther the area under the ROC curve is from 0.5, the better the model distinguishes between patients with and without pouchitis. Furthermore, a mean squared error near 0 indicates a low prediction error rate. Accordingly, the findings showed weak results for both methods, and the same was true of their accuracy rates. The main reason was the small sample size relative to the large number of attributes in the study. The rules derived from the constructed decision tree are summarized in Table 3, and Figure 1 presents these rules as a flowchart. The rules were evaluated on the testing set (the 30% of observations not used in model construction). With a larger sample size, more reliable rules can be derived.
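For readers who want to see how such evaluation measures are computed, the following sketch derives the TP rate, FP rate, accuracy rate, area under the ROC curve, and mean squared error from true labels and predicted probabilities. The area under the ROC curve is computed here as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The data are invented for illustration, and this is not the computation performed by Weka.

```python
def evaluate(y_true, y_prob, threshold=0.5):
    """Summarize binary predictions; y_true holds 0/1 labels, y_prob holds P(Y = 1)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Scores of the positive and negative cases, used for the pairwise AUC
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)

    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / len(y_true),
        "auc": wins / (len(pos) * len(neg)),
        "mse": sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true),
    }

# Invented labels and predicted probabilities (not study data)
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
metrics = evaluate(y_true, y_prob)
print(metrics)
```

An area of 0.5 corresponds to a model that ranks a random patient with pouchitis above a random patient without it only half the time, that is, no better than chance.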


5. Discussion


The present study attempted to predict pouchitis in patients with ulcerative colitis, a condition whose diagnosis involves some uncertainty for physicians (1). In other words, there are no clear laboratory tests or universally accepted diagnostic criteria for the occurrence of pouchitis in ulcerative colitis patients. Therefore, we tried to model the relations among the clinical findings of these patients with two classification methods. One, logistic regression analysis, is more theoretical and has ideal underlying assumptions that must be satisfied before use; the other, the decision tree technique, is more practical and flexible with respect to real data but requires a larger dataset for training (5). Unfortunately, the number of patients available for this research was small, and the study was consequently conducted on a pilot scale. The results therefore could not reveal the real performance of the two methods. Nevertheless, the simplicity and interpretability of the decision tree were evident from the results. Since the underlying theoretical assumptions of logistic regression analysis were not checked before modeling, the application of this method was not appropriate, and its results should be interpreted with caution. Furthermore, logistic regression cannot handle a large number of categorical variables, especially with a small sample size (5). Therefore, although the variables that were significant in the preliminary univariate analysis were entered into the logistic regression model, the clinically important variables did not remain in the final model, and consequently the estimated model was not valuable for clinical experts. In contrast, the decision tree algorithm led to practical rules that direct the user to a decision. With a larger sample size, more accurate results could be achieved with both methods.

Acknowledgments

We wish to thank the Colorectal Research Center staff for their efforts in data gathering.

Footnotes

Implication for health policy/practice/research/medical education: Two theoretical methods help clinical experts predict pouchitis in ulcerative colitis patients.
Authors’ Contribution: Saeedeh Pourahmad: data analyst and main writer of the manuscript. Ali Reza Safarpour, Alimohammad Bananzadeh, and Salar Rahimikazerooni: clinical research group. Zahra Zabangirfard: data gathering secretary.
Financial Disclosure: The authors declare no conflict of interest.
Funding/Support: This work was financially supported by the Colorectal Research Center, Shiraz University of Medical Sciences, Shiraz, Iran.

References


  1. Pardi DS, D'Haens G, Shen B, Campbell S, Gionchetti P. Clinical guidelines for the management of pouchitis. Inflamm Bowel Dis. 2009;15(9):1424-31.
  2. Yu ED, Shao Z, Shen B. Pouchitis. World J Gastroenterol. 2007;13(42):5598-604.
  3. Srivatsa SK. Evaluation of Logistic Regression and Neural Network Model With Sensitivity Analysis on Medical Datasets. Int J Computer Sci Secur (IJCSS). 2011;5(5).
  4. Kokol P, Pohorec S, Štiglic G, Podgorelec V. Evolutionary design of decision trees for medical application. Wiley Interdisciplin Rev: Data Mining and Knowledge Discovery. 2012;2(3):237-54.
  5. Pourahmad S, Azad M, Paydar S. Prediction of Malignancy in Suspected Thyroid Tumour Patients by Three Different Methods of Classification in Data Mining. 2012; Available from: http://airccj.org/CSCP/vol2/cs...
  6. Lavanya D. Performance Evaluation of Decision Tree Classifiers on Medical Datasets. Int J Comput App. 2011;26(4):1.
  7. Endo A, Shibata T, Tanaka H. Comparison of Seven Algorithms to Predict Breast Cancer Survival. Biomed Soft Comput Human Sci. 2008;13(2):11-16.
  8. Long WJ, Griffith JL, Selker HP, D'Agostino RB. A comparison of logistic regression to decision-tree induction in a medical domain. Comput Biomed Res. 1993;26(1):74-97.

Table 1.

Some Descriptive Attributes of our Patients’ Samples

Attribute                        Percent, %
Gender
  Male                           44
  Female                         56
Marital status
  Single                         32.4
  Married                        67.6
Education level
  Illiterate                     14.7
  Primary                        38.2
  High school                    23.5
  University                     14.8
  Post graduate                  8.8
Birth place
  Shiraz                         39.3
  Other cities in Fars           60.7
Family history of pouchitis
  Yes                            35.3
  No                             64.7
Under surgery
  Yes                            29.4
  No                             70.6
Type of surgery
  Laparoscopy                    61.8
  Laparotomy                     38.2
Malignancy
  No                             81.3
  Rectum                         16.3
  Liver and gall bladder         2.4 (1 case)
Occurrence of pouchitis
  Yes                            47.1
  No                             52.9

Table 2.

Brief Results of two Applied Modeling Method on the Clinical Data Set

Method                          TP rate   FP rate   Accuracy rate   Area under ROC   Mean squared error
Logistic regression             0.6       0.43      0.6             0.45             0.41
Decision tree (J48 algorithm)   0.4       0.48      0.45            0.52             0.48

Table 3.

The derived rules from the trained decision tree based on clinical findings of 43 ulcerative colitis patients

The rules                                                                                      The occurrence of pouchitis
Appendectomy, yes                                                                              Yes
Appendectomy, no and symptom (onset of disease) bloody, no                                     Yes
Appendectomy, no and symptom (onset of disease) bloody, yes and Post-op-AB (a) Gentamycin, yes No
Appendectomy, no and symptom                                                                   Yes
(onset of disease) bloody, yes and Post-op-AB Gentamycin, no                                   No

(a) Post-op-AB: postoperative antibiotics

Figure 1.

Decision Tree Derived from the Information of 43 Ulcerative Colitis Patients
