Participants
This is a retrospective observational study of 420 patients with overt hyperthyroidism, including 127 TAF cases.
All participants had undergone or were undergoing outpatient or inpatient treatment for hyperthyroidism at the Almazov National Medical Research Centre or Pavlov First Saint Petersburg State Medical University between December 2000 and December 2019. First, the medical records of hyperthyroid patients were examined to select eligible subjects. Second, a single office visit was arranged to document each patient's case history. Finally, the course of the disease was followed up by telephone. Local Ethics Committee approval was obtained, and all participants signed an informed consent form before the study.
The participants were recruited in accordance with the criteria listed below.
Entry criteria:
1. Men and women with a history of overt hyperthyroidism, associated with Graves’ disease (GD), toxic adenoma (TA) or multinodular toxic goiter (MTG).
2. Age between 18 and 80 years.
Exclusion criteria:
1. Subclinical hyperthyroidism (without a period of overt hyperthyroidism).
2. A history of AF that developed before the onset of hyperthyroidism.
3. Concomitant diseases: severe obstructive lung diseases, severe blood disorders, severe organ failure.
4. Chronic intoxication (alcohol, drug or substance abuse).
5. Pregnancy at the time of hyperthyroidism.
Data collection and ascertainment of clinical features
Project data were collected retrospectively from inpatient and outpatient medical records (including electronic medical records) and from face-to-face and telephone patient interviews.
The dataset contained 36 study variables classified into six categories: demographic data, characteristics of hyperthyroidism course, cardiological status before and during hyperthyroidism, some metabolic parameters and blood tests, smoking status and heart rate-reducing therapy (Table 1). The variables were selected based on recognized or possible associations with TAF.
Thyroid status and other laboratory measurements were assessed at the time of newly diagnosed hyperthyroidism, before thyrostatic drugs were administered. Because the reference intervals differed, thyroid hormone and thyroid-stimulating hormone (TSH) receptor antibody values were evaluated as elevation above the upper limit of normal (ULN).
Hyperthyroidism duration was measured in months from the first clinical manifestations until a euthyroid state was reached. The duration of subclinical hyperthyroidism, the number of relapses and the periods of hypothyroidism were identified by repeated clinical monitoring of thyroid status.
The cardiovascular status was assessed before and during thyrotoxicosis. In TAF patients it was assessed prior to AF development. Initial cardiovascular status involved hypertension, coronary heart disease, cardiac arrhythmias and heart failure, diagnosed before hyperthyroidism development. Cardiovascular status during hyperthyroidism comprised the presence of the same pathologies excluding the coronary heart disease. Additionally, we assessed the heart rate at the time of thyrotoxicosis. It was defined as the average value, based on at least three measurements from the medical records. The analysis included only the values obtained during hyperthyroidism and before heart rate-reducing therapy administration.
Arterial hypertension was defined by a history of essential or secondary hypertension. The diagnosis was also made if the patient used antihypertensive medication or if a systolic blood pressure (SBP) of 140 mmHg or greater and/or a diastolic blood pressure (DBP) of 90 mmHg or greater was recorded at least twice in the medical record. Hypertensive patients were divided into those with target arterial blood pressure (ABP) and those with above-target ABP, according to the ABP level that had been present most of the time.
Coronary heart disease was defined as a history of angina pectoris and/or myocardial infarction and/or silent myocardial ischemia recorded on electrocardiogram (ECG) or Holter ECG monitoring and/or coronary angioplasty and/or coronary bypass.
Participants were categorized as having any rhythm disorder if it was present in diagnosis or registered on ECG/Holter ECG monitoring.
Heart failure was diagnosed based on the clinical criteria from the ESC guidelines, 2016 [25].
Metabolic parameters known to contribute to TAF development, such as body mass index, carbohydrate metabolism disorders and lipid profile, were assessed. Body mass index was calculated by dividing weight in kilograms (kg) by height in metres squared (m²). Diabetes was diagnosed if there was a history of diabetes or antidiabetic medication use, or if fasting blood glucose was 7 mmol/l or greater on at least two occasions.
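The two definitions above can be written as simple rules. The following is a minimal sketch only; the function and argument names are hypothetical and are not taken from the study code.

```python
def body_mass_index(weight_kg, height_m):
    """BMI = weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def has_diabetes(history, on_antidiabetic_drugs, fasting_glucose_mmol_l):
    """Apply the three alternative criteria from the text.

    fasting_glucose_mmol_l: list of recorded fasting glucose values (mmol/l).
    """
    # Count readings at or above the 7 mmol/l threshold.
    high_readings = sum(1 for g in fasting_glucose_mmol_l if g >= 7.0)
    return bool(history or on_antidiabetic_drugs or high_readings >= 2)
```

A single elevated glucose reading is not sufficient under this rule; at least two are required unless one of the other criteria is met.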
Moreover, smoking status was examined. Those who had been smoking before or during hyperthyroidism were classified as smokers. In TAF patients, smoking status was assessed before AF development.
We additionally analyzed potassium, hemoglobin and serum creatinine blood tests. The estimated glomerular filtration rate (GFR) was calculated with the CKD-EPI formula [26]. Potassium was assessed because both its increase and its decrease can lead to cardiac arrhythmias, including atrial fibrillation. The hemoglobin level was assessed because anemia can cause myocyte dysfunction as a result of oxygen deprivation. Renal function was estimated because renal failure has been shown to predispose to TAF [12].
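For illustration, the GFR estimate can be sketched as follows. This assumes that the formula referred to is the 2009 CKD-EPI creatinine equation and that creatinine was recorded in µmol/l; neither detail is stated in the text, and the function name is hypothetical.

```python
def egfr_ckd_epi_2009(creatinine_umol_l, age_years, female, black=False):
    """Estimated GFR (ml/min/1.73 m^2) by the 2009 CKD-EPI creatinine equation.

    Assumption: the study may have used a different CKD-EPI variant.
    """
    scr = creatinine_umol_l / 88.4      # convert umol/l to mg/dl
    kappa = 0.7 if female else 0.9      # sex-specific creatinine constant
    alpha = -0.329 if female else -0.411
    egfr = (141.0
            * min(scr / kappa, 1.0) ** alpha
            * max(scr / kappa, 1.0) ** -1.209
            * 0.993 ** age_years)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr
```

With all other inputs fixed, a higher serum creatinine or a higher age gives a lower estimate.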
Statistical analysis
Initially, the 36 studied features were compared between patients with and without TAF by classical statistical methods. We then trained several intermediate prediction models with eight machine learning algorithms and selected the most important variables for inclusion in the final model. Next, the best-performing model was tested. Lastly, we ranked the TAF predictors elicited from the final model using machine learning techniques.
The initial analysis of the data: descriptive statistics and data exploration
The initial analysis was conducted in SPSS Statistics 17.0. All study features except the TSH level were compared between patients who developed TAF and those who did not. Because the TSH level was below the detection threshold in the majority of cases, it was excluded from the analysis. The normality of the distribution was checked with the Kolmogorov-Smirnov test. Depending on the distribution and type of each variable, differences in the studied parameters were evaluated with the Mann-Whitney U test, Pearson’s chi-square test or Fisher’s exact test. A p-value below 0.05 was considered statistically significant.
The data are presented as mean ± standard deviation for normally distributed variables and as median (interquartile range (IQR)) for non-normally distributed variables.
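As an illustration of one of the tests named above, the sketch below computes a two-sided Fisher's exact test p-value for a 2×2 table using only the Python standard library. It does not reproduce the SPSS analysis; the function name is hypothetical, and the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        p for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
```

For a categorical feature, `a` and `b` would be the counts of patients with and without the feature in the TAF group, and `c` and `d` the corresponding counts in the group without TAF.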
Derivation of a thyrotoxic atrial fibrillation prediction model
We used machine learning techniques and Python 3.6 to develop a TAF prediction model.
The steps of model development are described below.
Input variables
The analysis of previously examined TAF risk factors [4, 5, 7–19] and non-thyrotoxic AF prediction tools [24, 27–30] helped to define the input variables for our models. First, we built several intermediate models that included more than 30 variables. Then, to facilitate implementation of the model in clinical practice, we reduced the number of predictors. We eliminated features of low clinical utility, such as serum potassium and lipids, since their concentrations are highly variable and strongly depend on medication and diet. We then removed the features of low importance for the model output using a multivariable analysis based on the feature importance method for decision trees. Each decision tree consists of nodes and edges. At each node, one feature is used to divide the observations into classes. The feature is chosen according to a splitting criterion: for classification tasks, the Gini index; for regression tasks, the variance. The average reduction in the Gini index attributable to each feature served as the feature importance indicator. As a result, the ten most important and clinically feasible features were selected for the final model.
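The impurity-based criterion described above can be illustrated with a short sketch. It computes the Gini impurity of a set of class labels and the weighted decrease in impurity produced by one threshold split. The function names are hypothetical, and the calculation covers a single node rather than the averaging over all nodes and trees performed by tree-ensemble libraries.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(values, labels, threshold):
    """Weighted drop in Gini impurity when splitting on `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - children
```

A feature whose splits produce large decreases, averaged over the nodes in which it is used, receives a high importance score.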
Preprocessing of the data
Preprocessing of the data comprised the following steps: normalization (sklearn.preprocessing.normalize), scaling (sklearn.preprocessing.scale), resampling to balance the classes and replacement of missing values.
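Three of these steps are sketched below in plain Python. The text does not specify the imputation or resampling method, so median imputation and random oversampling of the minority class are assumptions made for this example only, and all function names are hypothetical.

```python
import random
import statistics


def impute_median(column):
    """Replace missing values (None) with the median of the observed values."""
    median = statistics.median(v for v in column if v is not None)
    return [median if v is None else v for v in column]


def standardize(column):
    """Scale a column to zero mean and unit variance (population SD)."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in column]


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels
```

The `standardize` helper mirrors column-wise scaling to zero mean and unit variance, which is what sklearn.preprocessing.scale does by default.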
Splitting the data
To evaluate the models’ quality, we randomly divided the study sample into two parts: 70% (n = 294) were used for the estimation of the models (training) and 30% (n = 126) for the validation (testing).
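A random 70/30 split of 420 records can be sketched as follows. The seed and the function name are arbitrary choices for this example and are not taken from the study.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.3, seed=42):
    """Shuffle indices 0..n_samples-1 and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# 420 patients give 294 training and 126 testing records.
train_idx, test_idx = train_test_split_indices(420)
```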
Machine learning classification algorithms
We investigated the performance of the following machine learning methods: logistic regression, decision tree classifier, random forest classifier, dummy classifier, K-neighbors classifier, Bernoulli naive Bayes classifier, eXtreme Gradient Boosting classifier (XGB classifier) and Support Vector Machines for Classification.
Model performance assessment
The next step was to estimate the models’ performance. For this purpose, five-fold cross-validation was performed. Quality indicators included accuracy, recall, precision, F1 score and area under the receiver operating characteristic curve (AUROC). Accuracy and AUROC evaluate the overall performance of a classifier. Accuracy is the proportion of correct predictions among all predictions made. AUROC is based on the receiver operating characteristic curve, which plots the trade-off between sensitivity and 1 − specificity [31]. Precision is the number of true positives divided by the sum of true positives and false positives. Recall (sensitivity) is the number of true positives divided by the sum of true positives and false negatives. The F1 score is the harmonic mean of precision and recall: 2 × (precision × recall)/(precision + recall). The best performing models were chosen according to these indicators, and their hyperparameters were selected by grid search. Finally, we validated the best model on the test set only.
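The quality indicators defined above, together with a five-fold partition, can be sketched in plain Python. The function names are hypothetical. The AUROC is computed here as the proportion of positive-negative pairs in which the positive case receives the higher score, which is equivalent to the area under the curve. The fold generator assumes the records have already been shuffled.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size
```

In cross-validation, each model is fitted on the `train` indices of a fold and scored on its `validation` indices, and the metrics are averaged over the five folds.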
Model interpretation
To represent the prediction model graphically, three interpretability techniques (feature importance, the SHapley Additive exPlanations (SHAP) method and partial dependence plots) were applied. Each is described below.
1) Feature importance. To show the degree of impact of each feature on the model output, we used charts of the feature importance ranking. Feature importance is defined as the increase in the model’s prediction error after the values of a feature are permuted. A feature is considered important if permuting its values increases the error [32].
2) SHAP, or the Shapley values method. A SHAP plot presents the average contribution of each feature to the model prediction across different coalitions of features. The method builds on the Shapley value, a solution concept from game theory for fairly distributing both gains and costs among several players working in a coalition [32].
3) Partial dependence plot. It shows the marginal effect that one or two features have on the predicted outcome of a machine learning model [33]. To construct a partial dependence plot, a variable is selected and its value is varied continuously while the change in the predicted value is observed and recorded.
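Permutation feature importance and partial dependence, as described in items 1 and 3, can be illustrated with a minimal sketch. The `toy_model` function and its coefficients are invented for this example and are unrelated to the fitted model; rows are tuples of (age, heart rate, unused feature).

```python
import random


def toy_model(row):
    """Hypothetical risk score: uses age and heart rate, ignores the third feature."""
    age, heart_rate, _unused = row
    return 0.01 * age + 0.005 * heart_rate


def mean_squared_error(model, rows, targets):
    """Average squared difference between predictions and targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature, seed=0):
    """Increase in error after shuffling the values of one feature column."""
    baseline = mean_squared_error(model, rows, targets)
    column = [r[feature] for r in rows]
    random.Random(seed).shuffle(column)
    permuted = [r[:feature] + (v,) + r[feature + 1:]
                for r, v in zip(rows, column)]
    return mean_squared_error(model, permuted, targets) - baseline


def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value in turn."""
    curve = []
    for value in grid:
        modified = [r[:feature] + (value,) + r[feature + 1:] for r in rows]
        curve.append(sum(model(r) for r in modified) / len(modified))
    return curve


# Synthetic rows: (age, heart rate, a feature the toy model does not use).
rows = [(20 + 3 * i, 60 + 4 * i, (i * 7) % 5) for i in range(20)]
targets = [toy_model(r) for r in rows]
```

Shuffling a feature that the model ignores leaves the error unchanged, so its importance is zero, whereas shuffling age increases the error.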
Investigation of the TAF predictors elicited from the model
We used feature importance and SHAP values methods to rank and select the most important TAF predictors elicited from the model.
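To make the ranking by Shapley values concrete, the sketch below computes exact Shapley values for a single prediction by enumerating every coalition of features. This is not the algorithm of the SHAP library, which uses faster approximations, and it is feasible only for a handful of features. The linear `toy_model`, its coefficients and the baseline values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `model` at `instance` relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def toy_model(x):
    """Hypothetical linear score of age, heart rate and smoking status."""
    return 0.01 * x[0] + 0.005 * x[1] + 0.2 * x[2]


instance = [70, 120, 1]   # the patient being explained (invented values)
baseline = [50, 80, 0]    # reference values (invented)
phi = shapley_values(toy_model, instance, baseline)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and features can be ranked by the mean absolute contribution over all patients.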