The support vector machine

The Support Vector Machine (SVM) was implemented using a linear kernel and eps-regression. Briefly, the SVM classifies a disease by taking its symptom vector (i.e. the frequency of each symptom for that disease), mapping it to a (potentially) higher-dimensional space and determining on which side of a "separating hyperplane" the disease lies. The hyperplane is generated from a training set, here the set of symptom vectors for the known mitochondrial diseases and the set of symptom vectors for the known non-mitochondrial diseases. The hyperplane is optimised such that it best separates the mitochondrial from the non-mitochondrial diseases.

The linear kernel allows for an easy interpretation of the results. Linearity means that the symptoms of the disease are used directly in the classification without any non-linear mapping, which allows the contribution of each symptom to the score to be assessed individually. E.g. the symptom with the highest affinity for mitochondrial diseases (for the filtered SVM) is "Optic atrophy" with a weight of 1.000827, which means that a disease in which 100% of patients are afflicted with this symptom receives a contribution of +1.000827 to the score.
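As an illustration of this additive reading, the following minimal Python sketch computes a linear score as a weighted sum of symptom frequencies. Only the "Optic atrophy" weight is taken from the text; the other symptoms, their weights, the zero intercept and the function names are hypothetical and are not the model's actual values.

```python
# Hypothetical weights; only the "Optic atrophy" value comes from the text.
weights = {"Optic atrophy": 1.000827, "Ataxia": 0.40, "Fever": -0.30}


def contributions(symptom_freq, weights):
    """Per-symptom contribution: weight times frequency (frequency in 0..1)."""
    return {s: weights.get(s, 0.0) * f for s, f in symptom_freq.items()}


def linear_score(symptom_freq, weights, bias=0.0):
    """Linear-kernel score: intercept plus the sum of all contributions.

    The intercept (bias) is assumed to be zero here for simplicity.
    """
    return bias + sum(contributions(symptom_freq, weights).values())


# Hypothetical disease: every patient has optic atrophy, half have ataxia.
disease = {"Optic atrophy": 1.0, "Ataxia": 0.5, "Fever": 0.0}

per_symptom = contributions(disease, weights)
score = linear_score(disease, weights)
```

Because the score is a plain sum, each entry of `per_symptom` can be read on its own, which is what makes the linear kernel easy to interpret.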

"Eps-regression" allows for a continuum of scores, whereas the other (typical) choice, "C-classification", is a binary "either/or" classification. I.e. with "C-classification" a disease would get either -1 or +1 depending on the symptoms present, while "eps-regression" provides a more nuanced view in which a disease can be "marginally mitochondrial" (having a score just above 0, e.g. 0.2).
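The difference in output can be sketched in a few lines of Python. The scores and disease names below are hypothetical, and the thresholding function only mimics the form of a binary output; an actual "C-classification" model is trained with a different objective and would not necessarily agree with the sign of a regression score.

```python
def as_binary_label(score):
    """Keep only the side of the hyperplane: +1 above zero, otherwise -1."""
    return 1 if score > 0 else -1


# Hypothetical continuous scores of the kind a regression-type SVM returns.
scores = {"disease A": 1.4, "disease B": 0.2, "disease C": -0.9}

# A binary view collapses "clearly" and "marginally" positive into one label.
labels = {name: as_binary_label(s) for name, s in scores.items()}
```

Here "disease A" and "disease B" receive the same binary label, while the continuous scores still show that "disease B" is only marginally above zero.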

Several variations of the SVM can be selected: one unfiltered, taking all symptoms in the database into account, and several filtered, using only symptoms with a cumulative prevalence of at least 25% (i.e. one disease with 25% prevalence or five diseases with 5% prevalence each of a given symptom), 50%, 75%, 100% or 200%.
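A minimal Python sketch of such a filter is given below. It assumes that the cumulative prevalence of a symptom is the sum of its prevalence (in percent) over all diseases supplied, and that the threshold is inclusive; the function name, the data layout and the example values are hypothetical.

```python
def filter_symptoms(prevalence, threshold):
    """Return the symptoms whose summed prevalence reaches the threshold.

    prevalence maps disease -> {symptom: prevalence in percent}.
    """
    totals = {}
    for symptoms in prevalence.values():
        for symptom, percent in symptoms.items():
            totals[symptom] = totals.get(symptom, 0) + percent
    return {s for s, total in totals.items() if total >= threshold}


# Hypothetical data: one disease with 25% "Optic atrophy",
# five diseases with 5% "Ataxia" each, one disease with 10% "Fever".
prevalence = {
    "d1": {"Optic atrophy": 25, "Ataxia": 5, "Fever": 10},
    "d2": {"Ataxia": 5},
    "d3": {"Ataxia": 5},
    "d4": {"Ataxia": 5},
    "d5": {"Ataxia": 5},
}

kept_25 = filter_symptoms(prevalence, 25)
kept_50 = filter_symptoms(prevalence, 50)
```

With these values, both ways of reaching 25% named in the text pass the 25% filter, while no symptom passes the 50% filter.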

The complete set of equations used can be found here