Theodore Turner

Feature Engineering For Machine Learning And Da...

Feature engineering or feature extraction or feature discovery is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data.[1] The motivation is to use these extra features to improve the quality of results from a machine learning process, compared with supplying only the raw data to the machine learning process.
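
As a rough illustration of what "using domain knowledge to extract features" can look like, the sketch below derives three features from one raw record. The record schema (`admitted`, `discharged`, `height_cm`, `weight_kg`) and the function name `engineer_features` are hypothetical and chosen only for this example; they do not come from the text above.

```python
from datetime import datetime


def engineer_features(record):
    """Derive extra features from one raw record (hypothetical schema)."""
    admitted = datetime.fromisoformat(record["admitted"])
    discharged = datetime.fromisoformat(record["discharged"])
    height_m = record["height_cm"] / 100.0
    return {
        # Whole days between the two timestamps.
        "length_of_stay_days": (discharged - admitted).days,
        # weekday() returns 5 for Saturday and 6 for Sunday.
        "admitted_on_weekend": admitted.weekday() >= 5,
        # Body mass index, a ratio that neither raw column expresses alone.
        "bmi": round(record["weight_kg"] / height_m ** 2, 1),
    }


raw = {
    "admitted": "2023-03-04T10:00:00",
    "discharged": "2023-03-09T15:30:00",
    "height_cm": 170,
    "weight_kg": 72.25,
}
features = engineer_features(raw)
print(features)
```

A model given `features` alongside the raw columns no longer has to discover the ratio or the date arithmetic on its own, which is the motivation stated above.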


Automation of feature engineering is a research topic that dates back to the 1990s.[11] Machine learning software that incorporates automated feature engineering has been commercially available since 2016.[12] Related academic literature can be roughly separated into two types:

Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error.[38][39] Deep learning algorithms may be used to process a large raw dataset without having to resort to feature engineering.[40] However, it's important to note that deep learning algorithms still require careful preprocessing and cleaning of the input data.[41] In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process.[42]

Exploiting information in health-related social media services is of great interest for patients, researchers and medical companies. The challenge, however, is to provide easy, quick and relevant access to the vast amount of information that is available. One step towards facilitating information access to online health data is opinion mining. Even though the classification of patient opinions into positive and negative has been tackled previously, most works make use of machine learning methods and bags of words. Our first contribution is an extensive evaluation of different features, including lexical, syntactic, semantic, network-based, sentiment-based and word embedding features, to represent patient-authored texts for polarity classification. The second contribution of this work is the study of polar facts (i.e. objective information with polar connotations). Traditionally, the presence of polar facts has been neglected and research in polarity classification has been limited to opinionated texts. We demonstrate the existence and importance of polar facts for the polarity classification of health information.

However, the amount of information is so vast that it is difficult for users to find what they need. One step towards easier access to relevant information is opinion mining, which is frequently understood as classifying a fragment of text (phrase, sentence, or document) into polarity classes such as positive, neutral or negative. To address this problem, several unsupervised and supervised methods have been proposed. Traditionally, the supervised methods have achieved significantly better results and have been implemented using machine learning techniques, the main challenge being the identification of appropriate signals or features for the algorithms. However, to the best of our knowledge, no previous work has evaluated the effectiveness of novel approaches based on word embeddings for extracting features from e-Health forums. To this end, our aim is to evaluate to what extent word embeddings, which have been widely used in other classification tasks and domains, may be applied to patient-authored content, along with other traditional lexical, grammatical, semantic, network-based and sentiment-based features. To do this, we evaluate their effectiveness on the well-known task of polarity classification; in particular, the classification of sentences from online health forums into positive, negative and neutral.
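
To make the notion of lexical and sentiment-based features concrete, here is a minimal sketch that turns sentences into bag-of-words count vectors and adds one lexicon-based score. The tokenizer, the two tiny word lists and the example sentences are assumptions made up for illustration; they are not the features or resources evaluated in the study.

```python
import re
from collections import Counter

# Toy sentiment lexicon, invented for this example only.
POSITIVE = {"helped", "better", "relief"}
NEGATIVE = {"worse", "pain"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(sentences):
    """Return a sorted vocabulary and one count vector per sentence."""
    vocab = sorted({tok for s in sentences for tok in tokenize(s)})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for s in sentences:
        vec = [0] * len(vocab)
        for tok, count in Counter(tokenize(s)).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def lexicon_score(text):
    """Count of positive lexicon hits minus count of negative hits."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


sentences = ["The new treatment helped a lot", "The pain got worse and worse"]
vocab, vectors = bag_of_words(sentences)
scores = [lexicon_score(s) for s in sentences]
print(vocab)
print(vectors)
print(scores)
```

Each sentence is now a fixed-length numeric vector plus a score, which is the form a supervised classifier expects as input.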

The rest of the article is organized as follows. First, we present some related work in the area. Second, we describe the eDiseases dataset, the machine learning methods and the classification features used in our experiments. Third, we present the evaluation framework and results. Fourth, we discuss the obtained results. Finally, we draw the main conclusions of the study and outline future work.

A particular form of vector-based model that is receiving increasing attention from the NLP community is the word embeddings model. Word embeddings have been used to reduce data sparsity in the training data for supervised learning, achieving a significant increase in accuracy. Each dimension of the word vector represents a feature of the word, which usually has a semantic and/or syntactic interpretation. Word embeddings are typically induced using neural networks [37, 39, 49, 50].
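
One common way to use word embeddings as classification features is to average the vectors of the words in a sentence. The sketch below assumes a hand-written, three-dimensional embedding table (`EMBEDDINGS`) purely for illustration; real embeddings are learned from large corpora and have far more dimensions, and averaging is only one of several possible ways to combine them.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this example.
EMBEDDINGS = {
    "pain": [0.9, 0.1, 0.0],
    "ache": [0.8, 0.2, 0.1],
    "relief": [-0.7, 0.6, 0.2],
}


def sentence_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


avg = sentence_vector(["pain", "ache", "today"], EMBEDDINGS)
sim_close = cosine(EMBEDDINGS["pain"], EMBEDDINGS["ache"])
sim_far = cosine(EMBEDDINGS["pain"], EMBEDDINGS["relief"])
print(avg, sim_close, sim_far)
```

Because related words sit close together in the vector space, a word unseen in the labeled training data can still contribute a useful signal, which is how embeddings help with data sparsity.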

Zou et al. [52] use a Twitter data set to train word embeddings and use them as features for both a regularised linear (Elastic Net) and a nonlinear (Gaussian Process) regression function for the surveillance of infectious intestinal diseases in social media. Nikfarjam et al. [53] learn word embeddings from more than one million unlabeled user posts from DS and Twitter, and use them as a feature for a conditional random field, along with other lexical and semantic features, for pharmacovigilance. Dubois and Romano [54] use a combination of natural language processing and deep learning techniques to develop models that can learn embeddings of clinical terms and notes that can later be used in multiple applications.

In [58], prescriptions from discharge summaries are extracted using word embeddings and conditional random fields, while De Vine et al. [59] study the application to clinical concept extraction of a specific unsupervised machine learning method, called the Skip-gram Neural Language Model, combined with a lexical string encoding approach and sequence features.

We have seen that the classes are very unbalanced. It seems that, when patients share information in online communities, most of the information is neutral (i.e. it has no positive or negative connotations). The imbalanced dataset problem is quite common in many real applications, and may lead to poor performance for machine learning algorithms, since they tend to be biased towards the majority class [78]. To understand the extent of this problem, we have applied an under-sampling strategy that consists of randomly removing examples from the majority class to make the dataset balanced. When applying this re-sampling strategy, classification accuracy increases considerably for the three diseases (an increase of around 10 percentage points) (see Table 10).
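
Random under-sampling can be sketched in a few lines. The version below is one possible reading of the strategy: it assumes that every class is cut down to the size of the smallest class, which the text does not state explicitly for the three-class case. The function name `undersample` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(examples, labels, seed=0):
    """Randomly drop examples so that every class has the same size.

    Assumption: all classes are reduced to the smallest class size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # rng.sample draws n_min distinct examples without replacement.
        balanced.extend((x, y) for x in rng.sample(xs, n_min))
    rng.shuffle(balanced)
    return balanced


labels = ["neutral"] * 8 + ["positive"] * 3 + ["negative"] * 2
examples = [f"sentence {i}" for i in range(len(labels))]
balanced = undersample(examples, labels)
class_counts = Counter(y for _, y in balanced)
print(class_counts)
```

Under-sampling discards data, so it is usually applied to the training split only, leaving the test split with its natural class distribution.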

Feature engineering is the process of selecting and transforming variables when creating a predictive model using machine learning or statistical modeling. It typically includes feature creation, feature transformation, feature extraction, and feature selection, as listed in Figure 11. With deep learning, feature engineering is automated as part of the algorithm's learning.
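
Three of the steps named above can be shown on a toy table: creation (a new ratio column), selection (dropping columns with no variance) and transformation (z-score standardization). The column names, the ratio and the zero-variance rule are assumptions chosen for illustration and stand in for whatever methods a real project would use.

```python
from statistics import mean, pstdev, pvariance


def standardize(column):
    """Rescale a column to zero mean and unit (population) variance."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0] * len(column)
    return [(x - mu) / sigma for x in column]


def drop_constant_features(table, min_variance=0.0):
    """Keep only columns whose population variance exceeds the threshold."""
    return {name: col for name, col in table.items()
            if pvariance(col) > min_variance}


# Hypothetical raw table: one list per column.
table = {"age": [30, 40, 50], "site_id": [1, 1, 1], "dose_mg": [5, 10, 15]}

# Feature creation: a ratio derived from two raw columns.
table["dose_per_year_of_age"] = [
    d / a for d, a in zip(table["dose_mg"], table["age"])
]

# Feature selection, then feature transformation.
selected = drop_constant_features(table)
transformed = {name: standardize(col) for name, col in selected.items()}
print(sorted(transformed))
```

The constant `site_id` column carries no information for a model trained on this table, so the selection step removes it before the remaining columns are rescaled.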

With these goals in mind, we propose that images derived from routine radiological testing have been largely ignored in the context of precision medicine, and motivate the use of powerful new machine learning techniques applied to radiological images as the basis for novel and useful biomarker discovery.

Medical images are routinely collected and contain dense, objective information. Medical image analysis is therefore highly attractive for precision medicine phenotyping. Cross-sectional studies can comprehensively assess whole regions of the body during a single examination; a variety of human- and machine-detectable changes have been shown to quantify clinical and preclinical disease states [31, 32, 33]; and high-throughput image analysis techniques may be able to identify biomarkers which are closer to optimal for a given task. Recent advances in the field of medical image analysis have shown that machine-detectable image features can approximate the descriptive power of biopsy, microscopy and even DNA analysis for a number of pathologies [34, 35, 36].

Five-year mortality prediction was performed with deep learning, as well as a range of classifiers trained on the human-defined image features. Classification models tested included random forests, support vector machines and boosted tree algorithms. The random forest model performed the best of the human-defined feature classifiers. This was an expected result, as random forests are known to perform well in smaller data settings because they are fairly robust to noise [51]. We show the pooled ROC curves (combining 6 cross validations) for the testing sets of the deep learning and the best traditional models in Fig. 3. In Table 3 we present the mean and the standard deviation of the AUC and accuracy results for the deep learning and the best traditional models. Comparison is also made with selected published clinical scores predicting 5-year mortality.
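
The difference between per-fold AUC statistics and a pooled AUC can be sketched as follows. AUC is computed here from its rank interpretation: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The two small folds and their scores are invented toy values, not results from the study, which used six folds.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy test-set predictions per fold: (true labels, predicted scores).
folds = [
    ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]),
    ([1, 0, 0, 1], [0.7, 0.8, 0.1, 0.6]),
]

# Per-fold AUCs, summarized by mean and sample standard deviation.
fold_aucs = [roc_auc(y, s) for y, s in folds]
auc_mean, auc_std = mean(fold_aucs), stdev(fold_aucs)

# Pooled AUC: concatenate all test-set predictions, then score once.
pooled_labels = [y for ys, _ in folds for y in ys]
pooled_scores = [s for _, ss in folds for s in ss]
pooled_auc = roc_auc(pooled_labels, pooled_scores)
print(fold_aucs, auc_mean, auc_std, pooled_auc)
```

The pooled value need not equal the mean of the per-fold values, because pooling also compares cases across folds.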

Pooled ROC curves of the 6-fold cross validation experiment to predict 5 year mortality and survival outcomes from CT chest imaging, comparing the deep learning model and the human feature engineering/random forest model.

Finally, qualitative visual assessment of the CT chest images was performed. The highest certainty average predictions of the deep learning and engineered feature models for survival and mortality outcomes were examined by a consultant radiologist for visual differences. This revealed many of the expected associations; patients predicted to survive longer than five years appeared visually healthier than those predicted to die within 5 years. In Fig. 4, we show representative images from the three chest CT scans with the highest certainty predictions for mortality and survival.

Images at the level of the proximal left anterior descending coronary artery, with the most strongly predicted mortality and survival cases selected by averaging the predictions from the deep learning and engineered feature models. The mortality cases (left side) demonstrate prominent visual changes of emphysema, cardiomegaly, vascular disease and osteopaenia. The survival cases (right side) appear visually less diseased and frail.

