List of publications chronologically

28. AdaCC: cumulative cost-sensitive boosting for imbalanced classification

This work addresses the challenge of class imbalance in machine learning, where models are often biased towards the majority class and therefore perform poorly on the minority class. Traditional cost-sensitive learning relies on fixed misclassification costs, which demand careful tuning and domain knowledge. We propose AdaCC, a novel cost-sensitive boosting method that dynamically adjusts misclassification costs based on the model's performance across boosting rounds. AdaCC is parameter-free, relying on the boosting model's cumulative behavior, and guarantees theoretical bounds on training error.
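The idea of deriving misclassification costs from the cumulative behavior of the ensemble can be illustrated with a minimal sketch. The cost rule below (one plus the gap between false negative and false positive rate) and the function names `dynamic_costs` and `reweight` are assumptions made for illustration; they are not taken from the AdaCC paper and should not be read as its exact update.

```python
from math import exp

def dynamic_costs(y_true, ensemble_pred):
    """Derive per-class costs from the partial ensemble's error rates.

    Labels are +1 (minority) and -1 (majority). The rule is an assumed,
    illustrative one: the class that is currently misclassified more often
    receives a cost above 1, the other class keeps cost 1.
    """
    fn = sum(1 for t, p in zip(y_true, ensemble_pred) if t == 1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, ensemble_pred) if t == -1 and p == 1)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    fnr, fpr = fn / pos, fp / neg
    cost_pos = 1.0 + max(0.0, fnr - fpr)
    cost_neg = 1.0 + max(0.0, fpr - fnr)
    return cost_pos, cost_neg

def reweight(weights, y_true, round_pred, alpha, cost_pos, cost_neg):
    """AdaBoost-style update: misclassified instances are scaled by
    exp(alpha) and by the cost of their class, then weights are normalized."""
    new = []
    for w, t, p in zip(weights, y_true, round_pred):
        if t != p:
            cost = cost_pos if t == 1 else cost_neg
            w = w * cost * exp(alpha)
        new.append(w)
    total = sum(new)
    return [w / total for w in new]

# Toy data: the ensemble misses one of the two minority instances.
y_true = [1, 1, -1, -1, -1, -1]
ensemble_pred = [1, -1, -1, -1, -1, -1]
costs = dynamic_costs(y_true, ensemble_pred)
new_weights = reweight([1 / 6] * 6, y_true, ensemble_pred, 0.5, *costs)
```

In this toy run the missed minority instance ends up with the largest weight, so the next weak learner would concentrate on it. No cost parameter is supplied by the user; the costs follow from the error rates alone.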

27. Discrimination and Class Imbalance Aware Online Naive Bayes

Addressing fairness in data streams is critical as automated systems make pivotal decisions like hiring or credit assessment. Existing discrimination-aware methods often favor majority classes, sidelining the minority class. We introduce an adapted Naïve Bayes approach, prioritizing fairness for both majority and minority classes. Our method, incorporating dynamic instance weighting for class imbalance and concept drifts, outperforms existing fairness-aware techniques in discrimination scores and balanced accuracy across various datasets.
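As a rough illustration of instance weighting in an online Naive Bayes learner, the sketch below updates weighted counts one instance at a time. It covers only the class-imbalance part; the fairness and concept-drift handling described above are not modeled. The class name `WeightedOnlineNB` and the inverse-frequency weighting rule are assumptions for illustration, not the paper's method.

```python
from collections import defaultdict
from math import log

class WeightedOnlineNB:
    """Categorical Naive Bayes trained incrementally with instance weights."""

    def __init__(self):
        self.class_w = defaultdict(float)  # weighted class counts
        self.feat_w = defaultdict(float)   # weighted (class, index, value) counts
        self.seen = defaultdict(int)       # raw class counts, used for weighting
        self.values = defaultdict(set)     # distinct values seen per feature index

    def learn(self, x, y):
        self.seen[y] += 1
        total = sum(self.seen.values())
        # Assumed rule: the rarer the class so far, the larger the weight.
        weight = total / (len(self.seen) * self.seen[y])
        self.class_w[y] += weight
        for i, v in enumerate(x):
            self.feat_w[(y, i, v)] += weight
            self.values[i].add(v)

    def predict(self, x):
        total = sum(self.class_w.values())
        best, best_score = None, float("-inf")
        for c, cw in self.class_w.items():
            score = log(cw / total)
            for i, v in enumerate(x):
                k = len(self.values[i]) or 1  # Laplace smoothing denominator
                score += log((self.feat_w.get((c, i, v), 0.0) + 1.0) / (cw + k))
            if score > best_score:
                best, best_score = c, score
        return best

# Toy stream: class 0 dominates, class 1 is rare.
nb = WeightedOnlineNB()
stream = [(("x",), 0)] * 3 + [(("y",), 1)] + [(("x",), 0)] * 3 + [(("y",), 1)]
for features, label in stream:
    nb.learn(features, label)
```

With this weighting, the two minority instances contribute almost as much total weight as the six majority instances, so the class prior no longer drowns out the minority class.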

26. Multi-fairness under class-imbalance

Recent studies have revealed imbalances in datasets used for fairness-aware machine learning with multiple protected attributes, especially in critical minority classes. Existing methods overlook this imbalance issue, focusing solely on error-discrimination trade-offs and potentially amplifying biases. Addressing multi-discrimination and class-imbalance, we introduce Multi-Max Mistreatment (MMM), a fairness measure considering both protected group attributes and class membership. To tackle this combined problem, we propose Multi-Fair Boosting Post Pareto (MFBPP), a boosting approach incorporating MMM-costs in distribution updates and post-training, optimizing accuracy, class balance, and fairness.

25. Parity-based cumulative fairness-aware boosting

This research addresses discrimination in data-driven AI systems due to societal biases and class imbalance. Existing fairness-aware methods focus on overall accuracy but may exacerbate discrimination, especially in underrepresented groups. AdaFair, our proposed solution, is a fairness-aware boosting ensemble that dynamically adjusts data distribution based on both class errors and cumulative fairness-related performance. During post-training, AdaFair optimizes ensemble size for balanced error, effectively tackling both imbalance and discrimination. AdaFair supports various fairness notions and significantly mitigates discriminatory outcomes, ensuring equal social privileges, particularly in contexts like employment.

24. A survey on datasets for fairness-aware machine learning

This paper delves into the critical issue of fairness in data-driven artificial intelligence systems, emphasizing empirical evaluations on benchmark datasets representing diverse scenarios. The study focuses on tabular data, a prevalent format for fairness-aware ML. Using Bayesian networks, the research explores relationships among attributes, especially concerning protected and class attributes. Through exploratory analysis, the study aims to enhance understanding of biases within the datasets, providing valuable insights for fairness-aware machine learning solutions.

23. Online Fairness-Aware Learning with Imbalanced Data Streams

This work addresses the challenge of fairness-aware learning in dynamic data streams, where class imbalance and evolving distributions complicate model adaptation. We introduce FABBOO, an online boosting approach that ensures fairness by dynamically adjusting the training distribution based on the stream's class imbalance and by modifying the decision boundary. Experiments on real-world and synthetic datasets show that the method outperforms competing approaches, with notable improvements in balanced accuracy, G-mean, recall, kappa, and statistical parity.
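Several of the evaluation measures named above can be computed directly from predictions. The sketch below is illustrative only: the function names and toy data are assumptions, and statistical parity is taken as the difference in positive prediction rates between the non-protected and the protected group, which is one common definition.

```python
from math import sqrt

def rates(y_true, y_pred):
    """Return (TPR, TNR) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return tp / pos, tn / neg

def balanced_accuracy(y_true, y_pred):
    tpr, tnr = rates(y_true, y_pred)
    return (tpr + tnr) / 2

def g_mean(y_true, y_pred):
    tpr, tnr = rates(y_true, y_pred)
    return sqrt(tpr * tnr)

def statistical_parity(y_pred, protected):
    """P(pred = 1 | non-protected) - P(pred = 1 | protected)."""
    prot = [p for p, s in zip(y_pred, protected) if s == 1]
    non = [p for p, s in zip(y_pred, protected) if s == 0]
    return sum(non) / len(non) - sum(prot) / len(prot)

# Toy data: two positives among eight instances; the first four are protected.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]
protected = [1, 1, 1, 1, 0, 0, 0, 0]
```

Unlike plain accuracy, balanced accuracy and G-mean both average over the per-class rates, so a classifier that ignores the minority class scores poorly on them. A statistical parity value of 0 would mean both groups receive positive predictions at the same rate.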

22. LSTM Based Sentiment Analysis for Cryptocurrency Prediction

This research focuses on predicting cryptocurrency price movements by analyzing sentiment in social media, particularly in Chinese posts from Sina-Weibo. The study develops a pipeline for data capture, creates a crypto-specific sentiment dictionary, and proposes an LSTM-based recurrent neural network. By correlating social media sentiment with historical cryptocurrency prices, the approach outperforms existing models, showing an 18.5% improvement in precision and a 15.4% improvement in recall, demonstrating its effectiveness in predicting volatile price fluctuations.

21. Using Machine Learning to Automate Mammogram Images Analysis

This work introduces a computer-aided automatic mammogram analysis system to enhance breast cancer detection. The system utilizes discrete wavelet transforms and Fourier cosine transform for feature extraction, followed by entropy-based feature selection. Various pattern recognition methods and a voting classification scheme are employed for classification. The system, validated on the Eastern Health dataset in Canada, effectively improves sensitivity, specificity, and accuracy in discriminating normal and cancerous mammogram images, addressing the challenges of false positives and low specificity in mammography technology.

20. A Data-driven Human Responsibility Management System

The paper introduces a smart safety management system based on responsibility big data analysis and the Internet of Things (IoT). The system aims to enhance workplace safety by instructing staff, automating risk assessments, and raising alerts when necessary. It addresses the rise in occupational accidents by providing real-time supervision and self-reminder mechanisms. A real-world implementation demonstrates the system's effectiveness, improving staff accountability and responsibility fulfillment while minimizing accidents and damages.

19. FABBOO - Online Fairness-Aware Learning Under Class Imbalance

In dynamic environments where data arrive sequentially, fairness-aware learning must adapt continually. Existing fairness-aware stream classifiers often neglect class distribution skewness, leading to discrimination against minority instances. Our solution, FABBOO, is an online boosting approach that adjusts the training distribution based on stream imbalance and historical discriminatory behavior. By considering long-term class imbalance and fairness, FABBOO maintains a valid and fair classifier. Experimental results demonstrate the effectiveness of this approach, ensuring both good predictive performance and fairness-related outcomes.

18. Bias in data-driven artificial intelligence systems—An introductory survey

This survey explores the critical issue of bias in AI systems, emphasizing the need to embed ethical and legal principles in their design and deployment. As AI-based decisions impact individuals and society, concerns about human rights violations arise. The study focuses on data-driven AI, highlighting problems related to data gathering and processing, which may lead to biased decisions based on demographic features like race and gender. The survey provides a multidisciplinary overview that covers technical challenges and solutions and suggests research directions within a legal framework, aiming to ensure social good while harnessing the potential of AI technology.

17. Semi-supervised learning and fairness-aware learning under class imbalance

The thesis addresses data quality, class imbalance, and fairness issues in machine learning. It emphasizes the impact of class imbalance on classification models, often leading to biased outcomes, especially in high societal impact domains. The research introduces methods to handle class imbalance in semi-supervised learning, utilizing data augmentation to equalize class distributions effectively. Additionally, the thesis proposes techniques to mitigate unfairness in supervised models, considering all classes and outperforming existing methods in terms of performance and fairness outcomes. The study underscores the importance of addressing these challenges for unbiased and reliable machine learning algorithms.

16. FairNN - Conjoint Learning of Fair Representations for Fair Decisions

The paper introduces FairNN, a neural network for fairness-aware learning that integrates feature representation and classification. FairNN optimizes a multi-objective loss function that suppresses protected attributes, minimizes reconstruction loss, and ensures fairness in classification through an equalized-odds-based fairness regularizer. Our joint approach outperforms methods that treat representation learning and classification separately, as demonstrated across diverse datasets. Additionally, adaptable regularizer weights offer a versatile framework for fair representation learning and decision making.

15. AdaFair: Cumulative Fairness Adaptive Boosting

The rise of ML-based decision-making in critical areas like recidivism and job hiring raises discrimination concerns. Existing fairness-aware ML approaches focus on accuracy and fairness but overlook class imbalance. Our solution, AdaFair, integrates fairness and class balance. It employs AdaBoost, adjusting instance weights for fairness and optimizing ensemble size for balanced classification. AdaFair achieves parity in true positive and true negative rates for protected and non-protected groups, surpassing existing methods by up to 25% in balanced error, addressing both fairness and class imbalance effectively.
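Parity in true positive and true negative rates between groups, together with balanced error, can be measured as in the following sketch. The function names and toy data are assumptions for illustration; the gap is computed here as the sum of absolute TPR and TNR differences, and the code assumes every group contains both classes.

```python
def group_rates(y_true, y_pred, group, value):
    """TPR and TNR restricted to instances whose group attribute equals value."""
    tp = fn = tn = fp = 0
    for t, p, g in zip(y_true, y_pred, group):
        if g != value:
            continue
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif p == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)

def equalized_odds_gap(y_true, y_pred, group):
    """|TPR difference| + |TNR difference| between non-protected (0) and protected (1)."""
    tpr_p, tnr_p = group_rates(y_true, y_pred, group, 1)
    tpr_n, tnr_n = group_rates(y_true, y_pred, group, 0)
    return abs(tpr_n - tpr_p) + abs(tnr_n - tnr_p)

def balanced_error(y_true, y_pred):
    """1 - (TPR + TNR) / 2 over all instances."""
    everyone = [0] * len(y_true)
    tpr, tnr = group_rates(y_true, y_pred, everyone, 0)
    return 1 - (tpr + tnr) / 2

# Toy data: the first four instances belong to the protected group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]
group = [1, 1, 1, 1, 0, 0, 0, 0]
```

In this toy example the overall balanced error looks moderate (0.25), yet the two groups are treated very differently: the protected group loses true positives while the non-protected group loses true negatives. That is why the two measures are reported side by side.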

14. Fairness-Enhancing Interventions in Stream Classification

Automated data-driven decision systems lack human supervision, raising concerns about fairness. Current fairness-aware methods treat fairness as a fixed model applied to future data instances, ignoring evolving data streams. We propose interventions that modify input data to ensure fairness for any stream classifier applied. Experiments with real and synthetic data demonstrate our approach's good predictive performance and low discrimination scores, addressing the challenge of evolving data characteristics in automated decision-making systems.

13. FAE: A Fairness-Aware Ensemble Framework

Automated decision-making through big data and machine learning can lead to biased outcomes, especially concerning protected groups. Existing fairness-aware machine learning methods target specific stages like input data, algorithms, or models. However, discrimination often arises from intricate interactions between data and algorithms, necessitating a comprehensive approach. The proposed Fairness-Aware Ensemble (FAE) framework intervenes in both pre- and post-processing stages. Pre-processing addresses group and class imbalances, generating balanced training samples. In post-processing, the framework addresses class overlapping by adjusting the decision boundary to promote fairness, offering a holistic solution to the complexity of discriminatory patterns in automated decision-making systems.

12. Simple-ML: Towards a framework for semantic data analytics workflows

The paper introduces Simple-ML, a framework utilizing semantic technologies for efficient, robust, and reusable data analytics workflows. Semantic data models underpin the framework, enabling the development of analytics workflows. The paper illustrates an example application in the mobility domain, demonstrating the practical implementation of Simple-ML's data models.

11. Sentiment analysis on big sparse data streams with limited labels

Sentiment analysis of vast social media data like Twitter is challenging due to lack of labels. This study employs distant supervision and semi-supervised learning on a massive 2015 tweet stream (228 million tweets without retweets, 275 million with retweets). Various semi-supervised methods, including Self-Learning, Co-Training, and Expectation-Maximization, were explored, revealing efficient stream processing with a three-month sliding window. To address class imbalance, data augmentation in semi-supervised learning was applied, significantly outperforming default methods. The resulting sentiment-annotated dataset, TSentiment15, is shared with the community for evaluation and method development.

10. Enriching lexicons with ephemeral words for sentiment analysis in social streams

Traditional sentiment analysis relies on fixed dictionaries, but with Web 2.0, non-sentimental words gain temporary sentiment based on events (e.g., "refugees", "Trump"). This study introduces a method to identify and monitor such "ephemeral words" from social streams. These words convey sentiment without being inherently sentimental and their sentiment changes over time. Unlike fixed lexicons, detecting and estimating sentiment for such words enhances lexicon-based approaches, as demonstrated by our experiments, showcasing improved performance.

9. Tracking the History and Evolution of Entities: Entity-centric Temporal Analysis of Large Social Media Archives

Answering questions such as how popular the Greek Prime Minister was in 2015, what sentiment was expressed, and which controversies arose requires analyzing archived data. Social media content, especially from platforms like Twitter and Facebook, offers comprehensive societal documentation. Our proposed entity-centric approach delves into these archives, defining measures to assess entity evolution, sentiment, and related entities. This method provides valuable insights for sociologists, historians, and researchers interested in studying the historical context and evolution of entities and events.

8. Time-Aware and Corpus-Specific Entity Relatedness

Entity relatedness is vital in applications like information retrieval and entity recommendation. Context, especially time, significantly impacts entity relationships. We introduce a versatile model utilizing entity-aware word embeddings from the corpus. This approach, independent of external knowledge and language, offers simplicity and flexibility, making it applicable across diverse contexts and applications.

7. Dealing with Bias via Data Augmentation in Supervised Learning Scenarios

Research efforts across data mining, machine learning, and related fields tackle bias and discrimination-aware learning. This study concentrates on supervised learning, addressing biases tied to attributes like race or gender. Introducing data augmentation methods at the input layer, the research demonstrates their effectiveness in mitigating biases, as validated through experiments on real-world datasets.

6. TweetsKB: A public and large-scale RDF corpus of annotated tweets

The paper introduces TweetsKB, a large, publicly available corpus comprising 1.5 billion tweets from January 2013 to November 2017. The resource includes metadata, entities, hashtags, user mentions, and sentiment data, and is structured using RDF/S vocabularies. The paper describes the extraction process and highlights use cases that demonstrate the corpus's utility for entity-centric exploration, data integration, and knowledge discovery across diverse fields.

5. Large scale sentiment learning with limited labels

Sentiment analysis is vital for understanding vast social media opinions. Existing datasets like TSentiment are limited. To address this gap, we annotated a large 2015 Twitter dataset (275 million tweets with retweets). Utilizing unlabeled and labeled data with semi-supervised learning, specifically Self-Learning, we enhance dataset quality, providing a valuable resource for research.

4. Sentiment classification over opinionated data streams through informed model adaptation

Opinionated data streams, reflecting diverse user opinions, pose challenges in mining due to concept shifts. Addressing this, adaptive learning models employing age-based adaptation discard outdated information, focusing on recent data. Existing methods often use fixed ageing strategies, like window sizes or ageing factors, disregarding the evolving nature of opinions over time. Flexible adaptation strategies are crucial to capture nuanced shifts in opinions across the dynamic data stream.
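The two fixed ageing strategies mentioned above, a sliding window and an ageing factor, can be sketched as follows. The class name `AgedCounts`, the decay value, and the window size are assumptions chosen for illustration.

```python
from collections import defaultdict, deque

class AgedCounts:
    """Per-class word counts that fade with a fixed ageing factor."""

    def __init__(self, decay):
        self.decay = decay  # in (0, 1]; 1.0 means no ageing
        self.counts = defaultdict(float)

    def tick(self):
        """Advance one time step: every stored count is multiplied by decay."""
        for key in self.counts:
            self.counts[key] *= self.decay

    def observe(self, word, label):
        self.counts[(word, label)] += 1.0

# Ageing factor: an observation loses half its influence per time step.
ac = AgedCounts(decay=0.5)
ac.observe("great", "pos")
ac.tick()
ac.observe("great", "pos")
ac.tick()

# Sliding window: only the three most recent documents are kept.
window = deque(maxlen=3)
for doc_id in range(5):
    window.append(doc_id)
```

Both strategies forget at a rate fixed in advance (`decay` or `maxlen`), regardless of whether opinions in the stream are currently stable or shifting quickly.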

3. Multi-aspect entity-centric analysis of big social media archives

The paper introduces an entity-centric method for analyzing social media archives, crucial for historical and sociological research. It proposes measures to assess entities' representation across time periods and aspects. The study, using a 4-year Twitter archive, demonstrates valuable insights from this entity-centric, multi-aspect analysis approach.

2. Compressing Inverted Files using Modified LZW

The paper introduces a modified Lempel-Ziv-Welch (LZW) algorithm for compressing inverted files. It operates on an index in which terms are treated as characters, optimizing the storage of encoded document identifiers. The approach incorporates preprocessing steps such as document identifier reassignment and gaps, along with post-processing methods such as IPC encoding and GZIP, to increase space savings. Experimental results on the Wikipedia dataset demonstrate the superior space compaction achieved by this modified LZW algorithm.
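Two of the building blocks named above, gaps between document identifiers and LZW coding over arbitrary symbols, can be sketched as follows. This is textbook LZW, not the modified variant from the paper; the function names `to_gaps` and `lzw_encode`, and the choice to seed the dictionary with the distinct input symbols, are assumptions for illustration.

```python
def to_gaps(doc_ids):
    """Replace sorted document identifiers with differences between neighbours."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def lzw_encode(symbols):
    """Textbook LZW over a sequence of hashable symbols (characters, terms, gaps).

    The dictionary is seeded with the distinct symbols in order of first
    appearance, so a decoder would need that seed alphabet as well.
    """
    table = {}
    for s in symbols:
        if (s,) not in table:
            table[(s,)] = len(table)
    out, current = [], ()
    for s in symbols:
        candidate = current + (s,)
        if candidate in table:
            current = candidate  # keep extending the longest known phrase
        else:
            out.append(table[current])       # emit code of the known phrase
            table[candidate] = len(table)    # learn the new, longer phrase
            current = (s,)
    if current:
        out.append(table[current])
    return out

gaps = to_gaps([3, 7, 8, 15])
codes = lzw_encode(list("abababab"))
```

Gaps turn large, growing identifiers into small numbers that repeat more often, which gives a dictionary coder such as LZW more repeated phrases to exploit. In the toy run, eight input symbols are encoded as five codes.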

1. Partial Order Preserving Encryption Search Trees

Rapid internet service expansion results in vast, dispersed user data, posing privacy challenges. Addressing this, we introduce a tree-based structure for encrypted data, ensuring quick search and operations. Our approach exposes limited ordering information for fast data location. Unlike insecure total order preservation, our method balances security and efficiency effectively.
