Mingda Li

Machine Learning Engineer, Ph.D.

Phone: (201) 702-3208
Email: mingda (dot) r (dot) li (at) gmail (dot) com

Short Bio:

Mingda Li is a Machine Learning Engineer at Pinterest. Before joining Pinterest, he was a Machine Learning SDE Intern at Facebook. He obtained his Ph.D. in Computer Science at New Jersey Institute of Technology, advised by Prof. Yi Chen.
His research interests lie in
  • Deep learning
  • Recommender systems
  • Keyword search and query processing
  • Pattern recognition
  • Bioinformatics
  • Signal processing

For more details, please check his resume.

    Research Projects:

  • Analyzing and Assisting Patient Decision-Making in Online Health Community
  • In recent years, many users have joined online health communities (OHCs) to seek information, suggestions, and social support. However, little is known about the role OHCs play in patients' decision-making processes. The aim of this research is to analyze OHC data to better understand those processes and to assist OHC users in their decision-making.

    To analyze and assist patients' decision-making, a novel classification model is designed to identify OHC discussion threads that are related to decision-making. This is achieved by building a two-step combined deep learning model. Empirical evaluation shows the effectiveness of the approach. Using the new model, we found that 46.9% of threads in the breast cancer discussion forum of the Cancer Survivors Network are related to decision-making, demonstrating the significant role that OHCs play in patient decision-making. To understand what patients consider during their decision-making processes, topic modeling techniques are used to analyze the concerns that patients express in those threads.

    For users seeking help in OHCs with their decision-making, the influences they receive are analyzed by developing a framework and deep learning techniques that identify influence relationships among posts. State-of-the-art text relevance measures are used to generate sparse feature vectors that represent text relevance. The probabilities that a post contains a question or an action are modeled as dense features. Deep learning techniques then combine the sparse and dense features to learn the influence relationships. Empirical evaluation demonstrates the effectiveness of the approach.

    Finally, to assist patients' decision-making, a personalized thread recommender system is developed to help users find relevant discussions without the burden of information overload. The system captures user interests in two dimensions: topics and concepts. The topic dimension is summarized through topic modeling techniques, and the concept dimension is encoded by a Convolutional Neural Network. A thread neural network is built to capture thread characteristics, and a user neural network is built to capture user interests, as well as interest shifts over time, using a Long Short-Term Memory model. User interests and thread characteristics are then matched to make recommendations. Experimental evaluation on multiple OHC datasets demonstrates a performance advantage over state-of-the-art recommender systems on various evaluation metrics.

  • The Leir Health Care & Economic Crisis Project
  • To analyze the connection between the economic and financial effects of the 2008-2009 housing crisis and the health-related impacts on individuals and families in the State of New Jersey, we purchased hospital patient admission data for the State of New Jersey from 1990 to 2014 and S&P 500 data from 1998 to 2017. Currently, we are verifying the correlation between admissions related to mental illnesses and stock price volatility. We plan to include more features to analyze the connections.
  • Efficient Top-k Path Search in Large Knowledge Bases
  • Finding the top-k most important paths between two given entities can be time-consuming, especially in large knowledge bases. We design a dynamic programming algorithm for path search that performs group joins in parallel. First, we extract the sets of entities, where each set contains the entities matching one of the provided entity names. Second, for each entity set, e.g., S1, we collect in parallel the edges (facts) in which the entities of S1 appear as either subject or object. We generate the result by checking whether the edges (facts) can be connected to the other entity set, e.g., S2, until the top-k paths are generated or no more paths can be found. In each iteration, we increase the path length by one parallel group join. We implement this algorithm in Scala and build a Spark cluster on top of Hadoop YARN on Amazon EC2. We evaluate the performance of the proposed algorithm on the YAGO2 knowledge base with 15 million labels and 9 million facts. Results show that the proposed algorithm is five times faster than the comparison algorithms from the related literature.
  • Constructing Target-Aware Results for Keyword Search on Knowledge Graphs
  • This paper is published in the Elsevier journal Data & Knowledge Engineering (DKE). To improve search quality on knowledge graphs, we develop a target-aware search engine called TAR, which can infer users' search intentions. To build such a system, we first propose the concepts of "modifier" and "return specifier" in query keywords. Leveraging information theory, we extract the modifiers and the return specifier from the keywords and infer the target entity type the user is looking for. Based on this information, we generate dependency digraphs and construct the top-k subgraphs as the returned results. To speed up the search process, we also build a keyword index and a path index. We implement TAR in C++ and use Oracle Berkeley DB to build the indices. We test TAR on IMDB, DBLP, DBPedia, and an INEX IMDB dataset, each with millions of vertices. For the benchmark INEX IMDB dataset, we design a ranking function that improves the mean average generalized precision from 3% to 43%. TAR outperforms BANKS and EASE in almost every case in terms of both effectiveness and efficiency.
  • Analysis Method for the Disease Feature of Omics Datasets
  • To mine disease patterns in omics data and better understand disease features, we conduct the following analyses:
    <1> To classify patients into three categories (normal, prediabetes, diabetes) using omics data, we propose a mid-level fusion analysis method. In this method, we first preprocess the raw data by filtering out records with missing attribute values and then normalize with z-scores. Next, we reduce the dimensionality with partial least squares. Finally, we use an SVM for classification. The AUC of the prediction results is above 0.95 on two omics datasets.
    <2> To predict whether a prediabetes patient will develop diabetes, recover, or remain prediabetic, we develop a hierarchical decision model to analyze omics data. We first extract the principal components from two different omics datasets using partial least squares regression. Then, we use a KNN classifier to make the prediction. The proposed method achieves a much higher precision (70%) than existing works (46% and 53%). We also extract the omics features with large variable importance in projection (VIP) values from the partial least squares regression and find that they are related to the development of prediabetes.
  • Analysis Methods to Predict Sudden Cardiac Death Based on Holter Signal
  • We propose a novel R-wave detection algorithm for DCG signals. The algorithm uses the average two-way slope and the relative height as its two main characteristics, and it can handle several abnormal cases in which R-waves are either missed or detected by mistake. The feasibility of the proposed algorithm is verified on the MIT-BIH Long-Term ECG database and the Holter records from FAHHMU. Experimental results indicate that this algorithm has a much higher correct detection rate (98.3%) than the maximum double-searching technology and the difference operation method (95.2% and 90.7%). To predict whether a patient will have a sudden cardiac death, we analyze the electrocardiograms of patients with and without heart disease. We first extract several important features related to heart disease. Then, we apply Naive Bayes and SVM to several feature combinations to predict sudden cardiac death. Using the provided real-life records of return patient visits (dead or alive), we evaluate the performance of three classification models assembled from these classifiers (AUC > 0.8).
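
    As a rough illustration of how a slope-and-height criterion for R-wave detection might look, here is a minimal Python sketch. It is not the published algorithm: the function name detect_r_peaks, the parameters slope_win, height_ratio, and refractory_s, and the specific thresholds are all assumptions made for this example, and the handling of missed or falsely detected R-waves described above is reduced to a simple refractory-period check.

```python
from typing import List


def detect_r_peaks(signal: List[float], fs: int, slope_win: int = 3,
                   height_ratio: float = 0.5,
                   refractory_s: float = 0.2) -> List[int]:
    """Return sample indices of candidate R-peaks (illustrative sketch only).

    Hypothetical criteria, chosen for this example:
      1. Two-way slope: the mean rise into the sample and the mean fall
         after it, each taken over `slope_win` samples, must be positive.
      2. Relative height: the sample's height above the lowest sample in
         its window must reach `height_ratio` of the largest such height.
    """
    n = len(signal)
    candidates = []  # (index, relative_height) pairs
    for i in range(slope_win, n - slope_win):
        window = signal[i - slope_win:i + slope_win + 1]
        rise = (signal[i] - signal[i - slope_win]) / slope_win
        fall = (signal[i] - signal[i + slope_win]) / slope_win
        if rise > 0 and fall > 0 and signal[i] == max(window):
            candidates.append((i, signal[i] - min(window)))
    if not candidates:
        return []

    # Keep only candidates that are tall relative to the tallest one.
    threshold = height_ratio * max(h for _, h in candidates)
    refractory = int(refractory_s * fs)  # minimum spacing between beats

    peaks: List[int] = []
    for i, height in candidates:
        if height < threshold:
            continue
        if peaks and i - peaks[-1] < refractory:
            # Two detections too close together: keep the taller sample.
            if signal[i] > signal[peaks[-1]]:
                peaks[-1] = i
            continue
        peaks.append(i)
    return peaks
```

    On a flat synthetic trace sampled at 100 Hz with three sharp spikes and one small bump, this sketch keeps the three spikes and rejects the bump, because the bump's relative height falls below the threshold. Real Holter data would need filtering and baseline correction first, which the sketch leaves out.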

    Publications:

  • Mingda Li, Weiting Gao, and Yi Chen. "A Topic and Concept Integrated Model for Thread Recommendation in Online Health Communities". The 29th ACM International Conference on Information and Knowledge Management, 2020.
  • Mingda Li, Jinhe Shi, and Yi Chen. "Analyzing Patient Decision Making in Online Health Communities". The 7th IEEE International Conference on Healthcare Informatics, 2019.
  • Yi Shan, Mingda Li, and Yi Chen. "Constructing Target-Aware Results for Keyword Search on Knowledge Graphs". Data & Knowledge Engineering, 2017.
  • Mingda Li, Haoran Zheng. "A mid-level fusion method for omics dataset". Beijing Biomedical Engineering, 2016, 35(3).
  • Yingtao Zhang, Jianhua Huang, Mingda Li, et al. "Novel R-wave Detection Algorithm of DCG Signal". Journal of Tianjin University Science and Technology, 2014, 47(1).