Happy to share ideas, projects and code in data science
A project-oriented tutorial for full-stack deep learning engineering. It covers EDA, dataset augmentation, deep learning model training with PyTorch, building both backend and frontend web services with FastAPI and JavaScript, and deploying the application with Docker containers on Google Cloud.
The tutorial covers the following parts:
Part1: Objective
Part2: Set up the development environment
Part3: Exploratory data analysis
Part4: Train a baseline deep learning model
Part5: Download and prepare the external text datasets
Part6: Generate an augmented dataset by pseudo-labeling the external datasets with the baseline model (sketched after this list)
Part7: Pretrain the model on the augmented dataset, then fine-tune it on the original dataset
Part8: Build and deploy the backend web service
Part9: Build and deploy the frontend service and connect it to the backend service
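As a taste of Part6, here is a minimal pseudo-labeling sketch in PyTorch. The names `model` and `unlabeled_loader` and the 0.9 confidence threshold are illustrative assumptions, not the tutorial's actual code:

```python
import torch

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, device, threshold=0.9):
    """Label external texts with the baseline model, keeping only
    predictions the model is confident about."""
    model.eval()
    kept_texts, kept_labels = [], []
    for texts, inputs in unlabeled_loader:  # assumed (raw text, tensor) batches
        probs = torch.softmax(model(inputs.to(device)), dim=-1)
        conf, preds = probs.max(dim=-1)
        keep = conf >= threshold  # drop low-confidence examples
        kept_texts += [t for t, k in zip(texts, keep) if k]
        kept_labels += preds[keep].cpu().tolist()
    return kept_texts, kept_labels
```

The pseudo-labeled pairs can then be mixed into the training data for the Part7 pretraining stage.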
Key words: Web Scraping, Popularity Ranking, Naive Bayes, SVM, TF-IDF, Content-based filtering, Collaborative filtering
Part1. Web Scraping and EDA
Part2. Weighted Rating, Naive Bayes, and SVM
Part3. Content-Based Recommender (see the TF-IDF sketch after this list)
Part4. Collaborative Filtering with Neighborhood Methods
Part5. Collaborative Filtering with Latent Factor Models
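A minimal sketch of the Part3 content-based approach with scikit-learn; the toy `movies` DataFrame and its columns are assumptions standing in for the scraped data:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for the scraped dataset.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "description": ["space opera adventure", "space adventure sequel",
                    "romantic comedy"],
})

# TF-IDF over item descriptions, then cosine similarity between items.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(movies["description"])
sim = linear_kernel(matrix, matrix)

def recommend(title, top_n=2):
    idx = movies.index[movies["title"] == title][0]
    order = sim[idx].argsort()[::-1]            # most similar first
    order = [i for i in order if i != idx][:top_n]
    return movies["title"].iloc[order].tolist()

print(recommend("Movie A"))  # items most similar to Movie A
```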
Key words: Biomedical Question Answering, General-to-domain-specific transfer learning, SQuAD to BioASQ, RoBERTa
First, pretrain RoBERTa on a biomedical publication corpus. Next, fine-tune the RoBERTa QA model on the SQuAD dataset. Then, fine-tune it on the BioASQ dataset. Validated on the BioASQ test set, the final model reaches an F1 score of 84 and an exact match (EM) of 72, which is a decent result.
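A minimal sketch of the extractive-QA fine-tuning step with Hugging Face transformers; the one-example toy dataset and the hyperparameters are assumptions, not the project's actual setup:

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForQuestionAnswering.from_pretrained("roberta-base")

class ToyQADataset(Dataset):
    """One-example stand-in for the tokenized SQuAD/BioASQ data."""
    def __init__(self):
        self.enc = tokenizer(
            "What guards the bridge?", "A dragon guards the old bridge.",
            truncation=True, padding="max_length", max_length=32,
            return_tensors="pt")
    def __len__(self):
        return 1
    def __getitem__(self, i):
        item = {k: v[0] for k, v in self.enc.items()}
        item["start_positions"] = torch.tensor(9)   # dummy span labels
        item["end_positions"] = torch.tensor(10)
        return item

args = TrainingArguments(output_dir="qa-roberta", num_train_epochs=1,
                         per_device_train_batch_size=1, learning_rate=3e-5)
Trainer(model=model, args=args, train_dataset=ToyQADataset()).train()
```

The same loop is repeated for each stage of the transfer chain, swapping in the next dataset and starting from the previous stage's weights.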
Key words: GPT-2, Generative language modeling, story generation
Fine-tune GPT-2 on domain-specific texts. The fine-tuned model reduces perplexity on the validation set from 39 to 24, which is a clear improvement. However, by human evaluation, the stories generated by the plain and the fine-tuned models are at almost the same level, which means there is still a long way to go in generative language modeling.
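For reference, the validation perplexity reported above can be measured as below; this minimal sketch uses the stock `gpt2` checkpoint rather than the fine-tuned weights:

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text):
    """Perplexity = exp(mean negative log-likelihood of the tokens)."""
    enc = tokenizer(text, return_tensors="pt")
    loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("Once upon a time, a dragon guarded the old bridge."))
```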
Key words: RoBERTa & CNN, WordPiece & character hybrid embedding, two-stage models
The first-stage model combines a RoBERTa transformer with a CNN to predict the start and end token indices of the sentiment span in the WordPiece token sequence of a tweet. The second-stage model uses character-level embeddings and a CNN to further refine the start and end indices. Adding the second stage improved the Jaccard score from 0.713 to 0.726.
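A sketch of a CNN span-prediction head in the spirit of the first-stage model; the layer sizes are illustrative assumptions, and the transformer backbone is replaced by random hidden states here:

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """1-D CNN over token representations producing start/end logits."""
    def __init__(self, hidden=768):
        super().__init__()
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.out = nn.Conv1d(128, 2, kernel_size=1)  # start & end channels

    def forward(self, hidden_states):          # (batch, seq_len, hidden)
        x = hidden_states.transpose(1, 2)      # Conv1d wants (batch, hidden, seq)
        x = torch.relu(self.conv(x))
        logits = self.out(x)                   # (batch, 2, seq_len)
        return logits[:, 0], logits[:, 1]      # start logits, end logits

# Usage with dummy transformer outputs:
head = SpanHead()
starts, ends = head(torch.randn(4, 96, 768))
print(starts.shape, ends.shape)  # torch.Size([4, 96]) twice
```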
Key words: Sentiment analysis, LSTM, Attention
Build and train an LSTM model with an attention mechanism to identify insincere Quora questions, implemented in Python with Keras. The score ranks in the top 16% of 4,037 teams.
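A minimal sketch of such an architecture in Keras; the vocabulary size, sequence length, and layer widths are illustrative assumptions, not the competition configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, MAXLEN, EMB = 20000, 70, 128  # assumed hyperparameters

inp = layers.Input(shape=(MAXLEN,))
x = layers.Embedding(VOCAB, EMB)(inp)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# Self-attention over the LSTM outputs, then pool to a single vector.
attn = layers.Attention()([x, x])
x = layers.GlobalAveragePooling1D()(attn)
out = layers.Dense(1, activation="sigmoid")(x)  # insincere vs. sincere

model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```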
Key words: LightGBM, news & market state
Build a LightGBM model to predict how stock prices will change based on the market state and news articles. The performance ranks in the top 4% of 2,927 teams.
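A minimal LightGBM training sketch; the random features stand in for the engineered market-plus-news features, and the parameters are assumptions rather than the competition settings:

```python
import lightgbm as lgb
import numpy as np

# Toy stand-in for the engineered market-state and news features.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)

train = lgb.Dataset(X[:800], label=y[:800])
valid = lgb.Dataset(X[800:], label=y[800:], reference=train)

params = {
    "objective": "binary",   # e.g., will the stock move up or down
    "learning_rate": 0.05,
    "num_leaves": 63,
    "metric": "auc",
}
model = lgb.train(params, train, num_boost_round=200, valid_sets=[valid])
```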
Key words: EDA, Multi-label classification, Feature Engineering
First, pre-process the cuisine recipes, then feed them to multiple models and choose the one with the best accuracy to predict the cuisine type.
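A minimal sketch of one candidate model in that comparison; the toy recipes and the TF-IDF + logistic regression pipeline are assumptions, not the project's full pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy recipes: each example is the ingredient list joined into one string.
recipes = ["soy sauce ginger rice", "tortilla beans salsa",
           "basil tomato pasta"]
cuisines = ["asian", "mexican", "italian"]

# One candidate model; the project compares several and keeps the most accurate.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(recipes, cuisines)
print(pipe.predict(["cumin beans tortilla"]))  # -> ['mexican'] on this toy data
```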