Emily’s NLP Blog

Happy to share ideas, projects, and code in data science.


Full Stack Deep Learning NLP: Building and Deploying a Reading Passages Readability Evaluator

A project-oriented tutorial on full stack deep learning engineering. It covers exploratory data analysis (EDA), dataset augmentation, deep learning model training with PyTorch, building the backend and frontend web services with FastAPI and JavaScript, and deploying the application as a Docker container on Google Cloud.

The tutorial is organized as follows:

Part 1: Objective
Part 2: Set up the development environment
Part 3: Exploratory data analysis
Part 4: Train a baseline deep learning model
Part 5: Download and prepare the external text datasets
Part 6: Generate an augmented dataset by pseudo-labeling the external datasets with the baseline model (a short sketch follows this outline)
Part 7: Pretrain the model on the augmented dataset and fine-tune it on the original dataset
Part 8: Build and deploy the backend web service
Part 9: Build and deploy the frontend service and connect it to the backend service
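
To give a flavor of the pseudo-labeling step in Part 6, here is a minimal sketch of how a baseline model could score unlabeled external passages. The checkpoint path, CSV file, and column names are illustrative assumptions, not the tutorial's actual code.

```python
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BASELINE_CKPT = "baseline-readability-model"   # hypothetical local checkpoint
EXTERNAL_CSV = "external_passages.csv"         # assumed column: "excerpt"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(BASELINE_CKPT)
# num_labels=1 makes the head a single-output regression (readability score).
model = AutoModelForSequenceClassification.from_pretrained(
    BASELINE_CKPT, num_labels=1
).to(device).eval()

df = pd.read_csv(EXTERNAL_CSV)
pseudo_labels = []
with torch.no_grad():
    for i in range(0, len(df), 32):                      # simple batching
        batch = df["excerpt"].iloc[i:i + 32].tolist()
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=256, return_tensors="pt").to(device)
        scores = model(**enc).logits.squeeze(-1)         # predicted readability
        pseudo_labels.extend(scores.cpu().tolist())

df["target"] = pseudo_labels                             # pseudo-labeled set
df.to_csv("augmented_dataset.csv", index=False)
```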

Book Recommender

Key words: Web Scraping, Popularity Ranking, Naive Bayes, SVM, TF-IDF, Content-based Filtering, Collaborative Filtering

Part 1: Web Scraping and EDA
Part 2: Weighted Rating, Naive Bayes, and SVM (a weighted-rating sketch follows this list)
Part 3: Content-Based Recommender
Part 4: Collaborative Filtering with Neighborhood Methods
Part 5: Collaborative Filtering with Latent Factor Models
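
For the popularity ranking in Part 2, a common approach is the IMDB-style Bayesian weighted rating. The sketch below assumes a pandas DataFrame with vote_count and vote_average columns, which are illustrative names rather than the project's actual schema.

```python
import pandas as pd

def weighted_rating(df: pd.DataFrame, quantile: float = 0.90) -> pd.Series:
    """IMDB-style weighted rating: WR = (v/(v+m))*R + (m/(v+m))*C."""
    C = df["vote_average"].mean()                 # mean rating across all books
    m = df["vote_count"].quantile(quantile)       # minimum votes to be listed
    v = df["vote_count"]
    R = df["vote_average"]
    return (v / (v + m)) * R + (m / (v + m)) * C

# Example: rank books by the weighted score instead of the raw average.
books = pd.DataFrame({
    "title": ["A", "B", "C"],
    "vote_count": [1200, 15, 400],
    "vote_average": [4.2, 4.9, 4.5],
})
books["score"] = weighted_rating(books)
print(books.sort_values("score", ascending=False))
```

The Bayesian prior keeps a book with very few (but very high) ratings from outranking consistently well-rated books.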

BioMedical Question Answering

Key words: BioMedical Question Answering, General to domain-specific transfer learning, SQuAD to BioASQ, RoBERTa

First, pretrain RoBERTa on a biomedical publication corpus. Next, fine-tune the RoBERTa QA model on the SQuAD dataset. Then, fine-tune it on the BioASQ dataset. The final model, validated on the BioASQ test set, reaches an F1 score of 84 and an exact match (EM) score of 72, which is a decent result.
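
As a rough illustration of the SQuAD-to-BioASQ transfer idea, the sketch below loads a publicly available SQuAD-tuned RoBERTa checkpoint and runs extractive QA on a biomedical snippet; the checkpoint name and the sample question are assumptions for demonstration, not the project's exact setup.

```python
from transformers import pipeline

# A SQuAD-tuned RoBERTa checkpoint; the project would further fine-tune
# such a model on the BioASQ dataset before evaluation.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = (
    "Imatinib is a tyrosine kinase inhibitor used in the treatment of "
    "chronic myeloid leukemia."
)
result = qa(question="What is imatinib used to treat?", context=context)
print(result["answer"], result["score"])
```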

Fine-tune GPT-2 to generate stories

Key words: GPT-2, Generative language modeling, Story generation

Fine-tune GPT-2 on domain-specific texts. The fine-tuned model decreases perplexity on the validation set from 39 to 24, which is an improvement. However, in human evaluation, the stories generated by the plain and fine-tuned models are almost at the same level, which suggests we still have a long way to go in generative language modeling.
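
A minimal sketch of sampling stories from a fine-tuned checkpoint is shown below; the ./gpt2-stories path and the prompt are hypothetical placeholders.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# "./gpt2-stories" stands in for the fine-tuned checkpoint directory.
tokenizer = GPT2TokenizerFast.from_pretrained("./gpt2-stories")
model = GPT2LMHeadModel.from_pretrained("./gpt2-stories").eval()

prompt = "Once upon a time, a lighthouse keeper found a strange map."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=200,
    do_sample=True,          # sampling rather than greedy decoding
    top_p=0.92,              # nucleus sampling
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```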

Two-stage-models to extract tweet sentiment

Key words: RoBERTa & CNN, WordPiece & character hybrid embedding, Two-stage models

The first-stage model relies on a RoBERTa transformer and a CNN to predict the start and end token indices of the sentiment span in the WordPiece token sequence of a tweet. The second-stage model uses character-level embeddings and a CNN to further refine the start and end character indices of the sentiment span. Adding the second-stage model significantly improved the Jaccard score from 0.713 to 0.726.
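
The sketch below illustrates the general shape of the first-stage model: RoBERTa hidden states pass through a 1-D convolution to produce start and end logits over tokens. The layer sizes and checkpoint name are assumptions, not the exact architecture used.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class SpanExtractor(nn.Module):
    """RoBERTa encoder + Conv1d head emitting start/end logits per token."""
    def __init__(self, model_name: str = "roberta-base"):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained(model_name)
        hidden = self.roberta.config.hidden_size
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.head = nn.Linear(128, 2)            # 2 logits per token: start, end

    def forward(self, input_ids, attention_mask):
        h = self.roberta(input_ids, attention_mask=attention_mask).last_hidden_state
        h = torch.relu(self.conv(h.transpose(1, 2))).transpose(1, 2)
        start_logits, end_logits = self.head(h).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

# Taking the argmax over the start and end logits yields the predicted span,
# which the second-stage character-level model would then refine.
```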

Quora Insincere Question Identification

Key words: Sentiment analysis, LSTM, Attention

Build and train an LSTM model with an attention mechanism to identify insincere Quora questions, implemented in Python with Keras. The score ranks in the top 16% out of 4,037 teams.
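
A minimal Keras sketch of the model shape (embedding, bidirectional LSTM, self-attention, sigmoid output) is below; the vocabulary size, sequence length, and layer widths are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 50_000, 70, 128   # assumed hyperparameters

inputs = keras.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# Self-attention over the LSTM outputs (query and value are the same sequence).
attn = layers.Attention()([x, x])
x = layers.GlobalAveragePooling1D()(attn)
outputs = layers.Dense(1, activation="sigmoid")(x)   # insincere vs. sincere

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()
```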

Use news and market state to predict stock trends

Key words: LightGBM, news&market state

Build a LightGBM model to predict how stock prices will change based on the market state and news articles. The performance ranks in the top 4% out of 2,927 teams.
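
A minimal LightGBM sketch for this kind of tabular setup is shown below; the merged feature file, column names, and binary up/down target are assumptions about the data layout, not the competition's exact schema.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Assumed layout: one row per (asset, day) with market features joined
# to aggregated news features, and a binary target for next-day direction.
df = pd.read_csv("market_and_news_features.csv")      # hypothetical file
features = [c for c in df.columns if c not in ("target", "assetCode", "time")]

X_train, X_valid, y_train, y_valid = train_test_split(
    df[features], df["target"], test_size=0.2, shuffle=False  # time-ordered split
)

model = lgb.LGBMClassifier(
    n_estimators=500, learning_rate=0.05, num_leaves=63, subsample=0.9
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

print("valid AUC:", roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]))
```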

Is this your favorite food?

Key words: EDA, Multi-class classification, Feature Engineering

First, pre-process the cuisine recipes, then feed them into multiple models and choose the model with the best accuracy to predict the cuisine type.
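
A minimal sketch of that pipeline (vectorize recipe ingredients, compare several classifiers by validation accuracy, keep the best) is shown below; the data file and column names are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assumed layout: one row per recipe, with a list of ingredients and a cuisine label.
df = pd.read_json("recipes.json")                       # hypothetical file
X_text = df["ingredients"].apply(" ".join)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_text, df["cuisine"], test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_valid_vec = vectorizer.transform(X_valid)

# Compare a few candidate models and keep the most accurate one.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "nb": MultinomialNB(),
}
scores = {}
for name, clf in candidates.items():
    clf.fit(X_train_vec, y_train)
    scores[name] = accuracy_score(y_valid, clf.predict(X_valid_vec))

best = max(scores, key=scores.get)
print(scores, "-> best model:", best)
```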