How to win kaggle competition github

simaaron/kaggle-Rain

How Much Did It Rain? II

Kaggle competition winning solution

This document describes how to generate the winning solution to the Kaggle competition How Much Did It Rain? II.

Further documentation on the method can be found in this blog post.

Generating the solution

Install the dependencies

The models are written in Python 2.7 and make use of the NumPy, scikit-learn, and pandas packages. These can be installed individually via pip or all together in a free Python distribution such as Anaconda.

Theano can be installed and configured to use any available NVIDIA GPUs by following the instructions here and here. The Lasagne package often requires the latest version of Theano; a simple pip install Theano may give a version that is out-of-date (see Lasagne documentation for details).

Lasagne can be installed by following the instructions here.

Download the code

To download the code run:

Create an empty data folder

Download the training and test data

The training and test data can be downloaded from the Kaggle competition webpage at this link. The two extracted files train.csv and test.csv should be placed in the data folder.

Note: the benchmark sample solution and code provided by Kaggle are not required.

Preprocess the data

Replace the NaN entries with zeros (training and test data) and remove the outliers (training data only) by running:
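As a rough illustration of what this step does (not the repository's own script), here is a minimal pandas sketch on a tiny in-memory frame. The column names `Ref` and `Expected` and the 70 mm outlier cutoff are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical cutoff for "outlier" gauge readings; the real script may differ.
OUTLIER_THRESHOLD_MM = 70.0

def preprocess(df, is_train):
    """Fill NaNs with zeros; for training data, also drop outlier rows."""
    if is_train:
        # Filter on the target before filling, so rows are judged on real values.
        df = df[df["Expected"] <= OUTLIER_THRESHOLD_MM]
    return df.fillna(0.0).reset_index(drop=True)

# Toy stand-ins for train.csv and test.csv (assumed column names).
train = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "Ref": [np.nan, 12.5, 30.0, np.nan],
    "Expected": [0.5, 0.5, 250.0, 1.2],
})
test = pd.DataFrame({"Id": [1, 2], "Ref": [np.nan, 8.0]})

train_clean = preprocess(train, is_train=True)
test_clean = preprocess(test, is_train=False)
```

Outlier removal is applied to the training data only, since the test set has no target column to filter on.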

Augment the data sets with dropin copies

Create random augmentation copies of the datasets by running:

This creates 61 randomly augmented copies of the preprocessed training and test data sets and one of the validation holdout set. Note that each copy is > 2GB in size. If there is an issue with insufficient hard disk space, one should modify the training script NNregression_*.py and test script NNprediction_*.py to perform these augmentations dynamically.

The number of copies can be changed in the above scripts.

Train the networks

The two best models can be trained by running:

The outputs from different models are continually saved into separate output folders. These include the files training_scores.txt and validation_scores.txt which, for monitoring purposes, give the evolution of the training and validation errors respectively. The file model.npz is the current best fitting set of model parameters (w.r.t. the validation holdout set), and the last_learn_rate.txt records the current (decayed) learning rate.

Generate predictions from augmented test sets

The set of 61 augmented test set predictions from the model ‘v1’ can be obtained by running:

The predictions from the pre-trained model included in the code download can be obtained by running:

Average the augmented predictions

The predictions from different augmented copies can be combined by running:

The individual predictions from the models ‘v1’ and ‘v2’ would place one 2nd/3rd in the competition. A straight average of the two solutions would be sufficient for 1st place.

maremoto/Coursera-Kaggle

Coursera-Kaggle final project

Final project for the course «How to Win a Data Science Competition: Learn from Top Kagglers».

First, read the brief documentation (0_FinalProjectDocumentation.pdf) for a comprehensive overview.

Then, to see (and reproduce) how the project evolved, follow the Python notebooks in the order of their names.

The features generated are lags from the monthly sold item counts and revenues (price * items), and an expanding mean encoding of item category.
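The two feature types mentioned above can be sketched in pandas as follows. This is a toy illustration, not the notebooks' code: the table, the column names, and the single one-month lag are assumptions.

```python
import pandas as pd

# Toy monthly sales table; column names are illustrative assumptions.
sales = pd.DataFrame({
    "month":    [0, 1, 2, 0, 1, 2],
    "item_id":  [10, 10, 10, 20, 20, 20],
    "category": ["a", "a", "a", "b", "b", "b"],
    "item_cnt": [3.0, 5.0, 4.0, 1.0, 0.0, 2.0],
    "price":    [2.0, 2.0, 2.5, 10.0, 10.0, 9.0],
})
sales["revenue"] = sales["price"] * sales["item_cnt"]
sales = sales.sort_values(["item_id", "month"]).reset_index(drop=True)

# Lag features: the value of the same item one month earlier.
for col in ["item_cnt", "revenue"]:
    sales[f"{col}_lag1"] = sales.groupby("item_id")[col].shift(1)

# Expanding mean encoding of the category: for each row, the mean target of
# all earlier rows in the same category (the current row is excluded to
# avoid leaking its own target).
sales = sales.sort_values(["month", "item_id"]).reset_index(drop=True)
grp = sales.groupby("category")["item_cnt"]
sales["category_enc"] = (grp.cumsum() - sales["item_cnt"]) / grp.cumcount()
```

The first row of each category has no history, so its encoding is NaN and needs a fill value (for example the global mean) before training.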

The solution is a stacking of models, with a linear ElasticNet and LightGBM at the 1st level and a linear convex mix at the 2nd level. In the end, though, a LightGBM model alone performs almost the same!

A requirements.txt file is available, but the main tools used are:

KumarPython/Kaggle-Competition

Kaggle Competition | Machine Learning Competition | Machine Learning Competitions on Kaggle

Kaggle competitions I participated in, using Python and scikit-learn

Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

In June 2017, Kaggle announced that it had passed 1 million registered users, or Kagglers. The community spans 194 countries and is diverse, ranging from those just starting out to many of the world’s best-known researchers. Kaggle competitions regularly attract over a thousand teams and individuals. Kaggle’s community has thousands of public datasets and code snippets (called «kernels» on Kaggle). Many of these researchers publish papers in peer-reviewed journals based on their performance in Kaggle competitions.

By March 2017, the Two Sigma Investments fund was running a competition on Kaggle to code a trading algorithm.

Machine learning competitions: this was Kaggle’s first product. Companies post problems and machine learners compete to build the best algorithm, typically with cash prizes.

Kaggle Kernels: a cloud-based workbench for data science and machine learning. Allows data scientists to share code and analysis in Python, R and R Markdown. Over 150K «kernels» (code snippets) have been shared on Kaggle covering everything from sentiment analysis to object detection.

Public datasets platform: community members share datasets with each other. Has datasets on everything from bone x-rays to results from boxing bouts.

Kaggle Learn: a platform for AI education in manageable chunks.

How Kaggle competitions work

The competition host prepares the data and a description of the problem. Participants experiment with different techniques and compete against each other to produce the best models. Work is shared publicly through Kaggle Kernels to achieve a better benchmark and to inspire new ideas. Submissions can be made through Kaggle Kernels, through manual upload or using the Kaggle API. For most competitions, submissions are scored immediately (based on their predictive accuracy relative to a hidden solution file) and summarized on a live leaderboard. After the deadline passes, the competition host pays the prize money in exchange for «a worldwide, perpetual, irrevocable and royalty-free license to use the winning Entry», i.e. the algorithm, software and related intellectual property developed, which is «non-exclusive unless otherwise specified». Alongside its public competitions, Kaggle also offers private competitions limited to Kaggle’s top participants. Kaggle offers a free tool for data science teachers to run academic machine learning competitions, Kaggle In Class. Kaggle also hosts recruiting competitions in which data scientists compete for a chance to interview at leading data science companies like Facebook, Winton Capital, and Walmart.

Impact of Kaggle competitions

Kaggle has run hundreds of machine learning competitions since the company was founded. Competitions have ranged from improving gesture recognition for Microsoft Kinect to making a football AI for Manchester City to improving the search for the Higgs boson at CERN.

Competitions have resulted in many successful projects including furthering the state of the art in HIV research, chess ratings and traffic forecasting. Most famously, Geoffrey Hinton and George Dahl used deep neural networks to win a competition hosted by Merck. And Vlad Mnih (one of Hinton’s students) used deep neural networks to win a competition hosted by Adzuna. This helped show the power of deep neural networks and resulted in the technique being taken up by others in the Kaggle community. Tianqi Chen from the University of Washington also used Kaggle to show the power of XGBoost, which has since taken over from Random Forest as one of the main methods used to win Kaggle competitions.

Several academic papers have been published on the basis of findings made in Kaggle competitions. A key to this is the effect of the live leaderboard, which encourages participants to continue innovating beyond existing best practice. The winning methods are frequently written up on the Kaggle blog, Kaggle Winner’s Blog.

In March 2017, Fei-Fei Li, Chief Scientist at Google, announced during her keynote at Google Cloud Next that Google was acquiring Kaggle.

ShuaiW/how-to-kaggle

This article has been deprecated. For an updated version, check this blogpost.

Prologue: Use the Right Tool

According to a talk by Anthony Goldbloom, CEO of Kaggle, there are only two winning approaches:

1. Handcrafted feature engineering
2. Neural networks

In the first category, it «has almost always been ensembles of decision trees that have won». Random Forest used to be the big winner, but XGBoost has cropped up, winning practically every competition in the structured data category recently.

On the other hand, for any dataset that contains images or speech, deep learning is the way to go. Here winners spend almost none of their time on feature engineering; instead, they spend it constructing neural networks.

So, the major takeaway from the talk is that if you have lots of structured data, the handcrafted approach is your best bet, and if you have unusual or unstructured data your efforts are best spent on neural networks.

I’ve recently written a use case proposal for a Kaggle competition, and I analyzed (in the appendix) why deep neural networks, while theoretically able to approximate any function, are in practice easily dwarfed by ensemble tree algorithms in terms of effort put in vs. results. That said, stacking deep nets in the final ensemble might boost performance, as long as the features extracted are not highly correlated with those from ensemble trees.

1 + 1 is not always equal to 2 when it comes to team collaboration. More often, it is smaller than 2 due to factors such as each individual's favorite scripting language, workflow, etc. Besides using collaborative tools (Skype, svn, git) to keep communication effective, dividing the work efficiently is key to getting great results as a team.

According to Kaggle winners’ interviews, forming teams can make them more competitive as «combining features really helps improve performance». At the same time, it helps to specialize, having different members focus on different stages of the data pipeline.

So one plausible plan is as follows:

So in a traditional machine learning problem, after the preprocessing step, the people who work on feature engineering should produce new feature sets frequently so that the machine learning people can tune and train models continuously. In the deep learning regime, on the other hand, most of the time would be spent on network architecture design and model/parameter tuning.

Clearly documented, reproducible code and data files are crucial.

This may be one of the most important (yet overlooked) parts of competing in Kaggle, or of doing data science in general. To avoid reinventing the wheel and to get inspired on how to preprocess, engineer and model the data, it’s worth spending 1/10 to 1/5 of the project time just researching how people previously dealt with similar problems/datasets. Some good places to start are:

Time spent on literature review is time well spent.

Establish an Evaluation Framework

Having a sound evaluation framework is crucial for all the work that follows: if we use suboptimal metrics, or don’t have an effective cross-validation strategy to guide us in tuning generalizable models, we are aiming at the wrong target and wasting our time.

Metrics

As for which metric to use: on Kaggle, it is given under Evaluation on the competition page. If, however, we are working on a real-world problem where we are free to choose the ruler that measures how good (or bad) our models are, care should be taken to choose the metric that makes the most sense for the domain and data at hand. For example, wrongly classifying a spam email as non-spam does less harm than classifying a non-spam email as spam (imagine missing an important meeting email from your boss). Medical diagnosis is a similar case. In such situations, a simple metric such as accuracy is not enough, and we might want to consider others, such as precision, recall, F1, or ROC AUC.

A good place to view different metrics and their use cases is the sklearn.metrics page.
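To make the point about accuracy concrete, here is a small sketch using `sklearn.metrics` on made-up spam labels. The labels, predictions, and scores are invented for illustration; the classifier here simply never flags spam.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Imbalanced toy labels: 1 = spam (2 of 10 emails), 0 = non-spam.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A classifier that never flags spam still looks good on accuracy.
y_pred = np.zeros(10, dtype=int)
# Hypothetical predicted spam probabilities, used for ROC AUC.
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.35, 0.8])

acc = accuracy_score(y_true, y_pred)                      # 0.8
prec = precision_score(y_true, y_pred, zero_division=0)   # no positives predicted
rec = recall_score(y_true, y_pred)                        # 0.0: every spam missed
f1 = f1_score(y_true, y_pred, zero_division=0)
auc = roc_auc_score(y_true, y_score)                      # ranking quality of scores
```

Accuracy reports 80% even though no spam is caught, while recall and F1 are zero. ROC AUC is computed from the scores rather than the hard predictions, so it evaluates ranking instead.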

Cross Validation (CV) Strategies

The key point of machine learning is to build models that generalize to unseen data, and a good cross-validation strategy helps us achieve that (while a bad strategy misleads us and gives us false confidence).

Some good reads on cross-validation strategies:

Here is one simple example of nested cross-validation in Python using scikit-learn:
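A minimal sketch of such a setup, assuming a synthetic dataset, an `SVC` estimator, and a small parameter grid (all three are illustrative choices, not taken from the original article):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: 2-fold grid search picks hyperparameters on each training split.
inner = GridSearchCV(
    estimator=SVC(),
    param_grid={"C": [0.1, 1.0, 10.0], "gamma": ["scale", 0.01]},
    scoring="accuracy",
    cv=2,
)

# Outer loop: 5-fold CV scores the whole "search + refit" procedure on data
# that the inner search never saw.
scores = cross_val_score(inner, X, y, scoring="accuracy", cv=5)
print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```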

In a nutshell, the inner loop GridSearchCV is used to tune parameters, while the outer loop cross_val_score is used to train with optimal parameters and report an unbiased score. This is known as 2×5 cross-validation (although we can do 3×5, 5×4, or any fold combinations that work for our data size and computational constraints).

Exploratory Data Analysis (EDA)

EDA is an approach to analyzing data sets to summarize their main characteristics, often with plots. EDA helps data scientists get a better understanding of the dataset at hand, and guides them in preprocessing data and engineering features effectively. Some good resources to help carry out effective EDA are:

The one and only reason we need to preprocess data is so that a machine learning algorithm can learn from it most effectively. Specifically, three issues need to be addressed:

In practice, tasks in data preprocessing include:

Also check CCSU course page and scikit-learn documentation for data preprocessing in general and how to do it effectively in Python.

While in deep learning we usually just normalize the data (such that the image pixels have zero mean and unit variance), in traditional machine learning we need handcrafted features to build accurate models. Doing feature engineering is both art and science, and requires iterative experiments and domain knowledge. Feature engineering boils down to feature selection and creation.

Selection

scikit-learn offers some great feature selection methods. They can be categorized as one of the following:

Here is one simple example of selecting features from model in Python using scikit-learn:
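A minimal sketch, assuming a random forest on synthetic data with a few constant columns appended. The dataset and the tiny positive `threshold` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
# Append constant columns, which a tree can never split on, so their
# importance is exactly zero.
X = np.hstack([X, np.zeros((X.shape[0], 3))])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# A tiny positive threshold keeps every feature with non-zero importance.
# (Left at its default, SelectFromModel would use the mean importance for a
# forest, which is a much stricter cut.)
selector = SelectFromModel(clf, prefit=True, threshold=1e-10)
X_selected = selector.transform(X)
kept = selector.get_support()  # boolean mask over the original columns
```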

In the above example, features with zero importance (feature_importances_ = 0) will be eliminated.

Some good articles about feature selection are:

Creation

This is the part where domain knowledge and creativity come in. For example, insurers can create features that help their machine learning model better identify customers with higher risks; similar for geologists who work with geological data.

But here are some general methods that help you create features to boost model performance:

After adding new handcrafted features, we need to perform another round of feature selection (e.g., using SelectFromModel to eliminate non-contributing features). Note that different classifiers might select different features, and it’s important that features selected using a certain classifier are later trained with the same classifier. For classifiers that have neither a feature_importances_ nor a coef_ attribute (e.g., nonparametric classifiers such as KNN), the best way is to cross-validate the features selected by various classifiers (i.e., to pick the feature set with the highest CV score).

Feature engineering is the jewel in the crown of machine learning. As machine learning professor [Pedro Domingos](http://homes.cs.washington.edu/~pedrod/) puts it: «... some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.»

If we’ve got this far, model tuning is actually the easy part: we simply provide the parameter search space (coarse to refined) for certain classifiers, hit the run button, and let the machines do all the heavy lifting. In this regard, the nested CV strategy discussed above will do the job perfectly.

If only things were this simple. In practice, we face a big constraint: time (the deadline to deliver the model to clients, or the end of the Kaggle competition). To add more complexity, a single model rarely gives the best results, meaning we need to stack many models together to deliver great performance. How to effectively stack (or ensemble) models will be discussed in the next section.

So, one alternative is to optimize the search space automatically, rather than manually setting each parameter from the coarse to the refined. Several Kaggle winners use and recommend hyperopt, a Python library for serial and parallel parameter optimization.

More notes and code on how to use hyperopt effectively will be added.

Training one machine learning model (e.g., XGBoost, Random Forest, KNN, or SVM) can give us decent results, but not the best ones. To boost performance in practice, we might want to stack models using a combiner, such as logistic regression. This 1st place solution for the Otto Group Product Classification Challenge is AWESOME. Here is one way to do stacking:
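One possible sketch uses scikit-learn's `StackingClassifier` with logistic regression as the combiner. The base models and the synthetic data are assumptions for illustration, and this is not the Otto winners' procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Level 1: diverse base models. Level 2: logistic regression trained on their
# out-of-fold predictions (cv=5), so the combiner never sees predictions made
# on data a base model was fitted on.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
```

Training the combiner on out-of-fold predictions is the important design choice: fitting it on in-sample predictions would reward whichever base model overfits the most.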

LenzDu/Kaggle-Competition-Favorita

This is the 5th place solution for Kaggle competition Favorita Grocery Sales Forecasting.

This competition is a time series problem where we are required to predict the sales of different items in different stores for 16 days into the future, given the sales history and promotion info of these items. Additional information about the items and the stores is also provided. The dataset and a detailed description can be found on the competition page: https://www.kaggle.com/c/favorita-grocery-sales-forecasting

I built 3 models: a gradient boosting model, a CNN+DNN, and a seq2seq RNN. The final model was a weighted average of these models, where each model was stabilized by training it multiple times with different random seeds and averaging the results. Each model on its own would place in the top 1% of the final ranking.
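The seed-averaging and weighted-blend steps can be sketched with NumPy as follows. The prediction values and blend weights are made up; the weights actually used in the solution are not stated here.

```python
import numpy as np

def seed_average(runs):
    """Average predictions from the same model trained with different seeds."""
    return np.mean(np.stack(runs, axis=0), axis=0)

def weighted_blend(preds, weights):
    """Weighted average of per-model predictions; weights are normalized."""
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), np.stack(preds, axis=0), axes=1)

# Toy predictions for 4 samples; the numbers and the weights are made up.
lgbm = seed_average([np.array([1.0, 2.0, 3.0, 4.0]),
                     np.array([1.2, 1.8, 3.2, 3.8])])
cnn = seed_average([np.array([0.8, 2.2, 2.8, 4.4]),
                    np.array([1.0, 2.0, 3.0, 4.0])])
rnn = seed_average([np.array([1.1, 2.1, 3.1, 4.1])])

final = weighted_blend([lgbm, cnn, rnn], weights=[0.4, 0.3, 0.3])
```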

LGBM: It is an upgraded model from the public kernels. More features, data and periods were fed to the model.

CNN+DNN: This is a traditional NN model, where the CNN part is a dilated causal convolution inspired by WaveNet, and the DNN part is 2 FC layers connected to raw sales sequences. Then the inputs are concatenated together with categorical embeddings and future promotions, and directly output to 16 future days of predictions.
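For readers unfamiliar with dilated causal convolutions, here is a framework-free NumPy sketch of the operation. It is not the repository's model code; the kernel values and dilation schedule are assumptions chosen to make the receptive field easy to see.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before t = 0."""
    x = np.asarray(x, dtype=float)
    pad = (len(kernel) - 1) * dilation
    # Left-pad only: output at time t never depends on inputs after t.
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x)
    for k, w in enumerate(kernel):
        start = pad - k * dilation
        y += w * xp[start:start + len(x)]
    return y

sales = np.arange(1.0, 9.0)          # toy daily sales: 1, 2, ..., 8
h = sales
# Doubling the dilation per layer grows the receptive field exponentially:
# with kernel size 2 and dilations 1, 2, 4 each output sees 8 past days.
for d in (1, 2, 4):
    h = dilated_causal_conv1d(h, kernel=[1.0, 1.0], dilation=d)
```

With all-ones kernels, the last output equals the sum of all 8 inputs, which confirms that three layers cover an 8-day window without looking into the future.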

RNN: This is a seq2seq model with an architecture similar to @Arthur Suilin’s solution for the web traffic prediction competition. The encoder and decoder are both GRUs. The hidden states of the encoder are passed to the decoder through an FC layer connector, which improved accuracy significantly.

How to Run the Model

Before running the models, download the data from the competition website, and add zero-sales records for every existing store-item combination on each December 25th in the training data. Then use the function load_data() in Utils.py to load and transform the raw data files, and use save_unstack() to save them to feather files. In the model code, change the input of load_unstack() to the filename you saved. The models can then be run. Please read the code of these functions for more details.

Note: if you are not using a GPU, change CudnnGRU to GRU in seq2seq.py
