How to data mine
Data Mining in Online Games
In every online service and game, the largest share of the audience leaves right at the start, within the first minutes and hours of getting to know the product. Hundreds of books and articles have been written on the topic, with all sorts of hypotheses about what drives success and audience loyalty: uniqueness, simplicity, usability, being free to play, a tutorial or manual, emotional appeal, and many other factors are all considered crucial.
We wanted to find out why players leave and whether their departure can be predicted. The subject of the study is the MMORPG Aion, but our results turned out to apply to a wide range of games and online services.
Some British scientists or other have supposedly established that users have a very short memory. Today they quit the game, and tomorrow they won't even remember installing it. If a player has left, you have to act immediately. But how do we tell whether a person has really left, or is just out having a beer with friends tonight and won't show up in the game? The ideal case would be to predict a potential departure before the user has actually left us, and even before the thought that Aion just isn't that good has formed in their mind. That problem is probably solvable too, but we set a more realistic goal: to predict churn promptly, on the day of the last login. We define churn as a week of inactivity. We do not want to wait those 7 days; we want to know as soon as possible that the player is not coming back. We want to know the future!
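The working definition above (churn means a week of inactivity) can be written down as a labeling rule. This is a minimal sketch, assuming a last-login date is available per player; the function name and the choice to treat exactly 7 elapsed days as churn are illustrative assumptions, not details from the article.

```python
from datetime import date, timedelta

# The article's definition: a player who is inactive for a week has churned.
CHURN_WINDOW = timedelta(days=7)


def is_churned(last_login: date, today: date) -> bool:
    """Return True if at least 7 days have passed since the last login.

    Counting exactly 7 days as churn is an assumption made for this sketch.
    """
    return today - last_login >= CHURN_WINDOW
```

The label is only known a week after the fact, which is exactly why the project tries to predict it on the day of the last login.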
The technical side
We had a sea of information to analyze. Aion has the best logging system I have seen in a Korean game: we know literally every move a player makes, every sneeze and every trace they leave on the server. The analysis period was the first nine levels of the game, roughly the first 10 hours of gameplay; about half of all newcomers dropped out during this period.
The project was given a share of our analytics system's resources: two blade servers (Dual Xeon E5630, 32 GB RAM), 10 TB of cold storage for source and intermediate data, and 3 TB of hot storage on a RAID 10 SAS array for working data. Both servers ran MS SQL Server 2008 R2, one for the database and one for Analysis Services. The software side was the standard Microsoft Business Intelligence stack that ships with SQL Server.
Phase 1: I know everything!
Since I had been a game designer for many years and had run close to a hundred playtests, I was sure that expert opinion would once again provide 90% of the answers to why players leave. Didn't learn to use teleportation and got tired of running around on foot: gone. Died to the very first monster in the game: gone. Failed the second mission, got stuck and didn't know what to do: gone. Aion, for all its quality and technical polish, is not the most newcomer-friendly game. That is a trait of all Korean games, which are designed for the hardcore, hypersocial environment of Korean players rather than for lonely, bored Russian casual users.
How to read a lift chart: the lower straight sloping line is the result of a random number generator predicting our Boolean variable by the scientific method of coin tossing. The upper line, which quickly reaches 100%, is the oracle, an ideal predictor of the future. Between them lies an uneven, trembling thread: that is our model. The closer the curve is to the ideal line, the higher the model's predictive accuracy. The chart shown is for level 7, but the picture is similar for levels one through nine.
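For readers who want to reproduce this kind of chart outside Analysis Services, the model's curve can be approximated as a cumulative gains curve. This is a rough sketch under assumptions: `scores` are hypothetical churn probabilities from any model, `labels` are 1 for players who really left and 0 otherwise, and at least one label is 1. It is not the exact computation the Microsoft tool performs.

```python
def cumulative_gains(scores, labels):
    """Return (share_of_players, share_of_churners_captured) points.

    Players are ranked from the highest predicted churn score to the lowest.
    A random predictor would follow the diagonal; an ideal one would reach
    1.0 after covering only the real churners.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_churners = sum(labels)  # assumed to be greater than zero
    points = [(0.0, 0.0)]
    captured = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((rank / len(ranked), captured / total_churners))
    return points
```

Plotting these points against the diagonal gives the "trembling thread" between the coin-toss line and the oracle.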
Fatality! Our first model predicts departing players only slightly better than a coin flip. We feed the remaining hypotheses into the models, clean the data, and process:
Better, but precision is still only slightly above 50%. And a closer look at recall (type II errors) paints a sad picture:
The same table in plain language: out of every 100 departures predicted by the model, 49 will be false (the player had no intention of leaving), so the model's precision is 1008/(1008+982)=51%. On top of that, the model misses some of the real departures altogether, about 28% of those who actually left [391/(391+1008)=28%]. Note that this is not the canonical definition of recall, but this formula is more intuitive.
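The arithmetic above can be checked in a few lines. The counts come from the text (1008 correctly predicted departures, 982 false alarms, 391 missed departures); the function names are illustrative. The sketch also shows canonical recall next to the "share of misses" figure the article uses, since the two add up to 100%.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted departures that were real."""
    return tp / (tp + fp)


def miss_rate(tp: int, fn: int) -> float:
    """Share of real departures the model failed to predict (the article's figure)."""
    return fn / (fn + tp)


def recall(tp: int, fn: int) -> float:
    """Canonical recall: share of real departures the model caught."""
    return tp / (tp + fn)


tp, fp, fn = 1008, 982, 391  # counts quoted in the text
print(f"precision: {precision(tp, fp):.0%}")
print(f"miss rate: {miss_rate(tp, fn):.0%}")
print(f"recall:    {recall(tp, fn):.0%}")
```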
Phase 1 result: all the initial ideas failed, and the prediction does not work. Boss, all is lost!
Phase 2: we know nothing
A total rout, a flight from the battlefield, and the eternal question: "What is to be done?" The naive Bayes algorithm comes to the rescue; it is the most human-readable and understandable of all the data mining classifiers. The Bayes analysis showed that the chosen hypotheses characterize departing and staying players rather poorly, meaning I had picked the wrong initial premises. But after playing with the depth and sensitivity of another algorithm, the decision tree, it became clear that some hypotheses are right and the tree does branch on them, but there are decidedly not enough factors: the tree stops growing at the second or third branch.
Without getting into the mathematical weeds, which I do not understand myself, here is the simplified version: the decision tree algorithm splits the source data into segments with the lowest possible resulting entropy, that is, into datasets that are as dissimilar as possible. If the tree has stopped branching, we need new hypotheses and new metrics in the source data so that the tree splits the incoming data more deeply and predicts the future better.
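The entropy idea can be illustrated with the textbook information gain of a single split. This is a simplified sketch, not the scoring that Analysis Services actually uses (which may differ); labels are assumed to be 1 for "left" and 0 for "stayed", and the function names are illustrative.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * log2(p)
    return result


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted
```

A factor that separates leavers from stayers cleanly yields a high gain, so the tree branches on it; a factor that leaves both segments equally mixed yields a gain of zero, and the tree stops growing there.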
I held a brainstorm with the project team, where we gushed ideas about who our newcomers are, how they play, and how they differ from one another. We recalled stories of how our girlfriends and wives got acquainted with Aion and what came of it. The brainstorm produced an extended list of individual hypotheses (whether the player used teleportation, expanded their inventory, bound a resurrection point, and so on) and a new idea: it would be good to see how much the overall activity of departing players differs from that of players who stay.
Loaded, trained, verified, analyzed. I won't burden you with a sea of lift charts for every level and every model; here are the processed and analyzed results right away:
The precision peak at level 9 was due to an internal peculiarity of the game at the time of the study.
Overall the picture improved around levels 2-4, but levels 6-8 are rock-bottom; at that level of accuracy the data is simply useless to us.
The decision tree clearly shows that activity factors are the most important for predicting churn. In essence, three values (time spent on the level, monsters killed, and quests completed) determine the lion's share of departures. The remaining factors add no more than 5% accuracy. The tree also remains bare, with the crown cut off at the third branch, which means the model is hungry for more relevant metrics. One more puzzle: the accuracy of the three algorithms varies a lot from level to level.
Phase 2 result: the idea of measuring average activity rather than individual factors was a success, but prediction accuracy is still unsatisfactory. Stepping on rakes along the way led us to the right order for analyzing results: first factors and correlations (Bayes), then their effect on the outcome (decision tree).
Phase 3: we know where to dig
Encouraged by the progress, I outlined three directions for the project: more metrics of overall activity, more specific metrics of individual effectiveness, and a deeper study of the Microsoft BI tools.
We had to tinker with new individual metrics related to gameplay depth and playing effectiveness, for example the percentage of auto-attacks. We segmented characters by class (warriors to the right, healers to the left), calculated the 25th, 50th, and 75th percentiles of the auto-attack percentage distribution for each class, and split everyone into 4 categories. The data is now normalized and game classes can be compared with one another: the data mining models receive the category number as input.
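The per-class quartile split described above might look like the following. This is a sketch under assumptions: the input is a list of hypothetical `(player_id, game_class, autoattack_pct)` tuples, each class has at least two players, and values equal to a cut point fall into the higher category. The original work was done in SQL Server, not Python.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import quantiles


def categorize_by_class(rows):
    """Map each player to a category 1..4 based on quartiles within their class.

    rows: list of (player_id, game_class, autoattack_pct) tuples.
    """
    by_class = defaultdict(list)
    for _, game_class, pct in rows:
        by_class[game_class].append(pct)

    # 25th, 50th and 75th percentiles, computed separately for every class.
    cuts = {
        game_class: quantiles(values, n=4, method="inclusive")
        for game_class, values in by_class.items()
    }

    # bisect_right counts how many cut points are <= the value (0..3).
    return {
        player_id: bisect_right(cuts[game_class], pct) + 1
        for player_id, game_class, pct in rows
    }
```

With this scheme a warrior at 10% auto-attack and a healer at 60% can land in the same category if each sits at the bottom of their own class's distribution, which is what makes the classes comparable.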
The individual metrics settled at a depth of the seventh to ninth tree node, i.e. they added a couple of percent to prediction accuracy but did not improve the situation dramatically. The next step was studying the book Data Mining with Microsoft SQL Server 2008 for the finer points of working with Analysis Services. The book itself helped only with tuning the tree's sensitivity (one or two percent of extra accuracy at most), but it prompted the idea of proper discretization.
In the auto-attack example above we discretized the data manually, splitting it into categories by certain criteria. SQL Server can discretize automatically in several ways. Through experimentation I quickly realized that the splitting algorithm and the number of buckets strongly affect the model's predictive power. Manually changing the number of buckets strongly affects the shape and accuracy of the tree. I spent a week on manual tuning, meticulously experimenting with the number of buckets for every structure at every level (that is 9 levels with 30+ metrics each). For some metrics the optimum was 7 buckets (for example, time on the current level), for some 12 (total time in the game), and for some more than 20 (number of monsters killed).
Manual tuning produced a strong increase in predicted cases: precision did not rise much, but the models started missing noticeably fewer departures, and the tree's results caught up with the neural network:
Phase 3 result: we reached acceptable precision and recall and learned a lot of interesting things about our game and our players.
Phase 4: nothing but victory
Honestly, I thought we had hit the ceiling: the tree branches to a depth of 9-12 nodes and recall is much improved. New hypotheses do not raise precision at all, and new factors provide no information. In principle, an overall precision of 78% and a recall of 16% (in the non-canonical sense used earlier, i.e. the share of real departures missed) is satisfactory for starting to work with players. At those numbers I probably would not hand out free subscriptions to keep players in the game, but we can already send a player relevant information without many mistakes.
Help came unexpectedly. Since the data mining project was already in its third month, our logs had grown somewhat stale; the game had changed in the meantime. After loading some fresh data, and reworking the ETL procedures yet again while we were at it, we noticed changes in the models. They behaved differently on the new data: with roughly the same precision and recall, the tree splits were different. At this stage all three algorithms trained very quickly, about a minute for each of the 9 levels, so feeding them an additional dataset was easy.
No sooner said than done: we export all the data accumulated over 3 months and send the models off to train in one go (the process now took five minutes per level instead of one, which is not critical). Another round of manual tuning, and here is the result:
By increasing the volume of training data we made processing take longer, but what an excellent result!
Unfortunately, little can be done about level one: about half of the departures are, as Avinash Kaushik would say, “I came, I puked, I left”. We have data on literally a couple of the player's actions, after which they close the game client and never come back.
A reminder: all the research above is training on accumulated historical data. Now I want a trial by fire! We test on live data: take fresh users from today, run them through the model, and save the predictions. A week later we compare the model's predictions with objective reality, i.e. which of the week-old newcomers really left and who stayed in the game:
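The week-later comparison boils down to joining the saved predictions with the observed outcomes and counting the four possible cases. A minimal sketch, assuming both are available as hypothetical dictionaries keyed by player ID with `True` meaning "churned"; the names are illustrative and not taken from the project.

```python
from collections import Counter


def confusion_counts(predicted, actual):
    """Compare saved predictions with what really happened a week later.

    predicted, actual: dicts mapping player_id -> bool (True = churned).
    Players without a recorded outcome are skipped.
    """
    counts = Counter(tp=0, fp=0, fn=0, tn=0)
    for player_id, predicted_churn in predicted.items():
        if player_id not in actual:
            continue
        really_churned = actual[player_id]
        if predicted_churn and really_churned:
            counts["tp"] += 1  # predicted churn, player really left
        elif predicted_churn:
            counts["fp"] += 1  # false alarm, player stayed
        elif really_churned:
            counts["fn"] += 1  # missed departure
        else:
            counts["tn"] += 1  # correctly predicted to stay
    return counts
```

From these counts, precision and the share of missed departures follow from the same formulas used for the historical data.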
The most interesting part
The first goal of the project, predicting a newcomer's departure from the game, has certainly been achieved. With this accuracy we can already make decisions about winning the player back, talk to them, motivate them, and hand out goodies. And the prediction comes almost on the day of departure: a person logs out this evening, the data is processed at 5 a.m. tomorrow, and the churn probability is already known with high confidence.
Victory!
In two months, from scratch (none of us had ever come anywhere near data mining), with the help of two books and a desire to try something new, and building on the powerful but passive analytics system we had created at Innova, we made a tool that actively looks into the future. Unlike ordinary reporting and trend analysis on historical data, at 6 a.m. we already know almost for certain whether we will see yesterday's Aion newcomers in the game today. And we can act before it is too late.
The analysis was done for an online game, but as you may have noticed, the main contribution to prediction accuracy came from generalized activity metrics. I am confident the approach will work for any product or service of yours that users actively engage with, provided of course that you want to reach a qualitatively new level.
PS. If Habr readers find the topic interesting, I can continue with predicting the churn of veteran players, segmentation and clustering, migration between clusters, and other data mining projects we completed this past year.
PS2. The second book, which I recommend to absolutely everyone, is Programming Collective Intelligence.
Introduction to Data Mining: A Complete Guide
Data mining is the process of finding anomalies, patterns, and correlations within large datasets to predict future outcomes. This is done by combining three intertwined disciplines: statistics, artificial intelligence, and machine learning.
Read on to learn more about the uses of data mining in the real world, important distinctions between data mining and other related data functions, and data mining tools and techniques.
What Is Data Mining?
Data mining is an automated process that consists of searching large datasets for patterns humans might not spot.
For example, weather forecasting is based on data mining methods. Weather forecasting analyzes troves of historical data to identify patterns and predict future weather conditions based on time of year, climate, and other variables.
This analysis results in algorithms or models that collect and analyze data to predict outcomes with increasing accuracy.
How Does Data Mining Work?
In the information economy, data is downloaded, stored, and analyzed for almost every transaction we perform, from Google searches to online shopping. The benefits of data mining are applicable across industries, from supply chains to healthcare, advertising, and marketing.
Data mining business use cases typically center around personalizing customer experiences.
Predictive analytics help businesses personalize user interactions, determine the best time to upsell or cross-sell a customer, identify cost inefficiencies in their supply chain, and analyze user behavior to deduce customer pain points.
Data Mining Process In 5 Steps
The data mining process consists of five steps. Learning more about each step of the process provides a clearer understanding of how data mining works.
What Is Data Mining Often Confused With?
Data mining is often confused with a number of related terms, so it helps to understand how it differs from each of them.
3 Common Data Mining Applications
Data mining is used across a wide range of industries. Below are three common data mining applications in three fields: marketing, business analytics, and business intelligence.
4 Key Data Mining Programming Languages
In order to become a data miner, there are four essential programming languages you need to learn: Python, R, SQL, and SAS.
7 Essential Data Mining Techniques
There are a number of data mining techniques. Below is a breakdown of the seven most essential techniques used by data scientists.
Essential Data Mining Tools
Data scientists use a range of statistical software applications like Spark and IBM SPSS Modeler to clean, organize, parse, analyze, and visualize data to convert it into usable information.
Thankfully, many data mining tools are open-source and free to use, so anyone can experiment with them.
Data Mining: Frequently Asked Questions
Below you’ll find the answers to a number of frequently asked questions on data mining, how data mining is used in business, and more.
Who uses data mining?
Businesses across every industry and sector use data mining to extract business insights from their data, from retail to healthcare, manufacturing, banking, education and more. For example, companies with a low customer retention rate, such as utilities and telecommunications companies, use data mining to predict customer ‘churn’ based on customer behavior.
Data mining has non-commercial use cases, too. Local governments use it to predict graduation rates in their school districts, public health officials use it to predict the spread of infectious disease, and doctors use it to predict whether premature babies might develop dangerous infections.
How is data mining used in business?
In business, data mining is used to interpret and predict customer behavior using data analytics and track operational metrics in real-time using business intelligence.
Data mining helps businesses maximize revenue by discovering customer pain points, identifying opportunities for cross-selling and upselling, and minimizing risks when launching new products or business ventures.
What are the challenges of data mining?
The biggest impediment to effective data mining is poor data quality, such as incomplete data, missing or incorrect values, poor representation in data sampling, or noisy data (data with a large amount of meaningless additional information).
It can also be immensely difficult to integrate conflicting or redundant data from multiple sources and forms, such as combining structured and unstructured data. There is also the high cost of buying and maintaining software, servers, and storage applications to handle large amounts of data.
What makes data mining an important business tool?
Data mining helps businesses make more educated decisions based on real-world conditions. Data mining empowers businesses to develop smarter marketing campaigns, predict customer loyalty, identify cost inefficiencies, prevent customer churn, and personalize the customer experience using recommendation engines and market segmentation.
Does data mining require coding?
Yes. In addition to software, data scientists also use programming languages like R and Python to manipulate, analyze and visualize data.
What are the benefits of data mining?
Data mining empowers organizations to make better decisions based on real-time and historical data. By building models to predict future behaviors, businesses can have a better understanding of their customers, which gives them a competitive advantage.
Raw data in itself is not useful to businesses; it has to be processed and interpreted. Data mining is deployed in different ways across industries. For example:
About Sakshi Gupta
Sakshi is a Senior Associate Editor at Springboard. She is a technology enthusiast who loves to read and write about emerging tech. She is a content marketer and has experience working in the Indian and US markets.
A Complete Guide to Data Mining and How to Use It
Written by Luna Campos
Data mining is one of the most effective ways organizations can make sense of their data. This technique can be extremely valuable to streamline operations, build accurate sales forecasts, increase marketing ROI, provide valuable customer insights, and much more.
Let’s talk about what data mining is, some key definitions to keep in mind, common challenges, and how your business can harness its potential safely and ethically.
What is data mining?
Data mining is the process of analyzing large amounts of data to find trends and patterns. It allows you to turn raw, unstructured data into comprehensible insights about various areas of the business. These areas may include sales, marketing, operations, finance, and more.
Any data that has to do with your business can be mined. This data includes but is not limited to:
Feeling overwhelmed? That’s understandable. Most businesses wish they could take better advantage of their data to make better, more informed decisions — but that is much easier said than done.
Big data is a veritable gold mine in what it has to offer, but managing, analyzing, and deriving insights from it presents a lot of challenges, too. And when you start learning about data management, you come across all this technical jargon and complex definitions that seem to make it all the more complicated.
That’s where data mining comes in. It takes everything that’s overwhelming about analyzing and managing big data and makes it much more accessible and easier to understand.
How Data Mining Works
Data mining can give you important insights that solve problems, reduce risks and costs, identify market opportunities, improve customer experience, and predict customer behaviors and preferences.
Before we dive into the more tactical aspects of data mining, let’s take a look at the benefits.
Benefits of Data Mining
When done well, data mining can bring a significant advantage by providing business intelligence you wouldn’t otherwise have access to. It also gives you insights in a much more relevant and timely manner. Some of the benefits of data mining include:
1. It allows you to easily find the most important data.
Big data has some really useful information in it, but there’s also a lot you don’t need and that would hinder analyses rather than help. Data mining allows you to automatically separate out the valuable information and turn it into actionable reports.
If you’re using a tool such as Operations Hub to track your data, you often don’t have to look at the raw numbers at all or create reports from scratch each time. Instead, you can find your most pertinent data each time you access the tool, negating the need to export and compile spreadsheet after spreadsheet of raw numbers.
2. It results in faster, automated decision-making.
Instead of needing a person to review everything and decide on a course of action, you can automate certain decisions. For example, banks can use software to identify data trends that look like fraudulent behavior and automatically block accounts within seconds, notify a responsible individual, or request additional verification from users.
Even if you have a person manually reviewing the data, you can speed up the decision-making process by having data mining processes in place that turn the big data into more digestible fragments.
3. It helps your team work more efficiently.
Imagine having your sales team review a 100-tab spreadsheet every time they want to find the number of customers in a certain industry. Data mining takes all of this manual work out of the equation by providing a way for salespeople to find this information without wading through rows and rows of big data.
There are hundreds of use cases where data mining will serve both managers and individual contributors in a team. If your job is to find patterns and trends in a data set, data mining will help you do that effortlessly.
4. It helps you gather accurate data about your customers.
Data mining can help you gather customer data from multiple sources and collate it to form informative and thorough profiles. This can give you valuable knowledge about customer trends, preferences, behaviors, similarities, and differences. That’s the type of information that helps you deliver a better customer experience overall and improve communication across all touchpoints.
5. It helps you increase revenue.
With the knowledge you get from data mining, you can build much more personalized sales pitches, create better campaigns, and tailor content and product recommendations based on known customer preferences and behaviors.
You can also predict trends in how consumers purchase or navigate your website, figure out what stops them from buying or what leads them to churn, create accurate audience segments, and offer tailored promotions. It goes without saying that these data-driven changes yield a significantly higher ROI, increasing revenue.
Now that you know the benefits of data mining, let’s take a look at some techniques you can use to get started.
Data Mining Techniques
You can get started data mining without needing a data analyst on your staff roster. We’ll start with some basic techniques, then move on to more specialized processes.
An often overlooked step when implementing data processes — including data mining — is data integration. In a nutshell, data integration means combining data from several disparate sources into a unified database for a more consistent view of the data. It’s one of the most important steps in data lifecycle management (DLM).
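As a rough illustration of data integration, the sketch below merges customer records from two sources into one unified view. Everything here is an assumption made for the example: the two source names, the use of a normalized `email` field as the join key, and the rule that later sources overwrite earlier values.

```python
def integrate(crm_rows, billing_rows):
    """Merge two lists of dict records into one record per customer.

    Records are matched on a lower-cased, trimmed 'email' field
    (a hypothetical join key chosen for this sketch).
    """
    unified = {}
    for source in (crm_rows, billing_rows):
        for row in source:
            key = row["email"].strip().lower()
            record = unified.setdefault(key, {"email": key})
            # Copy every non-empty field; later sources overwrite earlier ones.
            record.update(
                {k: v for k, v in row.items() if k != "email" and v is not None}
            )
    return list(unified.values())
```

In practice a dedicated integration tool would handle this, along with conflict resolution and ongoing syncing, but the underlying idea is the same: one consistent record per customer.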
Advanced Data Mining Techniques
For the following techniques, you might need a data analyst who knows how to use AI and machine learning tools to further refine the data mining processes at your business.
How to Data Mine
Data mining may sound like something only an enterprise firm can do, but any company can do it, so long as you approach it in stages. For that, we recommend using CRISP-DM (Cross Industry Standard Process for Data Mining). It comprises six stages:
We break these down below.
Stage 1: Business Understanding
In this stage, your job is to figure out what your company is trying to get out of this data mining project. Is it to increase revenue? Find better prospects? Attract top talent? Create more profitable marketing campaigns? It can truly be anything, so long as you can arrive at an answer by analyzing data.
Stage 2: Data Understanding
Next up, it’s time to identify the datasets you need to answer your question. For instance, if your goal is to increase revenue, you might need the current number of customers, the number who have churned, and the average deal size.
Gather your high-quality data and store it in a format that you can easily access. If you’re just getting started with data mining, you might use something as simple as Google Sheets. If your business is growing, consider HubSpot’s data sync tool. If you’re experienced, you might opt for a tool such as Tableau.
Stage 3: Data Preparation
Clean up the data, remove duplicates, and ensure it represents your business accurately. To avoid errors, you might employ the help of a tool such as Operations Hub and appoint this task to one person. Allowing multiple people to collaborate on one dataset at the same time may lead to duplicates and redundancies.
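If you are preparing the data yourself rather than in a dedicated tool, removing duplicates can be as simple as the sketch below. It assumes records are dictionaries and that two records are duplicates when a chosen set of fields match after trimming whitespace and ignoring case; the function and field names are illustrative.

```python
def drop_duplicates(rows, key_fields):
    """Keep the first occurrence of each record, comparing only `key_fields`.

    Values are compared case-insensitively with surrounding whitespace removed.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(str(row.get(field, "")).strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

For example, `drop_duplicates(customers, ["email"])` would keep one record per email address, whichever appeared first.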
Check out our guides on data quality and data lifecycle management to ensure you do everything you need to do in this stage.
Stage 4: Modeling
In the modeling stage, you use algorithms, artificial intelligence, and machine learning to associate, categorize, regress, and cluster your data. If you have a data analyst on staff, they might use the R and Python programming languages to carry out these data mining techniques. They might also use data mining software.
If you’re just getting started, you might use the pivot table, filtering, and data visualization tools in your spreadsheet software.
Stage 5: Evaluation
Next, it’s time to look at the results. Do your findings help you answer the business question you established in stage one? If not, then it’s time to try stage four again — it’s totally normal to have to model the data various times before gleaning the right insights.
Stage 6: Deployment
Last, you compile all of your results in a presentation or dashboard and present it to key stakeholders. You’ll all convene and figure out what to do based on what you found in your data.
Data mining has its benefits, but it can sound like a lot to tackle for a beginner. One common point of confusion is the difference between data mining and data harvesting.
Data Mining vs Data Harvesting
Data mining is the analysis of large sets of data in order to derive trends, and data harvesting is the process of extracting data from online sources to then build analyses. While data mining focuses more on the analysis of data, data harvesting focuses on the collection of data.
The two processes can be complementary if done properly. Data harvesting involves crawling a website to extract its data. You can then use data mining to organize it into intelligible information.
While it is possible to do this safely and ethically, there are plenty of malicious actors who use data harvesting methods to collect information online — such as email addresses, contact lists, photos, videos, text, or code — without users’ consent or knowledge.
Let’s take a look at one real-life example and two hypothetical examples to illustrate how harmful this practice can be.
Data Harvesting Examples
Harvesting Data from Facebook
One famous example of data harvesting you might have heard of was the Cambridge Analytica and Facebook scandal. As reported by The New York Times, the British political consulting firm started harvesting data of millions of Facebook users in order to build psychological profiles of voters and try to sell them to political campaigns.
Though the Cambridge Analytica scandal was large-scale and had huge repercussions, unethical data harvesting practices can be conducted by any type of company, regardless of size.
Acquiring Data Without Users’ Consent
Let’s say a small media startup is hoping to build more personalized content recommendations for their audience, which is mainly composed of women aged 18-24. So, in order to get more data to build these campaigns, this company decides to crawl similar websites that are often visited by the same target audience.
It finds out what type of content they consume there and builds tailored content recommendations from that. However, this data was acquired without users’ consent, which already constitutes a data harvesting malpractice.
Buying Email Lists
Another unethical data harvesting example is when a company is seeking to broaden the reach of their email newsletters, but doesn’t have a huge number of subscribers yet. So this company decides to buy a contact list from a third-party provider to reach more people. However, buying and selling contact lists may be prohibited under several data protection laws, as well as sending unsolicited emails when users didn’t explicitly provide their personal data or consent to receive emails.
The scenarios described above are perfect examples of what not to do when deploying data mining and harvesting. In the Facebook-Cambridge Analytica case, for instance, data was extracted without users’ consent or knowledge. Facebook also failed to safeguard user data against external actors, and the data was then used for purposes that the users didn’t explicitly agree to, or even necessarily know about.
That’s why it’s paramount to be aware of the potential pitfalls with data mining and data harvesting and ensure that you carry out these practices ethically and transparently.
When Data Mining, Ensuring Data Protection and Privacy Is Key
Like any process that deals with sensitive data — including personal data — your number one concern should be to ensure that all data you’re collecting and using has been provided with explicit consent and in full compliance with any applicable privacy laws. This also includes making sure the data is secure throughout all stages of the process, including collection, storage, analysis, all the way to data deletion.
Organizations also need to implement internal rules to specify what the data can be used for and how it can be analyzed and implemented – and make sure that the insights taken from data mining themselves don’t infringe on privacy policies. As a rule of thumb, being transparent, honest, and ethical with data should be your top priority.
Some companies may want to hire staff specialized in data science and security to oversee all data management and analysis procedures, which can be a big help to ensure data protection and user privacy throughout the entire process. They can also deploy specialized tools to achieve the best results.
However, all this specialized know-how and tooling can end up getting quite expensive, which could make data mining cost-prohibitive for smaller or more budget-conscious businesses. This cost may also scale as your company grows and the complexity of your data increases.
Integrating Your Data Before Mining
Integrating your data can make data mining even more effective and accurate. Since your data would be unified, enriched, and up-to-date after integration, it would be much easier and faster to identify trends and patterns, allowing for more agile decision-making based on current and accurate results.
If you use a syncing solution like Operations Hub to integrate your data, your customer databases are also updated in real time, so any analysis you gather from this data will be based on real-time insights and enable you to build more accurate profiles and compile reliable reports.
This type of integration can also sync customers’ communication preferences between your apps, making it much easier for you to visualize customers’ opt-ins and opt-outs in all apps to comply with data protection and privacy laws.
With that, you can not only gather accurate, reliable, and relevant insights from your data, but you can do so safely and legitimately — putting users’ privacy and protection front and center.
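As a rough illustration of what "unified, enriched, and up-to-date" can mean in practice, here is a minimal Python sketch that merges customer records from two sources by email address. It does not use any real syncing product's API. The record fields (`email`, `name`, `opt_in`, `updated`), the sample data, and the `integrate` helper are all hypothetical, and "the most recently updated value wins" is just one possible merge rule.

```python
from datetime import date

# Hypothetical exports from two apps; the field names are assumptions.
crm = [
    {"email": "ana@example.com", "name": "Ana", "opt_in": True, "updated": date(2022, 1, 5)},
    {"email": "bo@example.com", "name": "Bo", "opt_in": True, "updated": date(2022, 1, 10)},
]
newsletter = [
    {"email": "ANA@example.com", "opt_in": False, "updated": date(2022, 2, 1)},
    {"email": "cy@example.com", "name": "Cy", "opt_in": True, "updated": date(2021, 12, 1)},
]


def integrate(*sources):
    """Merge records by normalized email; the most recent update wins per field."""
    unified = {}
    for source in sources:
        for rec in source:
            key = rec["email"].strip().lower()
            current = unified.setdefault(key, {"email": key})
            # A record is "newer" if nothing is stored yet or its timestamp is later.
            newer = "updated" not in current or rec["updated"] >= current["updated"]
            for field, value in rec.items():
                if field == "email":
                    continue
                # Newer records overwrite; older ones only fill in missing fields.
                if newer or field not in current:
                    current[field] = value
    return unified


customers = integrate(crm, newsletter)
# Only contacts whose latest known preference is an opt-in may be emailed.
mailable = sorted(email for email, c in customers.items() if c.get("opt_in"))
print(mailable)
```

In this sketch Ana's later opt-out in the newsletter tool overrides her earlier opt-in in the CRM, while her name is still carried over from the CRM record.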
Editor’s note: This post was originally published in October 2020 and has been updated for comprehensiveness.
Originally published Feb 9, 2022 7:00:00 AM, updated February 09 2022
Introduction to data mining techniques
Data mining techniques are a set of algorithms intended to uncover hidden knowledge in data. Which technique to use depends entirely on the problem we are trying to solve. Some of the popular data mining techniques are classification algorithms, prediction analysis algorithms, and clustering techniques. In this introductory post, we address the basic meaning of the term data mining by walking through a toy example. You can learn more in the data mining beginners guide.
Data Mining History:
In the 1960s, statisticians used the terms “Data Fishing” or “Data Dredging” to refer to what they considered the bad practice of analyzing data without a prior hypothesis. The term “Data Mining” appeared around 1990 in the database community.
Data mining in technical terms:
Technically, data mining is the process of extracting specific information from data and presenting relevant, usable information that can be used to solve problems. The process covers different kinds of services, such as text mining, web mining, audio and video mining, pictorial data mining, and social network data mining.
Why is data mining a hot topic for this generation?
Data mining is a young and promising field for the present generation because of its wide range of applications. It has attracted a great deal of attention in the information industry and in society at large, due to the wide availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection, and customer retention to production control and science exploration. This is why data mining is also called knowledge discovery from data.
Data Mining Techniques :
Data Mining Applications:
Understanding data mining with an apple-buying example:
Before explaining data mining with these fresh apples, let me share some interesting facts about apples.
Nutrition: According to the United States Department of Agriculture, a typical apple serving weighs 242 grams and contains 126 calories with significant dietary fiber and modest vitamin C content, with otherwise a generally low content of essential nutrients.
Toxicity of apple seeds: The seeds of apples contain small amounts of amygdalin, a sugar and cyanide compound known as a cyanogenic glycoside. Ingesting small amounts of apple seeds will cause no ill effects, but extremely large doses can cause adverse reactions. There is only one known case of fatal cyanide poisoning from apple seeds; in this case, the individual chewed and swallowed one cup of seeds. It may take several hours before the poison takes effect, as cyanogenic glycosides must be hydrolyzed before the cyanide ion is released.
Now let’s step into an example to build a basic understanding of a data mining model:
Suppose your family wants to visit someone who is suffering from pancreatic cancer. It is said that eating apples could help reduce the risk of pancreatic cancer by up to 23 percent, so your father asks you to bring apples from a nearby shop. He also teaches you how to buy apples by giving you a set of rules (this is the learning step).
Rules for buying apples:
Looking closely at the rules listed above, you can pick the apples you want to buy. Your family wants to give these apples to a person who is unwell, so you pick green apples. When you go shopping, you will therefore pick small apples that are green. That is how you select apples that are good for health.
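The outcome of the story, picking small green apples, can be sketched as a tiny rule-based selector in Python. This is only an illustration: the `Apple` class, the `pick_apples` function, and the sample shelf are hypothetical, and the rule encodes just the conclusion stated in the text, not the full rule list.

```python
from dataclasses import dataclass


@dataclass
class Apple:
    color: str  # e.g. "green" or "red"
    size: str   # e.g. "small" or "large"


def pick_apples(apples):
    """Apply the learned rule: keep only small, green apples."""
    return [a for a in apples if a.color == "green" and a.size == "small"]


# Hypothetical apples on the shop shelf.
shelf = [
    Apple("red", "large"),
    Apple("green", "small"),
    Apple("green", "large"),
    Apple("red", "small"),
]

chosen = pick_apples(shelf)
print(chosen)
```

Here the rules are written by hand, just as the father states them. In a data mining setting, the aim is to derive such rules from recorded examples instead.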
What Is Data Mining: Definition, Benefits, Applications, Top Techniques, and More
We are living in an information-rich, data-driven world. While it’s comforting to know there’s a plethora of readily available knowledge, the sheer volume creates challenges. The more information available, the longer it can take to find the useful insights you need.
That’s why today we’re discussing data mining. We’ll be exploring all aspects of data mining, including what it means, its stages, data mining techniques, the benefits it offers, data mining tools, and more. Let’s kick things off with a data mining definition, then tackle data mining concepts and techniques.
We will begin by understanding what data mining is.
What is Data Mining?
Typically, when someone talks about “mining,” it involves people wearing helmets with lamps attached to them, digging underground for natural resources. And while it could be funny picturing guys in tunnels mining for batches of zeroes and ones, that doesn’t exactly answer “what is data mining.”
Data mining is the process of analyzing enormous amounts of information and datasets, extracting (or “mining”) useful intelligence to help organizations solve problems, predict trends, mitigate risks, and find new opportunities. Data mining is like actual mining because, in both cases, the miners are sifting through mountains of material to find valuable resources and elements.
Data mining also includes establishing relationships and finding patterns, anomalies, and correlations to tackle issues, creating actionable information in the process. Data mining is a wide-ranging and varied process that includes many different components, some of which are even confused for data mining itself. For instance, statistics is a portion of the overall data mining process, as explained in this data mining vs. statistics article.
Additionally, both data mining and machine learning fall under the general heading of data science, and though they have some similarities, each process works with data in a different way. If you want to know more about their relationship, read up on data mining vs. machine learning.
Data mining is sometimes called Knowledge Discovery in Data, or KDD.
Now that we have learned what data mining is, we will look at the data mining steps.
Data Mining Steps
When asking “what is data mining,” let’s break it down into the steps data scientists and analysts take when tackling a data mining project.
1. Understand the Business
What is the company’s current situation, what are the project’s objectives, and what defines success?
2. Understand the Data
Figure out what kind of data is needed to solve the issue, and then collect it from the proper sources.
3. Prepare the Data
Resolve data quality problems like duplicate, missing, or corrupted data, then prepare the data in a format suitable to resolve the business problem.
4. Model the Data
Employ algorithms to ascertain data patterns. Data scientists create, test, and evaluate the model.
5. Evaluate the Data
Decide whether, and how effectively, the results delivered by a particular model will help meet the business goal or remedy the problem. Sometimes there’s an iterative phase for finding the best algorithm, especially if the data scientists don’t get it quite right the first time. There may be some shopping around among data mining algorithms.
6. Deploy the Solution
Give the results of the project to the people in charge of making decisions.
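Steps 3 through 5 above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a real project workflow. The records (hours played, whether the user churned) are invented, the "model" is a single learned threshold, and the helper names `prepare`, `fit_threshold`, and `accuracy` are hypothetical.

```python
# Hypothetical raw records: (hours_played, churned). None marks a missing value.
raw = [
    (1.0, 1), (1.5, 1), (2.0, 1), (2.5, 1), (3.0, 1),
    (6.0, 0), (7.0, 0), (8.0, 0), (9.0, 0), (10.0, 0),
    (2.0, 1),    # exact duplicate
    (None, 0),   # missing value
]


def prepare(records):
    """Step 3: drop rows with missing values and exact duplicates."""
    seen, clean = set(), []
    for rec in records:
        if None in rec or rec in seen:
            continue
        seen.add(rec)
        clean.append(rec)
    return clean


def fit_threshold(train):
    """Step 4: choose t so that the rule 'x < t means churn' makes the fewest training errors."""
    best_t, best_err = None, float("inf")
    for t in sorted({x for x, _ in train}):
        err = sum((1 if x < t else 0) != y for x, y in train)
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def accuracy(t, test):
    """Step 5: share of held-out records the rule classifies correctly."""
    return sum((1 if x < t else 0) == y for x, y in test) / len(test)


clean = prepare(raw)
test = clean[::3]  # simple deterministic hold-out: every third record
train = [rec for i, rec in enumerate(clean) if i % 3 != 0]

threshold = fit_threshold(train)
score = accuracy(threshold, test)
print(threshold, score)
```

If the held-out score were poor, this is the point at which one would loop back and try a different algorithm or prepare the data differently, as step 5 describes.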
To extend our learning on what data mining is, we will next look at the benefits.
What Are the Benefits of Data Mining?
Since we live and work in a data-centric world, it’s essential to get as many advantages as possible. Data mining provides us with the means of resolving problems and issues in this challenging information age. Data mining benefits include:
Having learned what data mining is, let us look into the drawbacks.
Are There Any Drawbacks to Data Mining?
Nothing’s perfect, including data mining. These are the major issues in data mining:
Having gone through what data mining is, let us look into the various kinds of tools.
What Kinds of Data Mining Tools Are Out There?
As engineers are fond of saying, “Use the right tool for the right job.” Here is a selection of tools and techniques that provide data analysts with diverse data mining functionalities.
Artificial Intelligence
Association Rule Learning
Clustering
Classification
Data Analytics
Data Cleansing and Preparation
Data Warehousing
Machine Learning
Regression
Two specific tools need mentioning.
Continuing our look at what data mining is, let us now turn to the applications.
Data Mining Applications
Data mining is a useful and versatile tool for today’s competitive businesses. Here are some data mining examples, showing a broad range of applications.
Banks
Data mining helps banks work with credit ratings and anti-fraud systems, analyzing customer financial data, purchasing transactions, and card transactions. Data mining also helps banks better understand their customers’ online habits and preferences, which helps when designing a new marketing campaign.
Healthcare
Data mining helps doctors create more accurate diagnoses by bringing together every patient’s medical history, physical examination results, medications, and treatment patterns. Mining also helps fight fraud and waste and bring about a more cost-effective health resource management strategy.
Marketing
If there was ever an application that benefitted from data mining, it’s marketing! After all, marketing’s heart and soul is all about targeting customers effectively for maximum results. Of course, the best way to target your audience is to know as much about them as possible. Data mining helps bring together data on age, gender, tastes, income level, location, and spending habits to create more effective personalized loyalty campaigns. Data mining can even predict which customers are more likely to unsubscribe from a mailing list or other related service. Armed with that information, companies can take steps to retain those customers before they get the chance to leave!
Retail
Retail and marketing go hand in hand, but retail still warrants its own listing. Retail stores and supermarkets can use purchasing patterns to narrow down product associations and determine which items should be stocked in the store and where they should go. Data mining also pinpoints which campaigns get the most response.
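To make "product associations" more concrete, here is a minimal Python sketch that counts how often pairs of items appear together in shopping baskets and derives two common association measures, support and confidence. The baskets are invented for illustration and the `pair_stats` helper is hypothetical. Real market-basket analysis typically uses dedicated algorithms and far more data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]


def pair_stats(baskets):
    """Compute support and confidence for every item pair that co-occurs."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        # Sort so each pair has one canonical key, e.g. ("bread", "butter").
        pair_counts.update(combinations(sorted(basket), 2))
    n = len(baskets)
    stats = {}
    for (a, b), together in pair_counts.items():
        stats[(a, b)] = {
            "support": together / n,                  # share of baskets with both items
            "conf_a_to_b": together / item_counts[a],  # P(b in basket | a in basket)
            "conf_b_to_a": together / item_counts[b],  # P(a in basket | b in basket)
        }
    return stats


stats = pair_stats(baskets)
print(stats[("bread", "butter")])
```

In this sample, every basket containing butter also contains bread, which is the kind of pattern a store might use when deciding where to shelve the two items.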
Do You Want to Study Data Analytics?
There’s a lot of data generated every day, and consequently, there is a correspondingly great demand for professionals to analyze that information using techniques like data mining. Simplilearn’s Data Analytics Bootcamp is the perfect data analytics certification course for anyone on a data scientist career path.
This program, held in partnership with Purdue University and in collaboration with IBM, gives you broad exposure to key technologies and skills currently used in data analytics and data science. You will learn statistics, Python, R, Tableau, SQL, and Power BI. Once you complete this comprehensive data analytics course, you will be ready to take on a professional data analytics role.
According to Indeed, data scientists can earn an annual average of USD 122,875. Additionally, there is an ever-growing, healthy demand for data scientists. Let Simplilearn help you find that new career. Check out the courses today and get a start on your rewarding data-driven future!
About the Author
Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud Computing, Project Management, Data Science, IT, Software Development, and many other emerging technologies.