Let’s talk about data mining. What is data mining? Data mining is the discovery of a model in data; it’s also called exploratory data analysis, and discovers useful, valid, unexpected, and understandable knowledge from the data. Some goals are shared with other sciences, such as statistics, artificial intelligence, machine learning, and pattern recognition. Data mining has been frequently treated as an algorithmic problem in most cases. Clustering, classification, association rule learning, anomaly detection, regression, and summarization are all part of the tasks belonging to data mining.
(For more resources related to this topic, see here.)
The data mining methods can be summarized into two main categories of data mining problems: feature extraction and summarization.
This is to extract the most prominent features of the data and ignore the rest. Here are some examples:
The target is to summarize the dataset succinctly and approximately, such as clustering, which is the process of examining a collection of points (data) and grouping the points into clusters according to some measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.
There are two popular processes to define the data mining process in different perspectives, and the more widely adopted one is CRISP-DM:
There are six phases in this process that are shown in the following figure; it is not rigid, but often has a great deal of backtracking:
Let’s look at the phases in detail:
Here is an overview of the process for SEMMA:
Let’s look at these processes in detail:
As we mentioned before, data mining finds a model on data and the mining of social network finds the model on graph data in which the social network is represented.
Social network mining is one application of web data mining; the popular applications are social sciences and bibliometry, PageRank and HITS, shortcomings of the coarse-grained graph model, enhanced models and techniques, evaluation of topic distillation, and measuring and modeling the Web.
When it comes to the discussion of social networks, you will think of Facebook, Google+, LinkedIn, and so on. The essential characteristics of a social network are as follows:
Here are some varieties of social networks:
Social networks are modeled as undirected graphs. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges.
Here is an example in which Coleman’s High School Friendship Data from the sna R package is used for analysis. The data is from a research on friendship ties between 73 boys in a high school in one chosen academic year; reported ties for all informants are provided for two time points (fall and spring). The dataset’s name is coleman, which is an array type in R language. The node denotes a specific student and the line represents the tie between two students.
Text mining is based on the data of text, concerned with exacting relevant information from large natural language text, and searching for interesting relationships, syntactical correlation, or semantic association between the extracted entities or terms. It is also defined as automatic or semiautomatic processing of text. The related algorithms include text clustering, text classification, natural language processing, and web mining.
One of the characteristics of text mining is text mixed with numbers, or in other point of view, the hybrid data type contained in the source dataset. The text is usually a collection of unstructured documents, which will be preprocessed and transformed into a numerical and structured representation. After the transformation, most of the data mining algorithms can be applied with good effects.
The process of text mining is described as follows:
Information retrieval is to help users find information, most commonly associated with online documents. It focuses on the acquisition, organization, storage, retrieval, and distribution for information. The task of Information Retrieval (IR) is to retrieve relevant documents in response to a query. The fundamental technique of IR is measuring similarity. Key steps in IR are as follows:
Prediction of results from text is just as ambitious as predicting numerical data mining and has similar problems associated with numerical classification. It is generally a classification issue.
Prediction from text needs prior experience, from the sample, to learn how to draw a prediction on new documents. Once text is transformed into numeric data, prediction methods can be applied.
Web mining aims to discover useful information or knowledge from the web hyperlink structure, page, and usage data. The Web is one of the biggest data sources to serve as the input for data mining applications.
Web data mining is based on IR, machine learning (ML), statistics, pattern recognition, and data mining. Web mining is not purely a data mining problem because of the heterogeneous and semistructured or unstructured web data, although many data mining approaches can be applied to it.
Web mining tasks can be defined into at least three types:
The algorithms applied to web data mining are originated from classical data mining algorithms; it shares many similarities, such as the mining process, however, differences exist too. The characteristics of web data mining makes it different from data mining for the following reasons:
Web data mining differentiates from data mining by the huge dynamic volume of source dataset, a big variety of data format, and so on. The most popular data mining tasks related to the Web are as follows:
We have looked at the broad aspects of data mining here. In case you are wondering what to look at next, check out how to “data mine” in R with Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r).
If R is not your taste, you can “data mine” with Python as well. Check out Learning Data Mining with Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-python).
Further resources on this subject:
I remember deciding to pursue my first IT certification, the CompTIA A+. I had signed…
Key takeaways The transformer architecture has proved to be revolutionary in outperforming the classical RNN…
Once we learn how to deploy an Ubuntu server, how to manage users, and how…
Key-takeaways: Clean code isn’t just a nice thing to have or a luxury in software projects; it's a necessity. If we…
While developing a web application, or setting dynamic pages and meta tags we need to deal with…
Software architecture is one of the most discussed topics in the software industry today, and…