Let’s talk about data mining. What is data mining? Data mining, also called exploratory data analysis, is the discovery of a model in data; it uncovers useful, valid, unexpected, and understandable knowledge from the data. It shares some goals with other fields, such as statistics, artificial intelligence, machine learning, and pattern recognition, and it has most often been treated as an algorithmic problem. Clustering, classification, association rule learning, anomaly detection, regression, and summarization are all tasks that belong to data mining.
Data mining methods can be summarized into two main categories of data mining problems: feature extraction and summarization.
Feature extraction
The aim here is to extract the most prominent features of the data and ignore the rest. Here are some examples:
- Frequent itemsets: This model makes sense for data that consists of baskets of small sets of items.
- Similar items: Sometimes your data looks like a collection of sets, and the objective is to find pairs of sets that share a relatively large fraction of their elements. This is a fundamental problem of data mining.
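As an illustration of the frequent-itemset model, here is a minimal Python sketch that counts pairs of items that co-occur in baskets; the item names and the support threshold are invented for the example:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return the item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair is counted under one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"bread", "milk"},
]
print(frequent_pairs(baskets, min_support=2))  # {('bread', 'milk'): 3}
```

Real algorithms such as A-Priori avoid materializing every pair, but the model — counting co-occurring itemsets in baskets — is the same.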
Summarization
The target here is to summarize the dataset succinctly and approximately. An example is clustering: the process of examining a collection of points (data) and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.
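The clustering idea can be sketched with a bare-bones k-means loop in Python; the sample points and the choice of Euclidean distance are illustrative assumptions:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Group points so that each point sits near its cluster's centroid."""
    random.seed(seed)
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest centroid.
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its points.
        centroids = [
            tuple(sum(coord) / len(pts) for coord in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = kmeans(points, k=2)  # the two tight groups end up in separate clusters
```

With well-separated groups such as these, the loop converges to the intuitive split regardless of the random start.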
The data mining process
Two popular process models define the data mining process from different perspectives; the more widely adopted one is CRISP-DM:
- Cross-Industry Standard Process for Data Mining (CRISP-DM)
- Sample, Explore, Modify, Model, Assess (SEMMA), which was developed by the SAS Institute, USA
There are six phases in the CRISP-DM process, shown in the following figure; the sequence is not rigid, and a great deal of backtracking is common:
Let’s look at the phases in detail:
- Business understanding: This task includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a plan.
- Data understanding: This task evaluates data requirements and includes initial data collection, data description, data exploration, and the verification of data quality.
- Data preparation: The data resources identified in the previous step are selected, cleaned, and built into the desired form and format.
- Modeling: Visualization and cluster analysis are useful for initial analysis. The initial association rules can be developed by applying tools such as generalized rule induction, a data mining technique that discovers knowledge in the form of rules describing the data as causal relationships between conditional factors and a given decision or outcome. Models appropriate to the data type can also be applied.
- Evaluation: The results should be evaluated in the context of the business objectives specified in the first step. New needs identified here revert the process to a prior phase in most cases.
- Deployment: Data mining can be used both to verify previously held hypotheses and for knowledge discovery.
Here is an overview of the process for SEMMA:
Let’s look at these processes in detail:
- Sample: In this step, a portion of a large dataset is extracted
- Explore: To gain a better understanding of the dataset, unanticipated trends and anomalies are searched in this step
- Modify: The variables are created, selected, and transformed to focus on the model construction process
- Model: Various combinations of models are searched to predict the desired outcome
- Assess: The findings from the data mining process are evaluated for their usefulness and reliability
Social network mining
As mentioned before, data mining finds a model in data; social network mining finds a model in the graph data that represents a social network.
Social network mining is one application of web data mining; popular applications include social sciences and bibliometrics, PageRank and HITS, shortcomings of the coarse-grained graph model, enhanced models and techniques, evaluation of topic distillation, and measuring and modeling the Web.
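Of the techniques just mentioned, PageRank lends itself to a compact sketch. Here is a minimal power-iteration version in Python; the toy link structure is invented, and dangling-node handling is omitted for brevity:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank on a dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform distribution
    for _ in range(iters):
        # Each page keeps a (1 - damping) share, plus what its in-links pass along.
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)  # C accumulates the most rank: both A and B link to it
```

Because every page here has out-links, the total rank mass stays at 1; production implementations also redistribute the mass of dangling pages.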
When it comes to the discussion of social networks, Facebook, Google+, LinkedIn, and so on come to mind. The essential characteristics of a social network are as follows:
- There is a collection of entities that participate in the network. Typically, these entities are people, but they could be something else entirely.
- There is at least one relationship between the entities of the network. On Facebook, this relationship is called friends. Sometimes, the relationship is all-or-nothing; two people are either friends or they are not. However, in other examples of social networks, the relationship has a degree. This degree could be discrete, for example, friends, family, acquaintances, or none as in Google+. It could be a real number; an example would be the fraction of the average day that two people spend talking to each other.
- There is an assumption of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related.
Here are some varieties of social networks:
- Telephone networks: The nodes in this network are phone numbers and represent individuals
- E-mail networks: The nodes represent e-mail addresses, which represent individuals
- Collaboration networks: The nodes represent individuals who have published research papers; an edge connects two nodes if the two individuals published one or more papers jointly
Social networks are modeled as undirected graphs. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges.
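The undirected, labeled-edge model above can be represented with a simple adjacency map. This is a minimal Python sketch; the names and weights are made up for the example:

```python
def add_edge(graph, a, b, weight=1.0):
    """Add an undirected edge; the weight labels the degree of the relationship."""
    graph.setdefault(a, {})[b] = weight
    graph.setdefault(b, {})[a] = weight  # undirected: store both directions

def common_neighbors(graph, a, b):
    """Entities related to both a and b -- the 'locality' intuition in action."""
    return set(graph.get(a, {})) & set(graph.get(b, {}))

graph = {}
add_edge(graph, "Alice", "Bob", 0.8)    # close friends
add_edge(graph, "Alice", "Carol", 0.3)  # acquaintances
add_edge(graph, "Bob", "Carol", 0.5)

print(common_neighbors(graph, "Bob", "Carol"))  # {'Alice'}
```

The locality assumption predicts exactly this pattern: because Alice is related to both Bob and Carol, a Bob–Carol tie is more likely than chance.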
Here is an example in which Coleman’s High School Friendship Data from the sna R package is used for analysis. The data comes from a study of friendship ties among 73 boys in a high school during one academic year; reported ties for all informants are provided for two time points (fall and spring). The dataset’s name is coleman, which is of the array type in the R language. Each node denotes a specific student, and each line represents a tie between two students.
Text mining
Text mining operates on text data; it is concerned with extracting relevant information from large bodies of natural language text and searching for interesting relationships, syntactical correlations, or semantic associations between the extracted entities or terms. It is also defined as the automatic or semiautomatic processing of text. Related algorithms include text clustering, text classification, natural language processing, and web mining.
One characteristic of text mining is that the source data mixes text with numbers; in other words, the source dataset contains hybrid data types. The text is usually a collection of unstructured documents, which are preprocessed and transformed into a numerical, structured representation. After the transformation, most data mining algorithms can be applied with good effect.
The process of text mining is described as follows:
- Text mining starts by preparing the text corpus, which consists of reports, letters, and so forth
- The second step is to build a semistructured text database that is based on the text corpus
- The third step is to build a term-document matrix in which the term frequency is included
- The final result is further analysis, such as text analysis, semantic analysis, information retrieval, and information summarization
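The third step above, building a term-document matrix, can be sketched as follows; whitespace tokenization and the two sample documents are simplifying assumptions:

```python
from collections import Counter

def term_document_matrix(docs):
    """Rows are terms, columns are documents; each cell holds a term frequency."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))          # one row per distinct term
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

docs = ["data mining finds models", "text mining mines text data"]
terms, matrix = term_document_matrix(docs)
# terms  -> ['data', 'finds', 'mines', 'mining', 'models', 'text']
# matrix -> the 'text' row is [0, 2]: absent from doc 1, twice in doc 2
```

Real pipelines add stemming, stop-word removal, and weighting schemes such as TF-IDF, but the matrix shape is the same.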
Information retrieval and text mining
Information retrieval helps users find information and is most commonly associated with online documents. It focuses on the acquisition, organization, storage, retrieval, and distribution of information. The task of Information Retrieval (IR) is to retrieve relevant documents in response to a query, and its fundamental technique is measuring similarity. The key steps in IR are as follows:
- Specify a query. The following are some of the types of queries:
- Keyword query: This is expressed by a list of keywords to find documents that contain at least one keyword
- Boolean query: This is constructed with Boolean operators and keywords
- Phrase query: This is a query that consists of a sequence of words that makes up a phrase
- Proximity query: This is a relaxed version of the phrase query and can be a combination of keywords and phrases
- Full document query: This query is a full document to find other documents similar to the query document
- Natural language questions: This query helps to express users’ requirements as a natural language question
- Search the document collection.
- Return the subset of relevant documents.
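Since the fundamental technique of IR is measuring similarity, here is a minimal sketch that ranks documents against a query by cosine similarity over raw term frequencies; this is a simplification, as real systems typically weight terms, for example with TF-IDF:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts treated as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * \
           math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

query = "data mining process"
docs = [
    "the data mining process has six phases",
    "web usage mining finds access patterns",
]
# Return the documents most similar to the query first.
ranked = sorted(docs, key=lambda d: cosine(query, d), reverse=True)
```

The first document shares three query terms with the query and ranks above the second, which shares only one.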
Mining text for prediction
Predicting results from text is just as ambitious as prediction in numerical data mining and has similar problems to those of numerical classification. It is generally a classification issue.
Prediction from text requires prior experience, drawn from sample documents, to learn how to make predictions on new documents. Once the text is transformed into numeric data, prediction methods can be applied.
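Once text is transformed into numeric data, a prediction method can be as simple as a one-nearest-neighbor classifier over bags of words. This sketch uses invented training documents and labels:

```python
import math
from collections import Counter

def vec(text):
    """Turn a text into a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(va, vb):
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict(train, text):
    """Assign the label of the most similar training document (1-nearest-neighbor)."""
    query = vec(text)
    return max(train, key=lambda pair: cosine(vec(pair[0]), query))[1]

train = [
    ("stocks fell as markets reacted", "finance"),
    ("the team won the championship game", "sports"),
]
label = predict(train, "markets rallied and stocks rose")  # -> "finance"
```

The new document shares the terms "markets" and "stocks" with the finance example and nothing with the sports one, so it inherits the finance label.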
Web data mining
Web mining aims to discover useful information or knowledge from the web hyperlink structure, page, and usage data. The Web is one of the biggest data sources to serve as the input for data mining applications.
Web data mining is based on IR, machine learning (ML), statistics, pattern recognition, and data mining. Web mining is not purely a data mining problem, because web data is heterogeneous and semistructured or unstructured, although many data mining approaches can be applied to it.
Web mining tasks can be divided into at least three types:
- Web structure mining: This helps to find useful information or valuable structural summary about sites and pages from hyperlinks
- Web content mining: This helps to mine useful information from web page contents
- Web usage mining: This helps to discover user access patterns from web logs to detect intrusion, fraud, and attempted break-in
The algorithms applied to web data mining originate from classical data mining algorithms and share many similarities with them, such as the mining process; however, differences exist too. The characteristics of web data make web mining different from classical data mining for the following reasons:
- The data is unstructured
- The information of the Web keeps changing and the amount of data keeps growing
- Any data type is available on the Web, such as structured and unstructured data
- Heterogeneous information is on the web; redundant pages are present too
- Vast amounts of information on the web are linked
- The data is noisy
Web data mining differs from classical data mining in the huge, dynamic volume of the source dataset, the wide variety of data formats, and so on. The most popular data mining tasks related to the Web are as follows:
- Information extraction (IE): The task of IE consists of a number of steps: tokenization, sentence segmentation, part-of-speech assignment, named entity identification, phrasal parsing, sentential parsing, semantic interpretation, discourse interpretation, template filling, and merging.
- Natural language processing (NLP): This studies the linguistic characteristics of human-human and human-machine interaction, models of linguistic competence and performance, frameworks to implement processes with such models, iterative refinement of those processes and models, and evaluation techniques for the resulting systems. Classical NLP tasks related to web data mining are tagging, knowledge representation, ontologies, and so on.
- Question answering: The goal is to find the answer to questions posed in natural language from a collection of text. It can be categorized into slot filling, limited domain, and open domain, with the latter being the most difficult. One simple example is answering customer queries based on a predefined FAQ.
- Resource discovery: The popular applications are collecting important pages preferentially; similarity search using link topology, topical locality and focused crawling; and discovering communities.
We have looked at the broad aspects of data mining here. In case you are wondering what to look at next, check out how to “data mine” in R with Learning Data Mining with R (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-r).
If R is not your taste, you can “data mine” with Python as well. Check out Learning Data Mining with Python (https://www.packtpub.com/big-data-and-business-intelligence/learning-data-mining-python).