Keywords: Crime news, data analytics, OLAP, text mining, data warehouse

Introduction
Crime reflects social problems and should be prevented or reduced for the well-being of residents and tourists in affected areas. Crime information is published daily on news websites and read by millions of people worldwide. Unfortunately, this information is rarely used for public benefit, e.g. to improve safety and security. According to a survey by the Numbeo website covering the first 6 months of 2016 [1], the crime index of Thailand is 52.16, the 4th highest among South-East Asian countries; Malaysia has the highest crime index at 65.56, followed by Vietnam at 53.45 and Cambodia at 52.72. Within Thailand, the statistics show that Pattaya has the highest crime index at 58.55, followed by Phuket (57.38), Bangkok (47.70), and Chiang Mai (34.94). Notably, these are all major tourist cities.
*Presented at 1st International Conference on Information Technology: October 27th – 28th, 2016
Unify Framework for Crime Data Summarization Tichakorn NETSUWAN and Kraisak KESORN http://wjst.wu.ac.th
Walailak J Sci & Tech 2017; 14(10) 770

Typically, crime information is collected manually, and only when someone reports an incident to the police. The data is usually stored in a database from which police officers retrieve, analyze, and manually generate crime reports, which is very labor intensive. This method has 2 main limitations: 1) the data is not up to date because manual collection is time consuming, and 2) scalability problems can arise as the number of tables and the amount of data grow, so that querying and reporting slow down because of the integrity checking performed by the Database Management System (DBMS). To address these limitations, we propose to exploit crime news available on the Internet to support the decision making of police officers by constructing a crime news extraction and analytics system that includes the following functions.
1) Extract crime data from online news websites for later use in data analytics.
2) Store crime data in a data warehouse (DW) using a star schema, which efficiently resolves the scalability problem and effectively represents data in multidimensional form for online analytical processing (OLAP).
3) Generate interactive reports of this data via a website on which users can perform drill-down or roll-up operations.

The proposed system can help police officers form policies to prevent crime in at-risk areas.

Related works
News influences our lifestyles and, thus, several researchers have exploited news to automate analytics in various domains. Shojaee et al. [2] studied classification learning algorithms for predicting crime status using a crime dataset from the University of California, Irvine (UCI) Machine Learning Repository. However, the UCI crime dataset is not kept up to date, which limits its practical usefulness. Yang et al. [3] studied learning approaches for detecting and tracking news events of interest using Reuters and CNN news stories. However, the presented method is useful only in some circumstances and requires users to be involved in the event identification loop. Seo et al. [4] presented financial news analysis for intelligent portfolio management, in which a text classification agent uses information retrieval techniques to complement quantitative financial information. The news articles were gathered from various electronic news providers, e.g. CNN Financial Network, Forbes, Reuters, NewsFactors, Motley Fool, CNet, ZDNet, Morningstar.com, Associated Press (AP), AP Financial, and Business Wire. Nonetheless, it is unclear how they store data for later use or address scalability.

Online news allows users to access news easily anytime and anywhere via mobile devices and personal computers. Really Simple Syndication (RSS) technology enables publishers to syndicate data automatically, and its standard XML file format ensures compatibility with many different machines and programs. RSS feeds also benefit users who want to receive timely updates from favorite websites or to aggregate data from many sites. Wanglee et al. [5] developed an automatic news aggregator system in which users can read all news. However, crime news has not been used to support the decision making of police officers in the literature. Sudhahar et al. [6] presented a system for large-scale quantitative narrative analysis (QNA) of news corpora.
The task is to identify the key actors (criminals and victims) in the news and the actions they performed. The system showed that men were most commonly responsible for crimes against the person, while women and children were most often the victims of those crimes. Wang et al. [7] analyzed online news and classified it into categories by means of adaptive clustering. Moreover, news comments were classified into categories such as negative and positive, and grouped into clusters to help experts understand the public's view of the news. The main purpose of that work is to help experts find the news that concerns people the most. However, its main drawback is that it does not store data for offline processing and cannot represent data in multiple dimensions; thus, it does not efficiently support users' decision making.

Most studies in the literature use a database system to store the extracted data, which usually suffers from integrity checking overhead, resulting in poor query speed when the number of tables and the volume of data become large. To the best of our knowledge, no crime news extraction and analytics system exists in the literature, and none of the existing systems uses a data warehouse as the architecture for data storage. Hence, we present this idea as the main contribution of this paper.

Materials and methods
To develop the online crime news extraction and analysis framework, 4 major components are introduced, as illustrated in Figure 1 and described as follows.
(1) Collection: this work focuses on English-language crime news only, which was collected automatically from http://www.pattayapeople.com using its RSS feed service. The resulting collection contains 1,596 crime news articles.
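The collection step can be illustrated with a minimal sketch that parses an RSS 2.0 document into news records. This is not the authors' implementation: the feed content in `SAMPLE_RSS`, the function name `parse_rss_items`, and the record fields are assumptions made for illustration, and an inline sample is used in place of a live feed so that the example runs without network access.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example crime news feed</title>
    <item>
      <title>Tourist robbed on beach road</title>
      <link>http://example.com/news/1</link>
      <pubDate>Mon, 04 Jul 2016 08:30:00 +0700</pubDate>
      <description>A tourist reported a robbery late on Sunday night.</description>
    </item>
    <item>
      <title>Police arrest suspect in theft case</title>
      <link>http://example.com/news/2</link>
      <pubDate>Tue, 05 Jul 2016 10:00:00 +0700</pubDate>
      <description>Officers arrested a man accused of stealing a motorbike.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(rss_text):
    """Return one dict per <item> element with title, link, date, and summary."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        # RSS dates use the RFC 822 format, which the email module can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "date": published.date().isoformat(),
            "summary": item.findtext("description", default="").strip(),
        })
    return items


news_items = parse_rss_items(SAMPLE_RSS)
for entry in news_items:
    print(entry["date"], entry["title"])
```

In a real deployment the sample string would be replaced by the text downloaded from the feed URL, and the resulting records would be passed on to the extraction step.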
(2) Extraction and analysis: key information is extracted from the collected data and analyzed using the GATE framework [8]. GATE is preferred because it is an open-source framework that is well known among researchers in this area. This step includes 3 sub-processes: 1) Named entity extraction to detect names, locations, organizations, addresses, dates, etc. 2) Crime term detection to find key terms related to crimes. This research uses Oxford Learner’s Dictionaries [9] and MyVocabulary [10] to detect crime keywords. Examples of crime terms are shown in Table 1. 3) Extension of the crime terms to other related data. Figure 2 shows examples of indirectly relevant data that can be derived from the crime terms, e.g. gender, criminal, and victims. To find this information, several related sources are used, e.g. the system date, name prefixes, and grammatical structure rules. Table 2 shows examples of grammatical structure rules for criminal and victim identification. At this stage of the experiment, the rules are fixed according to the grammatical formalism; a user who wants more rules can add them manually.
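The following sketch illustrates the idea behind sub-processes 2) and 3) using plain regular expressions rather than GATE. The crime lexicon, the prefix-to-gender mapping, and the two grammatical rules are hypothetical stand-ins for Tables 1 and 2, which are not reproduced here; the names in the example sentence are invented.

```python
import re

# Illustrative crime lexicon (assumption); the paper uses external dictionaries.
CRIME_TERMS = {"arrested", "assault", "murder", "robbed", "robbery", "theft"}

# Assumed mapping from name prefix to gender.
PREFIX_GENDER = {"Mr.": "male", "Mrs.": "female", "Ms.": "female", "Miss": "female"}

# A prefixed personal name, e.g. "Mr. Somchai Jaidee".
NAME = r"(Mr\.|Mrs\.|Ms\.|Miss)\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"

# Hypothetical rules: "arrested <NAME>" marks a criminal,
# "<NAME> was robbed/attacked/assaulted/killed" marks a victim.
CRIMINAL_RULE = re.compile(r"arrested\s+" + NAME)
VICTIM_RULE = re.compile(NAME + r"\s+was\s+(?:robbed|attacked|assaulted|killed)")


def describe_person(match):
    """Turn a rule match into a name plus the gender implied by the prefix."""
    if match is None:
        return None
    prefix, name = match.group(1), match.group(2)
    return {"name": name, "gender": PREFIX_GENDER[prefix]}


def extract_crime_info(text):
    """Detect crime terms and apply the criminal and victim rules to one text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        "crime_terms": sorted(words & CRIME_TERMS),
        "criminal": describe_person(CRIMINAL_RULE.search(text)),
        "victim": describe_person(VICTIM_RULE.search(text)),
    }


sentence = ("Police arrested Mr. Somchai Jaidee after "
            "Mrs. Anna Smith was robbed near the beach.")
info = extract_crime_info(sentence)
print(info)
```

Fixed patterns of this kind are easy to inspect and extend by hand, which matches the manually maintained rules described above, but they only cover the sentence structures that have been anticipated.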
(3) Classification: text mining is used to classify crime news into 5 categories. Table 3 lists the crime categories [11] used in this work. This task aims to show which crime types occur most frequently in Pattaya. To classify crime news, the frequency of every extracted keyword is counted and placed in a matrix (Table 4).
Figure 1 Online crime news extraction and analysis framework.
Finally, the importance of each keyword to each news article is computed using TF-IDF (Eqs. (1) and (2)) [12], and an Artificial Neural Network (ANN) is applied to the document classification task.
$$IDF_k = \log(n / DF_k) \qquad (1)$$
$$TF\text{-}IDF = TF_{kj} \times IDF_k \qquad (2)$$
where n is the number of news articles in the collection, TF_{kj} is the frequency of term k in document j, and DF_k is the number of documents containing term k.
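As a worked illustration of Eqs. (1) and (2), the sketch below counts keyword frequencies per news article and weights them by inverse document frequency. The three tokenized documents are invented for the example, and the natural logarithm is assumed because the base of the logarithm is not stated.

```python
import math
from collections import Counter

# Invented keyword lists for three news articles (illustration only).
docs = {
    "news1": ["robbery", "tourist", "robbery", "police"],
    "news2": ["theft", "police", "arrest"],
    "news3": ["robbery", "arrest"],
}


def tf_idf(docs):
    """Return term counts, IDF values (Eq. 1), and TF-IDF weights (Eq. 2)."""
    n = len(docs)
    # TF_kj: how often term k occurs in document j.
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # DF_k: number of documents that contain term k.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    # Natural logarithm assumed; the base only rescales all weights.
    idf = {term: math.log(n / df[term]) for term in df}
    weights = {
        doc_id: {term: counts[term] * idf[term] for term in counts}
        for doc_id, counts in tf.items()
    }
    return tf, idf, weights


tf, idf, weights = tf_idf(docs)
for doc_id, row in weights.items():
    print(doc_id, {term: round(value, 3) for term, value in row.items()})
```

Here "robbery" appears in 2 of the 3 articles, so its IDF is log(3/2), and its weight in news1 is twice that value because it occurs twice there. The resulting weight vectors are the kind of input a classifier such as an ANN would receive.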
[Figure 1 diagram labels: Online News (RSS); Crime data Extraction (1. Name Entity Extraction: Date, Location, Person; 2. Crime Terms; 3. Other Indirectly Relevant Information); News Classification; Export XML files; ETL (Extract, Transform, Load); Data warehouse; OLAP; Admin; User; numbered steps 1 to 4.]

(4) Representation: the extracted crime data is represented in multidimensional form using OLAP, a process that integrates data based on a star schema. Microsoft SQL Server 2008 R2 is used as the tool for DW construction in this work. It supports data analytics over multidimensional tables that are pre-summarized across dimensions, which drastically improves query speed over relational databases. In addition, designing data at multiple levels of aggregation allows users to perform drill-down and roll-up operations. Drill-down presents data at a greater level of detail, while roll-up is the reverse operation, presenting data at a lower level of detail. Figure 3 shows an example of a star schema, which comprises a fact table and dimension tables.
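The relationship between a star schema and the roll-up and drill-down operations can be sketched with a small in-memory example. It uses Python's built-in sqlite3 module instead of Microsoft SQL Server, and the table names, columns, and counts are all hypothetical; they are not the schema of Figure 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the axes of analysis (hypothetical schema).
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_crime_type (crime_type_id INTEGER PRIMARY KEY, category TEXT);
-- The fact table holds the measure and a key into each dimension.
CREATE TABLE fact_crime (
    date_id INTEGER REFERENCES dim_date(date_id),
    crime_type_id INTEGER REFERENCES dim_crime_type(crime_type_id),
    incident_count INTEGER
);
-- Made-up rows for illustration only.
INSERT INTO dim_date VALUES (1, 2015, 12), (2, 2016, 1), (3, 2016, 2);
INSERT INTO dim_crime_type VALUES (1, 'Theft'), (2, 'Assault');
INSERT INTO fact_crime VALUES (1, 1, 4), (2, 1, 3), (2, 2, 1), (3, 1, 2);
""")

# Roll-up: aggregate incidents to the coarser year level.
rollup = conn.execute("""
    SELECT d.year, SUM(f.incident_count)
    FROM fact_crime f JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY d.year ORDER BY d.year
""").fetchall()

# Drill-down: break the same totals down to the finer month level.
drilldown = conn.execute("""
    SELECT d.year, d.month, SUM(f.incident_count)
    FROM fact_crime f JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY d.year, d.month ORDER BY d.year, d.month
""").fetchall()

print("By year:", rollup)
print("By year and month:", drilldown)
conn.close()
```

Both queries read the same fact table; only the grouping level of the date dimension changes, which is what makes moving between summary and detail views cheap in this design. An OLAP server would additionally pre-compute such aggregates, which this sketch does not do.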