ITS IMPLEMENTATION RESEARCH CENTER
FINAL REPORT
AUTOMATED IDENTIFICATION OF
TRAFFIC PATTERNS
Brian L. Smith, Ph.D., P.E.
Associate Professor
Ramkumar Venkatanarayana
Transportation Engineer
University of Virginia
Center for Transportation Studies
May 30, 2008
Knowledge of the “normal traffic patterns” on the roadways is essential for a number of transportation applications such as signal system retiming and performance measurement. The simple historic average – the average of all the traffic data in a dataset, by the time of day – has traditionally been used widely to derive these traffic patterns. However, this method is significantly biased by the presence of traffic abnormalities (such as crashes, and inclement weather). One solution to avoid this bias is through visual inspection of the data by experts. The experts could potentially identify and eliminate the traffic abnormalities, and thereby identify the underlying “normal traffic patterns.” Three main challenges of this approach are: (1) the bias introduced due to subjectivity, (2) the additional time required to analyze the data manually, and (3) the increasing sizes of the available traffic data sets.
To address the above challenges, and also exploit new opportunities in the form of traffic data archives, new data analysis tools are essential. In this research study, a new method, the Quantum-Frequency algorithm, was developed, based on density-based clustering. Three other algorithms – K-Means Clustering, Wavelet-based Clustering and Median – were identified as promising algorithms, and developed further.
A 10-step methodology was developed to evaluate these promising algorithms, along with the traditional historic average algorithm. Under this methodology, the above algorithms were applied to several real-world datasets. Based on their regular convergence and high accuracy, the Quantum-Frequency Algorithm (with optimized parameter values) and the Median algorithms are recommended for practical applications.
Key contributions of this study include (1) a new definition for “normal traffic pattern,” (2) development, application and optimization of the Quantum-Frequency Algorithm, (3) development and application of the 10-step evaluation methodology, and (4) first documented quantification of the bias in the widely-used historic average algorithm.
1.2.1 Existing Information Needs
1.2.2.2 Historic Average with Expert Intervention
1.2.2.3 Expert Selection of Normal Day(s)
1.2.3 New Opportunity of Data Archives
1.3 Desirable Properties of the Ideal Algorithm
1.5 Organization of the Report
2.1.3 Concepts of “Normal” and “Abnormal”
2.1.4 Vision and Visualization
2.2.1.1 Optimal Number of Clusters (K)
2.2.2 Density-Based Clustering
3.1 Define Normal Traffic Pattern
3.2 Development and Extension of Candidate Algorithms
3.4.3 Benchmark “Normal Patterns” by Experts
CHAPTER 4 DEFINITION OF “NORMAL TRAFFIC PATTERN”
4.1 Drawbacks of Existing Definitions
4.2 Development of a New Definition
4.2.1 Definition of “Normal Traffic Pattern”
4.2.2 Limitations of the Definition
CHAPTER 5 DEVELOPING AND EXTENDING CANDIDATE ALGORITHMS
5.1 K-Means Clustering Algorithm
5.1.2 Determination of Initial Cluster Centroids
5.1.3 Calculation of Important Features
5.1.4 Calculation of Distances among Data Vectors and Centroids
5.1.6 Iteration and Halting Procedure
5.2 Clustering of Wavelet Decomposed Signals
5.2.1 Mother Wavelet and Scale of Analysis
5.2.2 Determination of Signal for Clustering
5.3 Quantum-Frequency Algorithm
5.3.3 Determination of the Frequency of Insignificance (FI)
5.3.4 Determination of “Normal” Cluster
5.3.5 Determination of Normal Pattern
5.3.6.2 Recognition of Known Abnormalities
5.4 Median and Simple Historic Average Algorithms
CHAPTER 6 PRELIMINARY EVALUATION
6.1 Parameter Initiation and Algorithm Application
6.2.1 Discussion of Evaluation Results
6.3 Optimization of QFA Parameters
7.1 Expert Benchmarking Details
7.3 Optimization of QFA Parameters
7.4.1 Discussion of Evaluation Results
7.4.2 Final Algorithm Selection
CHAPTER 8 CONCLUSIONS AND RECOMMENDATIONS
8.2 Contributions of the Research
8.3 Recommendations for Future Research
TABLE 2-1 Preliminary Assessment of Initial Candidate Algorithms
TABLE 2-2 Interpretation of Silhouette Widths
TABLE 2-3 Interpretation of Maximum Average Silhouette Widths
TABLE 3-1 Datasets for Preliminary Evaluation
TABLE 3-2 Summary of Crash, Weather and Holiday Data Streams
TABLE 3-2 Datasets for Detailed Evaluation
TABLE 6-1 Convergence of Algorithms
TABLE 6-2 Surrogate Accuracy of QFA
TABLE 6-3 Convergence for QFA Parameter Combinations
TABLE 6-4 Objective Function Values for Seasonal Datasets (3 and 4)
FIGURE 1-1 Illustration of Daily Traffic Pattern
FIGURE 1-2 Raw Traffic Data and the Historic Averages
FIGURE 3-1 Research Methodology
FIGURE 3-1 Links for Preliminary Evaluation
FIGURE 3-2 Traffic Dataset 1
FIGURE 3-3 Links for Detailed Evaluation
FIGURE 3-4 Visual Quality Assurance of Expert Analyses
FIGURE 4-1 Illustration of Unknown Traffic Abnormalities
FIGURE 5-1 Decision for Optimum “K” (Variance Feature)
FIGURE 5-2 Floor function for uniform quantization
FIGURE 5-3 Histogram of Actual Traffic Volumes
FIGURE 5-4 Histogram of Quantized Traffic Volumes
FIGURE 5-5 A “Normal” Cluster from Quantum-Frequency Algorithm
FIGURE 5-6 Traffic data, Historic Average, and Normal Pattern from QFA
FIGURE 5-7 Degree of Normality for Quantum-Frequency Algorithm
FIGURE 6-1 Traffic Dataset 1
FIGURE 6-2 Normal Cluster from QFA (60/0)
FIGURE 6-3 Normal Cluster from K-Means Clustering
FIGURE 6-4 Degree of Normality
FIGURE 6-5 Comparison of Normal Patterns
FIGURE 7-1 AM Peak – Mean Errors
FIGURE 7-2 AM Peak – Mean Absolute Errors (MAE)
FIGURE 7-3 AM Peak – Mean Absolute Percentage Errors (MAPE)
FIGURE 7-4 PM Peak – Mean Errors
FIGURE 7-5 PM Peak – Mean Absolute Errors (MAE)
FIGURE 7-6 PM Peak – Mean Absolute Percentage Errors (MAPE)
FIGURE 7-7 Full Day – Mean Errors (ME)
FIGURE 7-8 Full Day – Mean Absolute Errors (MAE)
FIGURE 7-9 Full Day – Mean Absolute Percentage Errors (MAPE)
FIGURE 7-10 Illustration of Bias from Historic Average
FIGURE 7-11 Illustration of Bias from Historic Average (AM Peak)
It is widely accepted that vehicular traffic follows rhythmic patterns, from day to day and through seasons. An understanding of these underlying traffic demand patterns is essential for many transportation applications. Such applications include systems operations, planning, data archiving, and performance measurement. For example, retiming of traffic signals promises quick and significant returns on our operational investments [Sunkari, 2004[1]]. The main reasons supporting signal retiming are: (a) Signal timing is effective only as long as the traffic patterns that were used to generate the signal timing are reasonably constant, and (b) traffic patterns change over time. In order to retime the signal system at the right times, knowledge of the traffic patterns, and their changes, is necessary.
Similar to signal retiming, several applications need timely and unbiased information regarding normal traffic patterns. Currently, the three most widely used methods to identify these patterns are: (a) the simple average of all the available data (called ‘Simple Historic Average’ in this study), (b) historic average with expert intervention, and (c) selection of a normal day through expert judgment. However, all these methods are either significantly biased or time-consuming. Further, a new opportunity is now available to many transportation agencies - in the form of large traffic data archives. And to leverage this opportunity, new data mining tools are needed. In summary, the time is ripe to explore, identify, develop and evaluate better tools for the identification of “normal” traffic patterns from large datasets. Towards this end, this research study explored several theories and algorithms from diverse fields. A new data mining method, the Quantum-Frequency Algorithm, was developed, based on density-based clustering. This algorithm and the median algorithm consistently outperformed the simple historic average in reducing the bias. The Quantum-Frequency algorithm mimics the expert visual judgment, and the tradeoffs concerning its results can be interactively determined by the analyst.
It should be noted that normal traffic is also often referred in literature as background traffic, base case traffic, or recurring, regular, representative, typical, usual, or underlying traffic. In this study, they will be referred as normal traffic patterns. Further, in this research study, a traffic pattern refers to a day-long time-series of traffic volume values at 15-minute intervals. As depicted in Figure 1-1, traffic patterns are often plotted as a scatter plot or a line, with the X-axis representing the time of the day, and the Y-axis showing the traffic volume.

FIGURE 1-1 Illustration of Daily Traffic Pattern
Highway Capacity Manual 2000. Copyright, National Academy of Sciences, Washington, D.C. Exhibit 8-7, p. 8-7. Reproduced with permission of the Transportation Research Board.
The objective of this research is:
To develop and evaluate algorithms for the identification of “normal” traffic pattern from large datasets, with minimal expert intervention.
The scope of this research study is confined to the analysis of univariate traffic volume data, collected by point sensors. As such the archived dataset refers to traffic volume data for a given location and time period.
Motivation for this research arose from the fusion of these three observations:
All these observations are explained in detail in this section.
1.2.1 Existing Information Needs
Several transportation applications, beyond traffic signal retiming mentioned earlier, need the information of traffic patterns, and their trends. Specific applications are outlined here.
· Both operations and operations planning require normal traffic patterns.
o Effective work zone planning requires analysis of large amounts of traffic data to “determine the scenario that has the least impact on the traveling public [FHWA, 2002[2]].” For such a work zone permitting process, many state agencies use QuickZone, a software program for estimating delays caused by work zones. However, QuickZone requires an accurate assessment of the underlying traffic demand (in hourly counts for each day of the week) [MitreTek Systems, 2001[3]].
o The Freeway Management and Operations Handbook (FMOH) (FHWA, 2003[4]) states that “A Transportation Management Center (TMC) requires an accurate real-time monitoring of the freeway’s performance, and how that performance compares to “normal” (using performance measures over time to define “normal”).”
o A traffic incident is defined as “an event that disrupts the “normal” flow of traffic, usually by physical impedance” (FHWA, 2003; PB Farradyne, 2000[5]; TRB, 2006[6]). Precise knowledge of the normal traffic is therefore important for disseminating timely and accurate information to motorists.
o Identification of changes in the traffic patterns is vital for determining the hours (and direction) of operation for managed lanes such as Reversible High-Occupancy Vehicle (RHOV) lanes and shoulder lanes (USDOT – HOV primer[7]; Kuhn et al, 2005[8]).
· Within performance measurement, “normal traffic” is primarily used as a benchmark to assess the outcomes of any program.
o A recent major report (NTOC, 2005[9]) recommends tracking recurring delay for different types of days, during “normal conditions.” Further, “the normal working day” is listed as one of the types of days for measuring system operations performance. However, the report provides no further details on how to identify the “normal day” or the “normal traffic conditions.”
o Another study (TRB, 20066) mentions two metrics for measuring the transportation system reliability: “Abnormal volume – related delay,” and “Return to normal flow time” for incident management. The second metric is identified as both “high” priority and “hard” to implement.
· For traffic micro-simulations, “the task for data collection is to obtain the normal daily mix of vehicles and their characteristics. (FHWA, 2007[10]).” The same guidebook further recommends the use of data from a single day that is “representative of a normal or average day.”
· For the design of the highways, the Highway Capacity Manual (HCM, TRB, 2000[11]) states (page 8-8) “…for urban roadways, a design hour for the repetitive weekday peak periods is common.” Further, “As a general guide, the most repetitive peak volumes may be used for the design of new or upgraded facilities.”
· Traffic patterns and trends are also required for several other applications, such as planning studies (Meyer and Miller, 2001[12]), safety analyses (Luo, 2005[13]), traffic forecasting (Sun et al, 2003[14]; Smith, 1995[15]), traffic data quality analyses (Ishak, 2003[16]; Park et al, 2003[17]), traffic data imputation (Conklin, 2003[18]) and traffic data aggregation (Gajeswki et al, 2000[19]; Qiao, 2003[20])
The above-mentioned application needs are met by several state-of-the-practice algorithms. However, these algorithms suffer from significant drawbacks, as explained in the next sub-section.
Currently, three algorithms are most commonly used for the identification of traffic patterns:
For each of these algorithms, its definition, basis, advantages and drawbacks, and usage in practical applications are presented in this section. The last subsection presents some uncommon methods in the literature.
Definition: In this algorithm, traffic volume data from all the days in the dataset are simply averaged for each time-of-day. The resultant series of traffic volumes, by the time-of-day, is then used as the representative traffic pattern for that dataset. The dataset may contain only one or more specific days-of-the-week (such as, only Mondays, only weekdays, only weekends, or even all days of the week).
Basis: McShane and Crowley (1976[21]) first noted that for different sites, 95% of the 77 weekdays of traffic from urban Toronto fell within a small region around the historic average. The results from one site are presented in Figure 1-1. On the basis of such a low intra-day variance, the HCM indicates the “repeatability of the basic [hourly traffic] pattern” (TRB. 2000).
Advantages and drawbacks: The main advantages of this algorithm are its simplicity and the potential for statistical aggregation. For example, the sum of hourly traffic volumes across the average day gives the average daily traffic. The main drawbacks of this algorithm are (a) the loss of natural variance seen in an actual day of traffic, and (b) the significant bias caused by the presence of incidents, as illustrated in Figure 1-2. This figure presents data from the 22 weekdays in January 2004, from the continuous count station 90272 (3 lanes along Northbound I-395, in Northern Virginia).

FIGURE 1-2 Raw Traffic Data and the Historic Averages
Usage: The historic average is used in various studies on diverse topics such as congestion (Bremmer et al, 2004[22]), improvement of roads and bridges (British Columbia, 2006[23]), port truck traffic evaluation (Giuliano et al., 2008[24]), traffic condition (TTI, 2005[25]), and traffic engineering (Garber and Hoel, 2003[26]; USDOT, 2001[27]). Frequently used statistics, such as the annual average daily traffic (AADT), are based on simple summing and averaging across the entire dataset.
1.2.2.2 Historic Average with Expert Intervention
Definition: In this method also, the historic dataset is averaged by the time of the day. However, before calculating the average, the expert intervenes and removes known traffic abnormalities caused by extraneous factors, such as holidays, major incidents and weather.
Basis: This method is also based on historic average, but accounts for the bias introduced by major incidents.
Advantages and drawbacks: The main advantage of this method is the accounting for bias. The main drawbacks of this method are (a) the loss of natural variance seen in an actual day of traffic, and (b) the need for expert’s time and additional incident data resources. Further, if additional incident data are not fully available, the experts have to visually judge the data. Such judgments vary from one expert to another, and thus introduce a subjective bias. Figure 1-3 also presents a historic average, not considering the holiday (New Years day).
Usage: This method is used in simulation studies (FHWA, 200710), traffic condition monitoring (Turochy and Smith, 2002[28]), and statistical analyses (Rakha and Van Aerde, 1995).
1.2.2.3 Expert Selection of Normal Day(s)
Definition: In this method, an expert selects one or more days as being representative of the dataset.
Basis: This is a further modification of the expert intervention in the previous method, to preserve the natural variance of the daily traffic.
Advantages and drawbacks: The main advantages of this method are (a) accounting for bias, and (b) preserving of natural variance. The main drawbacks of this algorithm are (a) the need for expert’s time, and (b) the subjective bias explained earlier, in the absence of data related to extraneous factors, which is often the case.
Usage: This method is used in congestion studies (Caltrans, 2003[29]), traffic performance monitoring (TTI, 2005), and transportation planning (Nichols, 2007 (unpublished data)).
Several other methods are also employed in the identification of traffic patterns, albeit sparsely. These include k-means clustering (Weijermars and van Berkum, 2006[30]), median of the dataset (Hallenbeck et al, 2003[31]), defining abnormal traffic (really bad days) as “those days when the travel time is twice as long as when traffic is flowing freely” (WSDOT, 2004[32]), and defining recurrent congestion (normal) as “any delay where the occupancy is less than the 1.05 times the median, for each location, day-of-week and time period” (TRB, 20066). These functional definitions do not seem to have any basis outside the mentioned publications, and have not found wide-spread acceptance.
Visser and Molenkamp (2004[33]) succinctly summarize the state-of-the-practice for identification of normal traffic patterns as follows:
“Determining the daily and weekly patterns is a bit of an art, more than a science: results are partly dependent on every individual’s own frame of reference (e.g. Is the Thursday before Easter a regular weekday as far as traffic is concerned?).”
In essence, the existing methods suffer from these three major and undesirable drawbacks (TRB, 2000; FHWA, 2003): time consuming, biased, and/or without scientific basis.
1.2.3 New Opportunity of Data Archives
Determination of traffic flow patterns requires the traffic flow (volume) data from the field. To collect this data, several methods and technologies have long been used (USDOT, 2001; Garber and Hoel, 2003), often involving significant resources (such as human presence or intervention, and equipment needs). However, since the 1990’s, traffic monitoring sensors have been widely deployed as part of the Intelligent Transportation Systems (ITS). The data from these sensors (including volume) is increasingly being archived for offline purposes (Margiotta, 1998[34]; Lomax et al, 2001[35]; Turner, 2001[36]). The Smart Travel Laboratory (STL) at the University of Virginia maintains the ITS data archive for Virginia.
The main advantage with the ITS data archive is the large spatial and temporal coverage, presenting a new opportunity to identify representative and local traffic patterns. Such locally derived traffic information is recommended for usage in analyses, instead of the example values in the HCM (TRB, 2000). However, the inherent challenges with the ITS data archives are (a) the large size, and (b) the numerous and inevitable suspect and missing data records (Conklin, 2003[37]; Turner et al, 2000[38]). To realize the opportunity, and mitigate the undesirable effects of the challenges, sophisticated data mining tools are explored in this research study.
Another important resource is the archives of volume data collected by the continuous count stations for the Highway Performance Monitoring System (HPMS). These stations are spatially sparse, compared to the ITS sensors. However, their data quality has traditionally been excellent. For this reason, the current research study extensively evaluates the algorithms with the HPMS data, and then applies them to the ITS data.
1.3 Desirable Properties of the Ideal Algorithm
As stated in the research objective, “minimal expert intervention” in the identification of “normal traffic pattern” is the main focus of this study. As such, the following two properties are vital for an algorithm to meet the research objective:
· Convergence: The algorithm should converge, providing similar results when applied to the same traffic dataset repeatedly. Further, such convergence should be consistent across diverse datasets.
· Accuracy: Expert judgment is currently the most reliable approach to identify normal traffic patterns from datasets. The algorithm should consistently yield low errors, when compared to expert judgments.
The other desirable properties of the ideal algorithm identified from a research of field requirements include:
· Surrogate Accuracy: It is desirable for the algorithm to consistently recognize known major traffic abnormalities (e.g., major crashes, weather events etc.).
· Data Usage: Traffic datasets often contain partly missing or suspect data records. It is desirable for an ideal algorithm to both (a) maximize the use of good quality data, and (b) minimize the bias from missing or low quality data.
· Degree of Normality: In addition to a time-series normal traffic pattern, it is desirable to obtain a “degree of normality” associated with such a representation.
· Speed: As the algorithm is expected to mainly be used offline, the computation time is not a major constraint. Yet, faster speed is desirable.
Several research contributions resulted from this study. Key contributions include:
1.5 Organization of the Report
The remainder of the report is organized as follows:
Chapter 2 describes the literature review. Chapter 3 explains the 10-step methodology developed to evaluate the candidate algorithms. This methodology is then applied in Chapters 4-7. Chapter 4 defines the normal traffic pattern. Chapter 5 then presents details of the development and extension of the candidate algorithms. Results of the preliminary and detailed evaluation components are then presented and discussed in Chapters 6 and 7 respectively. Finally, Chapter 8 wraps up with the conclusions and recommendations from this research study.
The literature was reviewed to serve two primary purposes: (a) To understand the background theory regarding normal traffic patterns, and (b) To explore candidate algorithms for the identification of these traffic patterns from large datasets. The first section presents an examination of the following five theory topics in detail: (a) the state-of-the-art regarding normal traffic patterns, (b) a theory of patterns, (c) the concept of “normal," (d) the fundamentals of visualization and vision science, and (e) the fundamentals of quantization.
The second section is application oriented, focusing on the algorithms. Several candidate algorithms from diverse fields are first explored for their applicability to this research study. Then, four promising algorithms are selected for extension, application and evaluation in this study. These algorithms are explained in more detail at the end of this chapter.
Most algorithms in the surveyed literature, both within transportation and beyond, focused on “recognizing” known patterns in new datasets. If the normal traffic pattern in a dataset is already known, then there is no need for an algorithm. And a normal traffic pattern is unique to the particular dataset. For these reasons, pattern recognition algorithms are not applicable to this study.
Further, no algorithms for identifying “normal patterns” were encountered in any literature surveyed for this study. Therefore, a need was recognized to better understand, and possibly refine, the objective of this research study. For this reason, literature related to the background theory was reviewed. The five theoretical aspects most relevant to the current research study are presented in this section. These are:
This section focuses on the current knowledge of the existence, definitions, and importance of these traffic patterns. The existence of traffic flow patterns is well documented (Garber and Hoel, 2003; TRB, 2000). Traffic volumes at a location are characterized to be “repetitive and rhythmic.” Research studies further illustrate the repetitive nature of traffic at the daily (24 hour cycle) and weekly (7 day cycle) levels (Williams, 1999[39]; Rakha and Van Aerde, 1995[40]; Weijermars and van Berkum, 2006).
The importance of understanding these traffic patterns and variations are also well documented. Two such reasons quoted by the HCM (TRB, 2000) are:
In spite of the acknowledged existence and the importance of the topic in the reviewed literature, there is no precise definition for a normal traffic pattern. On the contrary, traffic conditions are reported to never be the same from one day to another, or from one season to another (TRB, 2000; TRB, 2006).
Given this background, defining a normal traffic pattern is not trivial. Nevertheless, it is important and necessary for this research study. Towards this end, the theory of patterns and the concept of “normal” were studied in detail, and are presented in the next two sections.
Pattern theory provides the basis for understanding, formalizing, and identifying patterns (Grenander, 1996[41]). According to Grenander (1996),
· The concept of pattern itself presupposes that there is a structure, and that it can be understood.
· A pattern structure is defined through a set of meaningful primitives (features) and rules for manipulating them into formal structures. A set of features is the reduced representation of a dataset, which retains the information necessary for effective classification.
· In formalizing non-deterministic patterns, both the rules for the typical structure and the rules for allowed variability need to be determined. That is, the extent to which variability is allowed within a structure to still enable us to call the various instances as conforming to one pattern.
In summary, a pattern is the conformation of the observations to a set of rules. The only ‘rule’ of interest for this study on traffic is: “normal,” i.e., to find “normal” traffic. The concept of “normal” was therefore studied independently, and the findings are presented in the next section.
2.1.3 Concepts of “Normal” and “Abnormal”
A thorough understanding of the concept of “normal” proved non-trivial. Both within transportation and beyond, the literature review revealed that the concepts “normal” and “abnormal” are often defined as the opposite of each other. For example, “normal traffic conditions” are defined as traffic with “no incident or special events” (NTOC, 2005). And an “incident” is defined as “any non-recurring event that causes a reduction of roadway capacity or an abnormal increase in demand” (FHWA, 2003).
Such circular definitions cannot be translated to methods and algorithms for practical application. A thorough understanding of at least one of these concepts – “normal” and “abnormal” – is vital for the current research study. The literature on both the concepts was reviewed, as presented in the following two sections.
2.1.3.1 “Normal”
The concept of “normal” was found in the literature from diverse fields such as medicine, weather, criminal justice, and English literature. The unique findings from the survey of this body of literature are presented here.
The concept of “normal” occurs frequently in the field of medicine. In an analysis of mental and emotional disorders, Davis and Bradley (2000[42]) explore the various meanings of the word “normal” in detail. In summary, they observe that:
Everyone knows what “normal” is. … And, while the definition varies with the referent, it generally describes some commonly held understanding, a culturally accepted belief about what is typical, usual, and natural.
The term “normal” has widely different meanings in different contexts and how our failure to recognize these differences can have serious consequences. Normal can mean “healthy” (as in normal body temperature), “typical” (as in a normal routine at work), “average” (as in normal winter snowfall, when describing a class or group). It has a specific definition in statistics. In medicine, normal can be taken to be a defined standard, such as normal blood pressure.
In another medical study, Hooton examined “normal” young men (1945[43]). He observes:
[The term “normal”] smacks of the hypothetical human being who is described so exhaustively and ingenuously in textbooks of anatomy and who never turns up in the dissecting room tailored according to specifications. The word means “not deviating from a type or standard,” but what standard?
From the perspective of criminal justice, Senator Moynihan (1993[44]) and Karmen (1994[45]) differ in their interpretation and treatment of crime, but agree about “normal” in important ways. Both authors report that:
The following is the summary of the literature reviewed on the concept of “normal”:
· There is no single, precise, consensual, universal definition of “normal.”
· “Abnormal” is sometimes defined precisely. Thereby, the “normal” observations (as the complement of “abnormal” in the entire dataset) are also precisely identifiable (even if not defined). The concept of “abnormal” is explored in detail in the next section.
· In the absence of stricter definitions, the concept of “normal” is significantly explained by “repetition” and “context.”
As mentioned earlier, the literature is currently devoid of a clear definition of “normal traffic pattern,” which is important to develop algorithms. Therefore, this vital finding is used in chapter 4 towards the development of a new definition for “normal traffic pattern.”
· “Normal” may be hypothetical.
Within transportation, recent efforts have categorized seven sources of congestion, such as incidents, work zones, special events etc. (TRB, 2006). If the complete effect of these sources on traffic is known, the “abnormal” traffic is directly identifiable, in theory. However, currently, complete incident and event data are often not archived (Christens, 2003[46]). Further, even if these archives were complete, the spatial and temporal extents of the effect of abnormalities on traffic would still remain unknown. For these reasons, it is desirable to identify the abnormalities directly from the traffic flow data. With this background, the literature was surveyed for a better understanding of the concept of “abnormal.” The findings of this survey are reported in this section.
In the literature reviewed, “abnormal” observations are referred as “outliers”. Outliers are defined as observations that, in the opinion of the investigator, stand apart from the bulk of the data (Beckman and Cook, 1983[47]; Hawkins, 1980[48]; Ramaswamy et al, 2000[49]). Beckman and Cook (1983) observe further that:
… an outlier is a subjective, post-data concept. Historically, “objective” methods for dealing with outliers were employed only after the outliers were identified through a visual inspection of the data. … In small- to moderate-size univariate samples, a visual inspection of the data is perhaps the most common method of identifying outliers. … outliers must be judged with some model, either implicit or explicit, in mind.
The Handbook of Statistical Methods (NIST, 2003[50]) also reinforces the above observations:
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.
These findings pose significant challenges to successfully define and automatically identify the outliers from the traffic data. For one, a subjective definition cannot be translated into a practical methodology. Secondly, state-of-the-practice reveals that traffic experts already intervene and spend much time on the “visual inspection of the data.” Finally, an “underlying model” implicitly requires the precise definition of the “normal” traffic.
In summary, the following two findings emerge:
1. Outlier definition requires apriori definition of “normal” traffic, and
2. As in transportation, subjective and visual analysis of the data by experts is common in other fields also. Given this popularity and importance, the topic of visual judgment was studied in detail, and the findings are presented in the next two sections.
2.1.4 Vision and Visualization
This section on vision and visualization, and the next section on quantization were studied to understand the visual judgment made by traffic experts. Further, study of these topics provided the basis for understanding and effectively applying the density-based clustering algorithm for the current research study. The application-specific details are therefore presented later. The two sections here explain the theoretical findings from the literature.
Vision and visualization are inter-related, but separate topics. Literature on both the topics was reviewed to understand how humans visually decode information from images and graphs. The six specific findings, relevant and important to this research study, are explained in this section. The first three findings relate to graphical representation, and the concepts of “normal” and “context” discussed in the previous section (2.1.2). The last three findings relate to the use of quantization, which is explained further in sections 2.1.4 and 4.4.4.
1.
Humans perceive structures directly
through visualization, more than with data and differently from rational
reasoning (Plyshyn, 2003[51]; Card
et al, 1999[52]).
This finding is important for two reasons. First, it provides justification for
studying vision and visualization for this research work. Given that experts often
analyze traffic data visually, the state-of-the-art and practice stand to gain
significantly through a thorough understanding of these visual analysis
processes. Secondly, it justifies the presentation of the final analysis
results from this research study in a graph format, along with data format.
2.
The information retrieved through
vision depends on several factors such as the expectation, the importance of
the information, and the functional significance of the information (Palmer,
1999[53];
Plyshyn, 2003).
All these factors provide the context, discussed earlier in the concept of
“normal.” For normal traffic patterns, the expectation is the similarity and
repetition of traffic from one day to another. To meet this expectation in a
graph, densely spaced traffic data from different days provide the necessary
functional significance.
3.
On context: Palmer (1999) states “…
appropriate [visual] context facilitates categorization, whereas inappropriate
context hinders it.” Plyshyn (2003) extends that with “…how we perceive parts
of a scene depends on making some sort of coherent sense of the whole.”
This finding suggests that traffic experts could make mistakes if they relied
solely on the visual decoding of data. Alternately, experts may need more time
to extract information from some datasets.
4.
Visual context is organized
initially at a local level, and then at a global level (Palmer, 1999).
The importance of this finding is that a traffic expert identifies the “normal”
context for smaller time periods first, rather than designating an entire day
of traffic as either normal or abnormal. Such an approach is in line with the
reality, where only a part of a day is often affected by incidents, rather than
the whole day. This finding also supports the scalar quantization presented in
the next section.
5.
The human eye is capable of
identifying details only up to a certain image quality (Bovik, 2000[54]).
This finding also strongly supports quantization of the data towards
determining which parts of which days of traffic are similar.
6.
Categorization within a context
depends on strong contrast – either in color, intensity or size (Kosslyn, 1994[55],
and Resnikoff, 1987[56] -
both, as quoted in Cart et al, 1999).
The application of this finding is explained later, in section 5.2.
Quantization is the process of approximating a signal (either continuous or discrete) to a finite and small discrete set. Quantization effectively condenses the data and abstracts away the details, which may obstruct insights. There are two types of quantization: vector quantization and scalar quantization. Vector quantization is the process of condensing an entire series of numbers (from an N-dimensional hyperspace) to a much smaller dimension of representative features. Vector quantization is extensively used in communications, signal/image processing and digital data compression (Gersho and Gray, 1992[57]; Kohonen, 2001[58]).
For reasons explained later in detail, in section 5.2, this research study uses scalar quantization. In scalar quantization, a large set of scalar data is mapped to a much smaller representative dataset (often referred as a codebook). Two widely used examples of quantization are rounding and truncating. In both cases, a vast range of numbers is represented through a much smaller set.
Mathematically, one definition of an N-point scalar (one-dimensional) quantizer Q is a mapping Q: R g C, where R is the real line and
C ≡ {y1, y2,
y3, … , yN}
R
is the output set or codebook with size | C | = N (Gersho and Gray, 1992).
Section 5.3 explains the details of applying scalar quantization to the traffic volume data in this research study.
In this section, candidate algorithms from the literature are explored towards their application to the current study.
Several algorithms from diverse fields initially exhibited signs of relevance to the current research problem. These fields include pattern recognition and classification (Duda et al., 2000[59]), image processing, data compression (Gersho and Gray, 1992), time series analyses (Brockwell and Davis, 2002[60]), outlier detection, and statistics (Hoaglin et al, 1983[61]). However, further investigation revealed that most of these algorithms suffer from either or both of the following drawbacks:
(a) The algorithm needs apriori information regarding the normal traffic patterns and/or the abnormalities,
(b) The algorithm needs extensive training and fine-tuning for different datasets.
Any algorithm with these drawbacks fails to meet the primary objective of this research study – To minimize expert intervention in the identification of traffic patterns. Table 2-1 presents the results of the preliminary assessment of all these algorithms. Based on this assessment, four algorithms have been identified as promising candidates for the current research study. Details of these four candidate algorithms, as well as their documented advantages and disadvantages, are presented in the subsequent subsections.
TABLE 2-1 Preliminary Assessment of Initial Candidate Algorithms
|
Algorithm |
Needs Apriori Pattern Information |
Needs Fine-Tuning |
Comments |
|
Neural Networks |
Yes |
Yes |
Supervised pattern recognition and classification algorithms require training and fine-tuning for each dataset, which do not meet the objectives of this study. |
|
Bayesian classifiers |
Yes |
Yes |
|
|
Expert system |
Yes |
Yes |
|
|
Genetic algorithm |
Yes |
Yes |
|
|
Hidden Markov Models (HMM) |
Yes |
|
Assumes Markov property for traffic being “normal” or “abnormal”-which is far fetched. |
|
Support Vector Machine (SVM) |
Yes |
|
This algorithm is comparable to Euclidean distance in K-means algorithm. |
|
Principal Component Analysis (PCA) |
Yes |
|
Dimensionality reduction is not important for current study. Further, both PCA and Linear Discriminant Analysis (LDA) assume linear relationship among features. |
|
Fuzzy C-Means |
No |
Unclear |
High complexity. Results are comparable to K-means clustering, as discussed below. |
|
Quad Trees |
No |
Unclear |
This algorithm is comparable to density-based clustering considered for further evaluation. |
|
K-Nearest Neighbors |
Yes |
No |
Parametric/empirical algorithms useful for classification, when several objects are already classified. The traffic abnormalities are not known apriori. |
|
Parzen Windows |
Yes |
No |
|
|
Time Series Analysis |
Yes |
Unclear |
Traffic data is a time series. However, previous research (Williams, 1999) shows explicit traffic pattern and incident identification is necessary before analysis and modeling. Time series data mining (TSDM) also requires clear apriori definition of the events to be mined (Povinelli, 1999[62]). |
|
Outlier Detection, including Incident Detection |
Mostly, Yes |
Yes |
Discussion of these algorithms is presented below. |
|
|
|
|
|
|
K-Means Clustering |
No |
Unclear |
If any, minimal, fine-tuning is expected for these algorithms. These algorithms are therefore selected for further assessment, development, extension, and evaluation in this research study. These algorithms are discussed in detail in the following subsections. |
|
Density-Based Clustering |
No |
Unclear |
|
|
Wavelet Decomposition (prior to clustering) |
Unclear |
Unclear |
|
|
Median |
No |
No |
The median is usually not biased by outliers, unlike the historic average. This algorithm is also therefore selected for further assessment, and is discussed in detail in a following subsection. |
An extended discussion of some of the above algorithms is presented below:
· Fuzzy C-Means clustering has been applied in transportation for travel time forecasting (Park and Rilett, 1998[63]) and ITS data quality assessment (Ishak, 2003[64]). However, Martin et al (2001[65]), investigating incident detection, do not recommend this algorithm as an unsupervised tool, due to the complexity involved. Further, Dave and Krishnapuram (1997[66]) explain that the results of the fuzzy logic clustering algorithm are often similar to the K-Means clustering algorithm. For these reasons, K-means clustering has been selected for further investigation in this study.
· Outlier detection requires knowledge of the underlying distribution of the main body of the data, the number of outliers, and/or the cutoff outlier distance (Ramaswamy et al, 2000; Knorr and Ng, 1998[67]). Empirical definitions such as “3 sigma from mean” used in some studies (Weil et al, 1998[68]) do not necessarily and consistently detect the important outliers. Further, existing outlier detection algorithms do not perform well when the percentage of outliers in the data is large (above 35%) (Kosinski, 1999[69]). However, such levels of outliers often occur within traffic data, and the algorithm results need to be especially trustworthy in such cases.
· An example of outlier detection is the set of traffic incident detection algorithms. These algorithms are not applicable to the current study, because they suffer from the following drawbacks: (1) One research reports only 50% success (Luo, 200613), (2) Incident detection only identifies the start of the incident. The current study requires identification of the end of the incident effect on traffic, and (3) Non-incident events such as work zone effects, weather effects are also important for this study.
The last four algorithms in the above table, selected as candidate algorithms, are discussed in more detail in the following four subsections.
This section and the next (2.2.2) describe different clustering algorithms. The common background, advantages and disadvantages of these algorithms are described here, and the specifics of each algorithm are explained within the respective sections.
By definition, clustering algorithms group (cluster) similar data or entities together (Everitt, 1993[70]). Clustering algorithms are mainly divided into two categories: (a) centroids adjustment algorithms, and (b) hierarchical algorithms. K-means algorithm, described further in this section, is a centroids adjustment algorithm. Hierarchical algorithms have been documented to be unstable and not very accurate (Hastie et al, 2001[71]). Therefore hierarchical algorithms have not been considered further for this study. Section 2.2.2 describes Density-Based Clustering, which is different from the other two popular clustering algorithms.
The main advantages of clustering algorithms are: (a) The clusters need not be known apriori, and (b) These algorithms are quite transparent. The main drawbacks are: (a) The lack of semantic meaning to their outputs (the clusters), and (b) Selection of appropriate parameter values remains more of an art than science.
Assume a dataset with N data vectors. In K-means clustering, a set of K pre-defined or random data vectors are selected as the initial centroids for the K clusters. The distances of all the other (N-K) data vectors in the dataset are evaluated from these cluster centroids. A data vector is attributed to that cluster to whose centroid the data vector is closest (i.e. smallest distance). The new cluster centroids are then calculated based on all the data vectors within each cluster. This process is repeated until the centroids converge (within a small error margin), or halted after a pre-defined maximum number of iterations are performed.
In essence, the K-means clustering algorithm aims at minimizing an objective function. The objective function widely used is a squared error function:
![]()
J is an indicator of the
distance of the n data points from their respective cluster centroids, where
is
the square of the chosen distance measure between a data point
and
the cluster centroid
.
The following parameter inputs have to be provided to the algorithm: K (the number of clusters), calculation of important features, the initial cluster centroids (if not random), the distance measure, the centroid calculation, and the halting procedure. The algorithm will provide the final clusters to which each data vector belongs.
The details of optimizing K are presented in the following subsection.
2.2.1.1 Optimal Number of Clusters (K)
The literature presents some criteria for determining the optimum number of clusters in a given dataset. These include the Silhouette width, the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC) (Hastie et al, 2001; Everitt, 2003). In this research study, the silhouette width criterion has been selected, since the silhouette width also provides a measure of validity for the final cluster memberships. The details of its calculations and interpretation are as follows (UNESCO[72]).
Consider any object i of the data set, and let A denote the cluster to which it is assigned, and then calculate
a(i) = average dissimilarity (or distance) of i to all other objects of A
Now consider any cluster C different from A and define
d(i, C) = average dissimilarity (or distance) of i to all objects of C
Compute d(i, C) for
all clusters C
A,
and then select the smallest of those as
b(i) = min d(i, C) , C
A
Let B denote the cluster which attains the minimum distance (other than A) (i.e., d(i, B) = b(i)). B is called the neighbor of object i. The value s(i) can now be defined as:
s(i) = ![]()
s(i) lies between -1 and +1. The value of s(i) may be interpreted as follows:
TABLE 2-2 Interpretation of Silhouette Widths
|
s(i) = 1 |
‘within cluster’ dissimilarity a(i) is much smaller than the smallest ‘between cluster’ dissimilarity. In other words, object i has been assigned to an appropriate cluster. The second best cluster B is not nearly as close to i as the actual cluster A. |
|
s(i) = 0 |
a(i) and b(i) are approximately equal. Hence, it is not clear whether i should be assigned to A or B. It can be considered as an ‘intermediate’ case. |
|
s(i) = -1 |
Object i is badly classified. When s is close to negative one, the object is poorly classified. Its dissimilarity with other objects in its cluster is much greater than its dissimilarity with objects in another nearest cluster. |
The average silhouette width of all the objects can be used to select the optimal number of clusters. The above procedure is repeated for different values of K. And the optimal K is selected as the one which yields the highest silhouette width. The range of maximum average silhouette widths (SC), and their interpretation is presented in Table 2-3.
TABLE 2-3 Interpretation of Maximum Average Silhouette Widths
|
Range of SC |
Interpretation |
|
0.71-1.0 |
A strong structure has been found |
|
0.51-0.70 |
A reasonable structure has been found |
|
0.26-0.50 |
The structure is weak and could be artificial. Try additional methods of data analysis. |
|
£ 0.25 |
No substantial structure has been found |
2.2.2 Density-Based Clustering
Density-based clustering is a mode-searching algorithm. In this method, clusters are viewed as regions in which the data is dense, separated by regions of low density (Jain and Dubes, 1988[73]). This clustering method is claimed to be appropriate for unimodal datasets, and is used extensively in remote sensing applications.
This algorithm is highly promising to the current research problem, for two reasons:
· First, the density-based clustering is aligned with the earlier findings regarding the definition of “normal.” Section 2.1 identified “repetition” and “context” as the two important aspects of defining “normal.” Density-based clustering identifies the repetitions (mode) in the dataset. Further, the particulars of defining “density” and “low density” (for cluster boundaries) could define the appropriate context.
· Second, the density-based clustering is aligned with the findings of how humans judge data visually, which is a widely used method for identifying normal traffic patterns. Humans view “sparse” data away from the dense clusters as “abnormal.”
The density-based clustering algorithm is extended for this research study as described briefly in this section. Further details of the extension are provided in section 5.3. In summary, for each time of day, traffic flow data from the entire dataset is partitioned by constructing a histogram of non-overlapping cells, called quantums (because of the scalar quantization). The quantums with relatively high “frequency” counts form the modes (and are designated as “normal”), and the cluster boundaries fall along the quantums with low frequency counts. Since the “quantum” size and the “frequency” cut-off define the “normal traffic pattern,” this extension of the density-based clustering is named “Quantum-Frequency Algorithm.”
Inputs for this algorithm are the quantum size and the frequency cut-off information. The output consists of the “normal” cluster and its density (the number of data points within the normal cluster).
Wavelet analysis (or decomposition) consists of breaking a given signal in terms of other small signals (wavelets, i.e. small waves). An example of such an analysis is depicted in Figure 2-1. Such a representation of the original signal in terms of the coefficients of other wavelets significantly reduces the dimension of the data vector without losing information. Within transportation, wavelets have been successfully applied for incident detection ((Karim and Adeli, 2002[74]; Adeli and Karim, 2000[75]; Teng and Qi, 2003[76]), and determining data aggregation intervals (Qiao et al, 2003). Wavelets have therefore been selected in this study as a possible approach to transform the traffic volume data. The details of the transformation are presented below.

FIGURE 2-1 Details of Wavelet Decomposition
* The intensity of the color on the decoded signals stands for the details as well as their coefficients. S is the original signal.
Wavelet transforms use a family of building blocks to represent a signal (Daubechies, 1993[77]). This transformation consists of a mother wavelet and a scaling function. The mother wavelet is a short waveform of limited duration, with zero mean. The scaling function (F) is repeatedly (denoted by the suffix j) applied on the mother wavelet (Y) to obtain a number of wavelets, which form the building blocks (MATLAB, 2004[78]).
The wavelets obtained by
scaling the mother wavelet are additionally translated along the time axis
(denoted by the suffix k), as in Figure 2-1. Each of these wavelets
at
a given scaling is compared to the original signal, by convolving the original
signal with the scaling function. The coefficients
resulting
from such a wavelet transformation essentially explain the original signal in
terms of the wavelets. All the wavelets at a given scaling are then presented
along with their individual coefficients as a detail
.
The relation among these entities is given in equation 1:
Eqn.1
By repeated application of
the scaling function on the mother wavelet, an infinite number of wavelets and
hence details can be obtained from the original signal. Depending on the scope
of the analyses, the initial details are useful. Later details are combined to
form the approximation
of
the signal. The original signal
can
then be succinctly represented by the following equation, as the sum of the
approximation and the lower details.
Eqn.2
where
is
the approximation to original signal, at scale J, defined by
Eqn.3
In summary, the original signal (S) is first decomposed to one detail (D1) and an approximation (A1). This approximation (A1) is then further decomposed to the next level of detail (D2) and approximation (A2), and so on.
In this study, wavelet decomposition is used as a data transformation, before performing clustering.
A major advantage of wavelet analyses is the ability to decompose a signal without losing sight of either the long-term (low-frequency) trend (as approximation) or the details.
The algorithm inputs include the wavelet basis, and the scale (or level) of analysis. The algorithm outputs are different signals – the approximation and the details.
Median is a well-known statistical descriptor of scalar datasets. The median traffic pattern for a dataset will be determined as explained here. The entire dataset will be first considered at each time of day. This subset, containing scalar traffic volume values, will be arranged in an increasing or decreasing order. The central point of this subset is the median for that time of day. In this manner, the median for each time of the day will be determined. The time-series described by the medians at each time of day is the median traffic pattern for the given dataset.
The main advantages of this algorithm are:
· The median is considered an unbiased and a robust measure of central tendency, compared to historic average.
· No specific parameters or fine-tuning is required.
The main disadvantages of this algorithm are:
· Median is an ordered statistic, with high computational burden. However, even one year of weekday traffic data will only have around 250 data points at each time of the day. Therefore, this oft-reported concern is not expected to a major issue for this study.
· If the dataset is influenced by a number of incidents, the median would describe neither the incident traffic pattern, nor the non-incident traffic pattern. The actual effect of this disadvantage will be studied as part of this research study.
This algorithm needs no input other than the dataset. The output from this algorithm is the median traffic pattern.
This chapter captured the various relevant findings from the literature. Both the theoretical background and the practical algorithms were studied. Four candidate algorithms have been selected for further application and evaluation in this research study: (a) K-means clustering, (b) Density-Based Clustering, (c) Wavelet decomposition, and (d) Median. Basic explanations of these algorithms, their inputs and outputs, as well as their known advantages and drawbacks have been documented. The next three chapters will explain the application of these algorithms.
A 10-step methodology has been developed to carry out this research study. As depicted in Figure 3-1, these 10 steps are grouped together into four major components:
a) Definition of normal traffic pattern,
b) Development and extension of candidate algorithms,
c) Preliminary evaluation of candidate algorithms, and
d) Detailed evaluation of select algorithms.

FIGURE 3-1 Research Methodology
An overview of these four components of the methodology is presented in this chapter. In the next four chapters (4-7), the results from this study are presented and discussed.
3.1 Define Normal Traffic Pattern
This step is the first component of the research methodology. In this step, a definition for “normal traffic pattern” is developed. Chapters 1 and 2 documented that a precise definition for “normal traffic pattern” is not available currently. However a clear definition is essential to evaluate the theoretical soundness of candidate algorithms. Consolidating the findings from the literature review of other fields (pattern theory, concepts of “normal” and “abnormal”), and the state-of-the-art and practice, a new definition is developed, and presented in chapter 4.
3.2 Development and Extension of Candidate Algorithms
The candidate algorithms selected from the literature review required further development and extension, to enable their application to the current research problem. For example, several specific details are required for applying wavelet analysis to traffic data. These details include the selection of an appropriate mother wavelet, an appropriate level of analysis, and the specific transformed signal to be used for further analyses. Such details are determined in this step, for each candidate algorithm.
In this study, four candidate algorithms were selected from the literature review, in chapter 2. Details of the development and extension of these algorithms are presented in chapter 5.
The third component of the methodology, preliminary evaluation, consists of 4 steps (3-6). In these steps, the initial candidate algorithms are applied to a number of diverse datasets, and their quantitative performance is measured. The purpose of this component is to further narrow down the candidate algorithms for the detailed evaluation.
Details of steps 5 and 6 are algorithm-dependent. Therefore these details are presented in Chapter 5, along with the development and extension of candidate algorithms. The following two subsections present the details of steps 3 and 4.
Traffic data is the primary ingredient for identifying the normal traffic patterns. Such traffic data is mainly specified by two dimensions: (1) the location (roadway, direction etc.) and (2) the time period (for e.g., January 2005, Weekdays). For each selected location and time period, the preliminary evaluation also requires information on the “known abnormalities.” These known abnormalities provide a surrogate measure of accuracy for the “normal traffic” identified by the candidate algorithms. For this study, select crashes, weather events and holidays from the available databases constitute the known abnormalities.
The following two subsections first explain the process of selecting each dataset, and then present the actual datasets selected for this study.
For specifying a traffic dataset, the two main dimensions noted above (location and time period) are often divided into further factors and levels. Well-known factors and levels include (TRB, 2000; Garber and Hoel, 2003; Williams, 1999):
1. Functional type of roadway (3 levels): Urban, Rural or Recreational route.
2. Corridor type (2 levels): Highways, and Signalized arterials.
3. Day of week (2 levels): Weekdays, and Weekends.
4. Temporal Aggregation (2 levels): 15-minutes and 1 hour.
5. Amount of data (3 levels): Monthly, Seasonal (multiple months), and Yearly.
Further finer factors and levels also exist, such as:
However, even a simple factorial experimental design of just the first set of five major factors results in a prohibitive total of 72 (3*2*2*2*3) distinct combinations. A fractional factorial design is therefore applied in this study, focusing on the following factors and levels: Urban highways, and monthly, weekday, 15-minute data. The main reasons for this focus are as follows:
Based on the above discussion, 16 datasets are selected for the preliminary evaluation in this study. Specific details of all these datasets are presented in Table 3-1 and Figure 3-1, representing the diverse geography, roadways, and traffic conditions across the state. These datasets may be summarized as follows:
The link id represents the “location on the roadway, comprising all the lanes traveling in a given direction.” This research study uses the existing traffic data archives at the Smart Travel Laboratory and the Virginia Department of Transportation (VDOT) central office. These archives include both the continuous count station data of the Traffic Monitoring System (TMS), and the ITS sensor data.
TABLE 3-1 Datasets for Preliminary Evaluation
|
Dataset # |
Link ID |
Link Details |
Month, Year |
Day of week |
|
1, 2 |
090272 |
I-395 North ((Northern Virginia) NOVA) |
Jan, 2004 |
Weekdays and Weekends |
|
3 |
Jun-Aug, 2005 |
Weekdays |
||
|
4, 5 |
150037 |
I-564 South (Hampton Roads) |
Jun-Aug, 2005 |
Weekdays and Weekends |
|
6 |
090212 |
Rt.29 North (Arlington, NOVA-Urban Highway) |
Dec, 2005 |
Weekdays |
|
7 |
080254 |
I-81 North (Northwest Region) |
Jan, 2005 |
Weekdays |
|
8 |
880060 |
I-64 West (Near Waynesboro) |
Dec 2005 |
Weekdays |
|
9 |
040296 |
I-95 North (Central Region) |
Jan, 2005 |
Weekdays |
|
10 |
010148 |
I-77 North (Southwest Region) |
Jan, 2005 |
Weekdays |
|
11, 12 (Rural) |
190093 |
SC 619 West, 1 Lane (Joplin Road, Prince William County, NOVA) |
Jun, 2005 |
Weekdays and Weekends |
|
13, 14 (Rural) |
080264 |
Rt.42 South (Northwest Region) |
Dec, 2005 |
Weekdays and Weekends |
|
15 (data quality) |
150070 |
I-64 West (Hampton Roads) |
Mar, 2005 |
Weekdays |
|
16 (data quality) |
376 |
TMC Detector Station, I-66 West, 3 Lanes (NOVA) |
Jan, 2006 |
Weekdays |
|
|
||||
|
Link Legend |
||||
|
Urban Links |
|
|
|
|
|
|
|
|
||
|
Rural Links |
|
|
||
|
Data Quality Links |
|
|
||
FIGURE 3-1 Links for Preliminary Evaluation
A time-series plot of dataset 1 is presented in Figure 3-2.

FIGURE 3-2 Traffic Dataset 1
For each traffic dataset in Table 3-1, “known abnormalities” in the vicinity of its “location and time period” are identified. The three types of abnormalities used in this study are: select crashes, weather events and holidays. Following the descriptions of these three types of abnormalities, a summary of the type and number of abnormalities for each traffic dataset is presented in Table 3-2.
Major crashes are defined in this study as the ones involving fatalities, tractor-trailers, and/or multiple vehicles (three or more). Only crashes within 5 miles upstream and downstream vicinity of a link were deemed to cause traffic abnormality. Crash data for this study was queried from the VDOT’s HTRIS (Highway Traffic and Roadway Information System) database. No fatal crashes were reported for any of the datasets.
Precipitation is the only weather event considered for the current research study. Previous research (Agarwal et al, 2005[79]) documents that heavy precipitation, in excess of 0.25 inches per hour, adversely affects traffic. Only the weather stations within 10 mile vicinity of a traffic link were deemed relevant. If two weather stations were identified in the vicinity of a link, the closest one is designated as station 1, and the other is designated as station 2. Weather data for this research study was obtained from the National Climatic Data Center (NCDC) website[80].
Finally, a list of holidays and potential “holiday-affected” travel days were obtained from the Traffic Engineering Division (TED) within VDOT.
TABLE 3-2 Summary of Crash, Weather and Holiday Data Streams
|
Dataset
|
Crashes |
Weather Events |
Holidays
|
||
|
Tractor |
3+ Vehicles |
Station 1 |
Station 2 |
||
|
1 |
0 |
4 |
0 |
- |
6 |
|
2 |
0 |
1 |
0 |
- |
4 |
|
3 |
0 |
13 |
6 |
- |
3 |
|
4 |
0 |
2 |
9 |
- |
3 |
|
5 |
0 |
0 |
3 |
- |
2 |
|
6 |
0 |
2 |
0 |
- |
6 |
|
7 |
0 |
0 |
0 |
0 |
9 |
|
8 |
1 |
0 |
2 |
0 |
6 |
|
9 |
2 |
2 |
1 |
- |
9 |
|
10 |
0 |
0 |
0 |
1 |
9 |
|
11 |
0 |
0 |
- |
2 |
0 |
|
12 |
0 |
0 |
- |
0 |
0 |
|
13 |
0 |
0 |
0 |
- |
6 |
|
14 |
0 |
0 |
0 |
- |
3 |
|
15 |
0 |
6 |
1 |
- |
1 |
|
16 |
1 |
6 |
0 |
- |
9 |
Four performance measures were developed for the preliminary evaluation. All the four measures are based on the desirable properties of the ideal algorithm (discussed in section 1.3). The first two are quantitative measures, and the last two are qualitative measures.
The first measure alone is vital for selecting algorithms towards detailed evaluation. The other 3 measures provide a sense of how close the algorithm is to the “ideal.”
Convergence: When applied multiple times to the same dataset, deterministic algorithms will always provide the same results, by definition. Therefore they are deemed consistently convergent. On the other hand, non-deterministic algorithms may provide different results when applied multiple times to the same dataset. In this study, an algorithm will be deemed convergent if for at least 70% of the applications (i.e. 7 or more out of 10 times), it provides the same result for the same dataset. Further, if an algorithm converges for at least 70% of the datasets, it shall be deemed consistently convergent.
Surrogate Accuracy: Some algorithms classify the traffic dataset into “normal” and “abnormal” groups. In such cases, the number of “known abnormalities” correctly identified as the “abnormal” group provides a surrogate accuracy for the algorithm, for that dataset. It is desirable for an algorithm to correctly recognize most known abnormalities. However, recognition of “known abnormalities” is only a surrogate measure of the actual traffic abnormalities, for the following reasons:
· Some “known abnormalities” may not have significantly impacted the traffic.
· Some high-traffic-impact abnormalities may be unknown, due to incomplete documentation.
Data Usage: For this measure, the researcher answers the question – whether an algorithm uses all the available data or not? Traffic datasets often contain partly missing or suspect data records. An algorithm that does not use all the available data will not benefit from the availability of a data archive. It is desirable for an ideal algorithm to both (a) maximize the use of good quality data, and (b) minimize the bias from missing or low quality data.
Degree of Normality: For this measure, the researcher answers the question – whether an algorithm can determine the trust or confidence associated with its time-series normal traffic pattern output? In case of a low value for this measure, the analyst may declare that no “reasonable” normal traffic pattern is definable for that dataset, or decide to analyze another comparable dataset.
Detailed evaluation is the fourth and final component of this methodology, and comprises of the last 4 steps (steps 7-10). In these steps, select highest potential algorithms are rigorously evaluated. The algorithms are applied to a statistically stratified sample of datasets, and their results are directly compared to expert judgments of the same datasets. The main purpose of the detailed evaluation is to determine the accuracy of the algorithms, and thereby select the best algorithm(s), together with their recommended parameter values. The individual steps are detailed in the following sub-sections.
Traffic data is the only data of interest for the detailed evaluation. The challenges and constraints explained in section 3.3.1.1, for step 3 of the methodology, are all valid for this step too. In addition, each dataset selected in this step has to be manually analyzed by experts. And the experts selected for this study were unable to give a large amount of their time to these analyses, owing to their other commitments. Therefore, the number of datasets was selected to maintain a balance between these two tradeoffs: minimizing the time needs on the experts, and preserving statistical validity.
The detailed evaluation continues the same focus established in the preliminary evaluation, namely urban highways, and monthly, weekday, 15-minute data. In addition, a few seasonal (3-months) and yearly datasets are also employed. The five urban highway links selected for this study are presented in Figure 3-3. The link descriptions and the details of each dataset selected for this study are presented in Table 3-2. The “shapes” of the underlying traffic pattern for each link are inherently different, owing to their geographic location and vehicular restrictions.
|
(A) Hampton Roads (HR) Region |
Link Legend
|
|
|
|
150012 (HR) |
|
|
|
290267 (NOVA) |
|
|
|
90114 (NOVA) |
|
|
|
90106 (NOVA) |
|
|
|
190064 (NOVA) |
|
|
(B) Northern Virginia (NOVA) Region |
|
|
FIGURE 3-3 Links for Detailed Evaluation
TABLE 3-2 Datasets for Detailed Evaluation
|
Dataset # |
Link ID |
Link Details |
Year |
Month(s) |
Comments |
|
1 |
290267 |
Route 267 West; 4 Lane Toll Road, with a Parallel 2-Lane Express Facility |
2005 |
Jan |
Monthly data |
|
2 |
Feb |
||||
|
3 |
Mar |
||||
|
4 |
Jan-Mar |
Seasonal |
|||
|
5 |
Jan-Dec |
Yearly |
|||
|
6 |
150012 |
I-664 East; 3 Lanes: Leading to a Bridge/Tunnel |
2006 |
Jan |
Monthly data |
|
7 |
Feb |
||||
|
8 |
Mar |
||||
|
9 |
Jan-Mar |
Seasonal |
|||
|
10 |
Jan-Dec |
Yearly |
|||
|
11 |
90114 |
I-66 East; 4 Lanes: 1 HOV, 2 General Purpose, 1 Shoulder Lane |
2006 |
Jan |
Monthly data |
|
12 |
Feb |
||||
|
13 |
Mar |
||||
|
14 |
Jan-Mar |
Seasonal |
|||
|
15 |
Jan-Dec |
Yearly |
|||
|
16 |
90106 |
I-66 East; 2 Lanes: HOV Restriction in AM peak |
2006 |
May |
Monthly data |
|
17 |
Jun |
||||
|
18 |
Jul |
||||
|
19 |
May-Jul |
Seasonal |
|||
|
20 |
Jan-Dec |
Yearly |
|||
|
21 |
1998 |
Dec |
Different time period |
||
|
22 |
190064 |
I-495, Beltway, South; 4 General Purpose Lanes |
2006 |
Oct |
Monthly data |
|
23 |
Nov |
||||
|
24 |
Dec |
||||
|
25 |
Oct-Dec |
Seasonal |
|||
|
26 |
Jan-Dec |
Yearly |
|||
|
27 |
1998 |
Dec |
Different time period |
The critical performance of interest for the detailed evaluation is algorithm accuracy. In this study, accuracy is determined through the following three widely used error statistics:
For each algorithm and dataset, all the error statistics are first computed by comparing the algorithm result to each expert’s judgment. The final error statistic for the algorithm, for that dataset, is obtained by averaging all these errors. The best algorithm is expected to be consistently accurate, with low error values.
In comparison to other times of the day, the traffic volumes are regularly high during peak periods (both AM and PM peaks). As such, the effects of “abnormalities” on traffic are also expected to be highest during these periods. For this reason, the above error statistics are computed for the entire day, as well as the peak periods (both AM and PM peaks). In this study, AM peak is defined as 5:00 AM to 8:59 AM, and PM peak is defined as 16:00 to 18:59.
Mean error (ME) is an important statistic for biased algorithms. If an algorithm is systematically biased in one direction, then the ME and the MAE values will be numerically same. Even if the ME values are low for an algorithm, if its MAE values are high, then the algorithm is likely producing a number of large random errors. To compare two algorithms with similar ME and MAE values, MAPE is a valuable metric. An algorithm with lower MAPE values is more sensitive to the actual traffic volume values.
3.4.3 Benchmark “Normal Patterns” by Experts
Expert judgment of traffic datasets is the most reliable method available presently for identifying normal traffic patterns. Towards benchmarking the algorithms, experts are first identified in this step. Next, a visual quality assurance is performed on the analysis results of the experts, to eliminate any unintentional errors. Out of the 103 total traffic patterns identified by all the experts, 6 patterns were eliminated in this manner.
Along with the traffic patterns, two other pieces of information are obtained from the experts. First, the time spent by the experts for manual analysis provides a benchmark for the time savings realizable by using automated algorithms. Second, the specific methods used by each expert are documented to provide a reference for future studies.
In this study, four experts were selected and invited to provide the benchmark “normal traffic patterns.” Each expert was asked “to identify one normal traffic pattern spanning the entire day, for each of the 27 datasets in the detailed evaluation.” These experts come from both research and practice, and their expertise lie in diverse application areas noted in section 1.2.1. These experts are nationally acclaimed, and are highly trusted by their peers for their skills. This balanced group of experts is therefore well-suited to provide a sound benchmark for the current study.
· Ms. Cathy McGhee, P.E., is the Acting Associate Director for Systems Operations and Traffic Engineering, at the Virginia Transportation Research Council, Charlottesville, Virginia. She has been a research scientist at the VTRC for almost 15 years. Her expertise includes transportation systems operations, and microscopic traffic simulation. Additionally, she is highly familiar with the traffic and data in Virginia.
· Mr. Keith M. Nichols is a Senior Transportation Engineer, with the Hampton Roads Planning District Commission (HRPDC). His expertise includes transportation planning and data analyses. He is also highly familiar with the traffic and data in Virginia.
· Mr. Ralph L. Jones, P.E., is xxx with the Traffic Engineering Division, at the Virginia Department of Transportation. His expertise includes data collection and archiving, data quality assessment and analyses. Mr. Jones is also highly familiar with the traffic and data in Virginia.
· Mr. Shawn M. Turner, P.E., is a Research Engineer with the Texas Transportation Institute. His expertise includes quality assessment and usage of data archives, and performance measurement. He has performed several research studies that use “normal traffic patterns,” including the calculation of “incident capacity loss,” data quality assessments, and data imputations.
It is worth noting that another internationally acclaimed expert contacted for this research study showed much interest in participating, but later withdrew, as the time required to analyze even one dataset “thoroughly” was significantly large.
In this step, a visual quality assurance is performed on all the expert judgments to identify any unintentional errors. An example of such an error is presented in Figure 3-4. Such errors are excluded from the detailed evaluation of the algorithms. In all these cases, the entire “biased” pattern is excluded from further analysis i.e., in this case, Expert 1’s judgment for dataset 24 is excluded from further analysis.

FIGURE 3-4 Visual Quality Assurance of Expert Analyses
In this chapter, a 10-step methodology was developed to carry out the current research study. Application of this methodology and the ensuing results are the subject of the next 4 chapters (4-7).
CHAPTER 4
DEFINITION OF “NORMAL TRAFFIC PATTERN”
Chapter 3 presented the methodology to carry out this research study. “Defining normal traffic pattern” is the first major component, and also the first step in this methodology. The first section in this chapter presents the drawbacks of the existing definitions. The second section then develops a new definition.
4.1 Drawbacks of Existing Definitions
The literature reviewed for this study (section 2.1.1) revealed that a precise definition of “normal traffic pattern” does not exist currently. Instead, the concept of “normal traffic pattern” currently encompasses a wide spectrum of understanding. One end of the spectrum acknowledges a repetitive, rhythmic day-to-day traffic variation. The other end of the spectrum explains that “no two days of traffic are the same.”
Taking a different approach, many studies define “normal” (underlying, or recurrent) traffic as the opposite of traffic affected by “abnormal” (or non-recurrent, singular) events. With such a definition, the traffic is essentially explained (as normal or abnormal) on the basis of its causal factors such as crashes, unusual or large weather events, work zones, and holidays. Although conceptually appealing, such a definition, based on causal factors, poses two major practical constraints.
First, for a given roadway location, and time of day, several factors affecting traffic remain apparently similar (almost repeating) from one day to another. Examples of such factors include (1) travel behavior of people (similar origin, destination, time of day, and day of week), (2) roadway geometry (3) seasonal weather, (4) vehicle capabilities and sizes, (5) demographics and local economy. In reality, these factors regularly change, even if only in small steps. And these small changes obscure the day-to-day “repetitions” and “rhythm” in traffic.
Second, knowledge of the abnormalities is insufficient to determine the “normal traffic patterns. Two main limitations of this approach are
· Several abnormalities remain unknown or undocumented. An example illustrating this situation is presented in Figure 4-1. The traffic on January 26th and 27th are visibly different from the other days. However, no underlying known causal factor could be associated with these days.
· The spatial extent, the temporal extent, as well as the magnitude of the traffic affected by abnormalities are “not clearly and completely” separable from the underlying, normal, recurrent traffic pattern. An example of this limitation is the challenge in determining the extent of traffic affected by an incident on a side street.

FIGURE 4-1 Illustration of Unknown Traffic Abnormalities
To overcome these above limitations, “normal traffic pattern” needs to be defined on the basis of the available traffic data, rather than the “known abnormalities.” Such a definition is developed in the next section.
4.2 Development of a New Definition
Sections 2.1.2 and 2.1.3 reviewed the concepts of “pattern” and “normal.” From these sections, three main findings are identified as relevant for the development of a new definition of “normal traffic pattern.” These findings are:
1. A pattern is defined by a set of rules that apply to a given dataset.
2. The concept of “normal” is significantly explained by repetition and context.
a. However, no universal constant is known regarding the minimum number of repetitions required to define “normal.”
b. Being “normal” is not a property of an observation, devoid of context. The same observation (in this study, a day of traffic) could be judged as either “normal” or “abnormal,” depending on the context (i.e. the other observations in the dataset).
3. “Normal” may be hypothetical. i.e., Not even one full day of traffic in a dataset may exhibit all the so-called “normal” features (i.e., fully devoid of the effects of incidents and other non-recurring events).
The only “rule” of interest for “normal traffic pattern” is that the traffic should be “normal.” Synthesizing these findings and the information from the previous section, the new definition developed in this study is presented in the following sub-section.
4.2.1 Definition of “Normal Traffic Pattern”
“Normal traffic pattern” of a given traffic dataset is defined as “A representative time-series of the largest subset containing similar traffic data.”
Both the words “largest” and “similar” reflect the two key aspects of “normal”: “repetition” and “context.” These words are not defined further, so as to provide the opportunity for different algorithms and methods to exhibit their individual strengths. A representative of such a subset may be the average of all the data in the subset, or a day that falls fully within the subset. Algorithms are allowed to derive other such interpretations for the word “representative.”
Visual judgment of traffic data by experts is in line with this definition. The expert visually identifies the similarities and the largest group from time-series plots of the entire dataset.
4.2.2 Limitations of the Definition
One rare, but probable limitation of this definition exists. Consider a situation where traffic abnormalities occur frequently, around the same location, and time of day. In such cases, the above definition would identify traffic affected by abnormalities as the “normal traffic.” Two solutions are proposed to overcome this limitation:
1. Using a large parent dataset, and
2. A visual quality assurance by the analyst.
In this chapter, limitations of the existing definitions of “normal traffic pattern” are explained, and a new, enhanced definition is developed.
CHAPTER 5
DEVELOPING AND EXTENDING CANDIDATE ALGORITHMS
In this chapter the second major component of the research methodology is applied. In this step, candidate algorithms are developed and extended. Section 2.2 identified four promising candidate algorithms, namely:
In this study, all these algorithms are compared to the widely-accepted simple historic average algorithm. The application details of all these algorithms are presented in the following sections. The first two algorithms are extended, building on their usage in past studies. However, the reviewed literature makes no mention of applying Density-Based Clustering within transportation. Therefore, Quantum-Frequency Algorithm is fully developed based on the concepts of Density-Based Clustering. Finally, the application details of the median and the simple historic average algorithms are presented together, because of their similarities.
5.1 K-Means Clustering Algorithm
Expanding on the summary presented in section 2.2.1, the specific steps of the K-means clustering algorithm are:
Further details of these steps are presented in the following subsections.
Conceptually, a traffic dataset comprises of two types of data – the “normal” traffic, and the “abnormal” traffic. However, because of their expected singular nature, the “abnormal” traffic data may not fall into one cluster. Therefore, this study will employ three different K values: 2, 3 and 4. The criteria for convergence of results and the process of optimization among different “K” values are presented in section 5.1.6.
5.1.2 Determination of Initial Cluster Centroids
No apriori information is available for narrowing down the selection of initial cluster centroids. Therefore, all the cluster centroids will be randomly selected from the available days of data.
5.1.3 Calculation of Important Features
Pattern recognition and feature extraction tools identify several potential features. Along with the direct use of the time-series traffic data vector, the following two features have been selected for this study:
![]()
i.e., the covariance of a time-series traffic data vector, ands its own lagging sequence (at lag h).
5.1.4 Calculation of Distances among Data Vectors and Centroids
Distances between two data vectors are measurable in various ways (Everitt, 1993). These include the Euclidean distance, Minkowski distance, Manhattan distance, Mahalanobis distance etc. Euclidean distance, or the Cartesian geometric distance between two vectors, has been traditionally used in many studies. Weijermars (2007) has recently used the Euclidean distance measure for clustering traffic data. For these reasons, the Euclidean distance has been selected for this study.
Euclidean distance could be measured directly between two days of data, or between their variances or autocovariances. Since variance and autocovariance are scalars, the Euclidean distance between two days of data, using their variance (or autocovariance), is the absolute difference of the variance (or autocovariance) values. In this study, all the three distances are used:
In this research, the Euclidean distance measure is used both for the untransformed traffic data, as well as the Wavelet-transformed data.
In this study, the centroid of a cluster is calculated as the simple arithmetic mean of all the vectors within that cluster. As such, the centroid of a cluster is the simple historic average of all the days designated to that cluster.
5.1.6 Iteration and Halting Procedure
In this study, the first five steps are iterated 20-50 times, and then halted. Monthly datasets contain a maximum of 23 weekdays, and the cluster memberships are highly unlikely to change beyond 20 iterations. Similarly, 30-50 iterations are deemed sufficient for 3-monthly datasets.
Five other aspects are important for the proper application of K-Means algorithm in this study. These are: (1) Convergence criteria and Normal cluster, (2) Normal pattern, (3) Recognition of known abnormalities, (4) Optimization of K value, and (5) Degree of normality measure. This section details these aspects.
For each K value, the three criteria for declaring convergence are:
1. A minimum of 70% of repeated runs (7 out of 10) of the same algorithm on the same dataset should result in the same clusters.
2. The average silhouette width value should be above 0.51, representing moderate or good clusters.
3. The largest and second largest clusters should be different by at least 10%. Then the largest cluster is declared as the “normal” cluster. For example, in clustering a dataset, if the two largest clusters contain 30 and 29 days of data, then the cluster results will be deemed non-convergent.
A simple arithmetic average of all the days within the “normal” cluster is declared as the “normal pattern” for that dataset. All the days not within the “normal” cluster are deemed abnormal. If a known traffic abnormality (such as a major crash) falls on such an abnormal day, then the clustering is declared to have correctly “recognized the known abnormality.”
If more than one of the three K values applied to a dataset converges, then the optimum K value is determined. The average silhouette width values for different K values are plotted. The knee represents the optimal “K.” An example of such a plot, for the variance feature, is presented in Figure 5-1.

FIGURE 5-1 Decision for Optimum “K” (Variance Feature)
The number of days of data in the “normal” cluster is considered as a measure of the “degree of normality” for the identified “normal” cluster.
5.2 Clustering of Wavelet Decomposed Signals
In this algorithm, traffic data is first transformed into wavelet decomposed signals. This transformed data is then clustered using K-Means clustering algorithm, as explained in the previous section (5.1). In the third step (section 5.1.3), these transformed signals will be used in place of the actual traffic data.
Section 2.2.3 mentioned two important inputs for carrying out wavelet analyses – the mother wavelet, and the scale of analysis. These inputs are examined in the first subsection. Further, a wavelet decomposition of a traffic signal will result in several wavelet signals – one approximation and one or more details. The second subsection explains these signals and determines which signals will be clustered.
5.2.1 Mother Wavelet and Scale of Analysis
Several mother wavelets and examples are available in the reviewed literature. However, no specific guidance is available for the appropriate selection of mother wavelet or the scale of analysis. The reviewed literature simply mention that the mother wavelet and the scale of analysis are domain-dependent, and likely different from the experiences in other domains. However, traffic data analysis using wavelets is still a developing field.
Two recent studies (Xie and Yunlong, 2006[81]; Adeli and Karim, 2005[82]) have used the Daubechies wavelet (DB4) as the mother wavelet, for short-term traffic volume prediction, and incident detection. Another study (Yun et al, 1998[83]) analyzed traffic data using chaos theory. Such chaotic systems are often described using fractal geometry, which is also the basis for Daubechies system of wavelets. For these reasons, Daubechies wavelet (DB4) is selected for this research study.
Further, the above-mentioned studies have analyzed traffic volume data for scales 2, 3 and 4. Therefore the current study will also analyze the traffic data using DB4, to scale 4. From this analysis, one approximation signal (A4) and four detailed signals (D4, D3, D2, and D1) will become available for further clustering. The next section explains these signals and the rationale for selecting them for the current study.
5.2.2 Determination of Signal for Clustering
For a given mother wavelet and the scale of analysis (>1), an approximation and several details become available. On the one hand, the approximation is devoid of several high-frequency fluctuations existing within the original data. Such a smoothed dataset is conducive for the identification of “similarities” among different days of data i.e., the “normal” traffic. On the other hand, the details capture the high frequency fluctuations, and are conducive for the identification of the “differences” among different days of the data, i.e. the “abnormal” traffic. For these reasons, all the details as well as the approximation are explored in this research study for further clustering.
The remaining steps of the K-Means clustering algorithm, such as selection of K etc. will be applied in this algorithm just as with the actual traffic data vectors in the previous section.
5.3 Quantum-Frequency Algorithm
The Quantum-Frequency algorithm (QFA) is a density-based clustering algorithm. In summary, the density of the time-series traffic volume dataset is determined for each time-of-day. The high density data form the clusters, separated by the low density areas, called cluster boundaries. The densest cluster is then declared as the normal cluster. A normal traffic pattern for the entire day is obtained by considering together the normal cluster for each time-of-day.
The five main steps of this algorithm, developed in this study, are:
All these steps, along with their scientific bases and rationale, are explained in the following subsections using an example real-world dataset. This dataset comprises all the 22 weekdays from January 2004, for link 90272. This link is a 3-lane site on I-395 North, near the Virginia-Washington D.C. border.
In this first step, each traffic volume value is first quantized as an independent scalar variable. Quantization is the process of approximating a traffic volume value to its nearest value in a pre-defined set of discrete values. The discrete data set is referred as quantums. These quantums are similar to the bins in a histogram. The quantization is performed solely for the determination of the frequency of occurrence of each volume value, and thereby to determine the highest densities and the clusters.
In this study, the floor function (or step function) is selected for quantization. This function is illustrated in Figure 5-2. According to this function, any traffic volume value between 0 and 59 vehicles/hour/lane is considered as 0; any value between 60 and 119 is considered as 60, etc. The value of 60 vehicles per hour per lane, called the “quantum size” (Q), is selected as an initial value, for three reasons:
a. This value is about 2.5-3% of the expected highway capacity values (2000-2400 vehicles/hour/lane). And a 3% variation in traffic volumes is considered significant for many purposes.
b. Its proximity to 50. The value of 50 provides a rounding and avoids false precision, which is recommended by the Highway Capacity Manual.
c. This value is easily scalable for analyzing data at smaller aggregation intervals, such as 15-minutes or 5-minutes. 60 vehicles/hour/lane scales to 15 vehicles/15-minutes/lane etc.
Scalar quantization is supported by findings in the field of visualization and vision science (findings 4 and 5, in section 2.1.3). First, instead of clustering together entire days of traffic, only the values at each time-of-day are compared across different days. Second, such an abstraction is inherently used by human experts in deciphering information from visual graphs.
Quantization helps this algorithm in two ways: First, it helps in providing a context for each day of data (at each time-of-day), with respect to the other days in the dataset. Second, it helps in the determination of the data distribution and hence density, i.e., the high-frequency areas. Both these aspects are also important for identifying “normal.”

FIGURE 5-2 Floor function for uniform quantization
In this step, there is no actual loss of information. Quantization is used only to find the distribution of the traffic flow (volume) values, in step 2. Once the “normal” cluster is determined in step 4, only the original traffic volume values are used in further calculations.
In this step, the distributions of the entire quantized traffic volume dataset, for each time-of-day, are determined. Each such distribution effectively finds the number of times a quantized volume value occurs. These numbers are here after referred as frequency (of occurrence of that quantized volume value). This frequency distribution describes the density of the dataset.
The frequency distribution of the original traffic volume values (at 13:00), and the corresponding quantized data are presented in Figures 5-3 and 5-4. The next two steps explain how the two distributions are similar, and the quantization aids the algorithm in the clustering process.

FIGURE 5-3 Histogram of Actual Traffic Volumes

FIGURE 5-4 Histogram of Quantized Traffic Volumes
5.3.3 Determination of the Frequency of Insignificance (FI)
This is an important decision step, where the second parameter for this algorithm is defined. In this step, the low cut-off density value is determined. The low-density boundaries separate the high-density clusters. As a surrogate of density, the “frequency of insignificance” (FI) (the low-frequency cut-off value) identifies the boundaries between the clusters. If the frequency of a quantized volume value is less than or equal to this “frequency of insignificance” (FI), such a volume value is ignored as insignificant. Only quantized volume values with non-insignificant frequencies are considered in further calculations to find clusters.
Based on the last observation on visualization and vision science (section 2.1.3), such a cut-off frequency closely mimics the visual judgments by experts. An initial value of “zero” is proposed for this parameter FI. From a visual perspective, a value of “zero” is analogous to identifying two “clusters” of data as foreground information, separating them from a contrasting white background (through both color and contrast). A different value, say, 1 or 2, is analogous to identifying clusters through the color intensity (low vs. high).
Using a value of “zero” for FI, the cluster boundaries for the example dataset, at 13:00 hours, are illustrated as white block arrows in Figure 5-4.
5.3.4 Determination of “Normal” Cluster
In this step, all the neighboring clusters are first coalesced together as a single cluster. Such coalescing is depicted as ovals, in Figure 5-4. Next, the cluster with the highest frequency is declared as the “normal” cluster for that time period. This declaration is depicted as the red oval, in Figure 5-4. Comparing Figure 5-4 to the parent dataset in Figure 5-3, the relative positions of the various days of data have remained unchanged through quantization and density-based cluster determination. Quantization simply makes the “pattern” more pronounced through the definition of densities and the cluster boundaries.
Through the mutually exclusive definitions of normal and abnormal, the quantized volume values outside the red oval bear the label of “abnormal.” This process is performed separately for each time-of-day. All the “normal” clusters from each time-of-day combine to form the “normal” cluster for the entire day. Such a “normal” cluster for the entire day is depicted in Figure 5-5, for specific parameter values (Q=60, and FI=0). In this figure, the “X” axis is the time-of-day, and the “Y” axis is the quantized traffic volume values. For each time-of-day and Q, the frequency value is presented. The yellow shading depicts the “normal” cluster for the entire day.

FIGURE 5-5 A “Normal” Cluster from Quantum-Frequency Algorithm
5.3.5 Determination of Normal Pattern
The normal traffic pattern for the parent dataset is determined by averaging the actual traffic volume data from all the data points that fall within the final “normal” cluster. Depiction of an example parent traffic dataset, the simple historic average, and QFA normal traffic pattern are all presented in Figure 5-6.
Further, a day of data that falls completely within the “normal” cluster is considered as a fully normal day.

FIGURE 5-6 Traffic data, Historic Average, and Normal Pattern from QFA
Four other aspects are important for the proper application of QFA in this study. These are: (1) Convergence criteria, (2) Recognition of known abnormalities, (3) Degree of normality measure, and (4) Optimization of parameters. This section details these aspects.
QFA is a deterministic algorithm, and will yield the same results for the same dataset and parameter values, when applied several times. However, an application of QFA to a particular dataset, with specific parameters, is deemed convergent only if the “normal” cluster contains at least one fully “normal day.”
5.3.6.2 Recognition of Known Abnormalities
All the traffic data values not contained in the “normal” cluster are deemed “abnormal.” Unlike K-Means clustering, QFA may identify some parts of a day as “normal,” and some other parts as “abnormal.” Therefore, the following detailed criteria are developed to decide whether QFA correctly recognized “known abnormalities:”
1. A crash or weather event is considered to be recognized by QFA if at least one 15-minute data point within the first one-hour of the crash or weather event is clustered as abnormal.
2. A holiday is considered to be recognized by QFA if at least 48 data points (half the day) is clustered as “abnormal.”
3. A holiday-affected travel day is considered to be recognized by QFA if at least 24 data points (one quarter of the day) are clustered as abnormal.
The number of traffic volume values within the normal cluster, for each time-of- day, illustrates the “degree of normality” associated with that cluster. A plot of this measure is presented in Figure 5-7. Using the Figures 5-5 and 5-7, a procedure for determining the optimum parameter values (for quantum size and frequency of insignificance) is developed next.

FIGURE 5-7 Degree of Normality for Quantum-Frequency Algorithm
The QFA result for a dataset depends completely on the values of its two parameters – “quantum size” (Q) and “frequency of insignificance” (FI). As such, it is possible to finely optimize the QFA parameter values for each specific dataset. However, the goal of this research study is to find a set of parameter values that are reasonably optimized for most datasets, as a broad guideline for initial parameter values for other datasets. The process of selecting the appropriate values for Q and FI to optimize the QFA results is described in this section.
The main goal of QFA is to find the highest density of traffic volume values across the whole day. To optimize this QFA objective, the following three objective functions are compared for each parameter value combination: (1) Largest normal cluster width (LNCW), (2) Smallest “degree of normality” (SDN), and (3) Average density (AD). These three objective functions are first defined. Next, the process of determining the optimum parameter values is explained.
For a given time-of-day, the “normal cluster width (NCW)” is defined as the difference between the largest and the smallest traffic volume values, within the normal cluster. LNCW represents the largest “normal cluster width (NCW)” across the entire day. For example, at 1300 hours, for the example dataset, NCW is 182 vehicles/hour/lane (1545-1363). The value of LNCW for this dataset and parameter combination is 426 vehicles/hour/lane, occurring at 7:00 AM.
The degree of normality was described in the previous section, for each time-of-day. The smallest “degree of normality” (SDN) is the smallest value of the “degree of normality” across the entire day. Using the example dataset, the “degree of normality” at 1300 hours is 16. The value of SDN for this dataset and parameter combination is 11, occurring at 8:00AM.
For a given time-of-day, the “density” is defined as the ratio of “degree of normality” to “normal cluster width.” Units of “density” are not important. Average density (AD) of the normal cluster is then the arithmetic average of the “density” at each time-of-day. Again using the same example dataset, the “density” at 1300 hours is 0.09. The value of average density (AD) for this dataset and parameter combination is 0.12.
In terms of these objective functions, the goal of optimizing QFA is restated as follows: To determine the parameter value combination which maximizes AD and SDN, and minimizes LNCW. In general, smaller Q and/or larger FI parameter values yield high AD and low LNCW, but low SDN. However, beyond a certain limit, smaller Q and large FI values will not even converge – i.e., the normal cluster will not include even one fully normal day. Complete optimization details and results are presented in the next chapter.
5.4 Median and Simple Historic Average Algorithms
Both these algorithms are based on standard statistical measures, and need no parameters. They are both deterministic, and converge to the same results for a dataset, when applied multiple times. Since these algorithms always converge and other preliminary evaluation metrics are not mandatory, these algorithms are selected directly for the detailed evaluation.
Furthermore, the other preliminary evaluation metrics are not deemed applicable to these algorithms for one main reason: Both the algorithms do not separate a given dataset as “normal” and “abnormal.” In the reviewed literature, the 95% confidence intervals around the arithmetic average and 3 inter-quartile range around the median are sometimes used to identify “extreme outliers” in the dataset. In this study, these “extreme outliers” are not deemed comparable to “abnormal traffic.” Further, the widely-used historic average algorithm to identify traffic pattern does not identify and eliminate “abnormal” data in this manner.
This chapter fully developed and extended the application details of the candidate algorithms from the literature. These developed algorithms are evaluated in the next two chapters.
CHAPTER 6
PRELIMINARY EVALUATION
This chapter explains the preliminary evaluation of the candidate algorithms developed and expanded in the previous chapter. As such, this chapter applies steps 5 and 6 of the research methodology developed in chapter 3. First, for each algorithm, its parameters are initiated and its application explained. Then the evaluation results, along with the parameter optimization, are presented and discussed. Finally, candidate algorithms for the detailed evaluation are selected.
6.1 Parameter Initiation and Algorithm Application
The rationales for selecting initial parameter values for each algorithm were explained in the previous chapter. These initial parameter values are summarized here.
For each application of K-Means Clustering algorithm (both the basic version, and with wavelet analysis), three K values (2, 3, and 4) are evaluated. Two new features are employed in this study: (1) daily variance, and (2) daily autocovariance at 15-minute lag. In addition, the Euclidean distance among the daily traffic data vectors is also employed.
For wavelet based clustering, Daubechies (DB4) mother wavelet is used, and the traffic data are analyzed to 4 levels. All the resulting wavelet-decomposed signals are used for further clustering based on Euclidean distance metric.
For Quantum-Frequency algorithm (QFA), “quantum size” (Q) is initiated at 60 vehicles/hour/lane, and “frequency of insignificance” (FI) is initiated at “0.” Other parameter values evaluated for optimization include Q (40, 50 and 80); and FI (1 and 2).
A Java-based program was developed at the Smart Travel Laboratory for the application of K-means algorithm. Wavelet decomposition was carried out in Matlab. The Quantum-Frequency algorithm was developed in both SAS and also Java. The two programs have different eases of application and graphing capabilities.
Section 3.3 identified sixteen traffic datasets and their associated “known abnormalities” towards preliminary evaluation. To each of these datasets, all the algorithms from the previous chapter, and the parameter combinations from section 6.1, were applied. The results from these applications are presented here.
Table 6-1 first presents the number of datasets (and the percentage) that converged for each algorithm and parameter combination. For QFA, only the first 15 datasets can converge. The 16th dataset does not have even one full day of data. Therefore, this last dataset is not considered in determining the percentage of datasets converged. By definition, the median and historic average algorithms converged for all datasets.
TABLE 6-1 Convergence of Algorithms
|
Algorithm |
Parameter 1 |
Parameter 2 |
Number of datasets converged (percentage) |
|
K-Means Clustering |
Euclidean distance |
K=2 |
4 (25%) |
|
K=3 |
1 (6.3%) |
||
|
K=4 |
0 |
||
|
Variance Feature |
K=2 |
10 (62.5%) |
|
|
K=3 |
2 (12.5%) |
||
|
K=4 |
1 (6.3%) |
||
|
Autocovariance Feature |
K=2 |
5 (31.2%) |
|
|
K=3 |
3 (18.8%) |
||
|
K=4 |
1 (6.3%) |
||
|
Wavelet based K-Means Clustering |
Approximation Signal (A4) |
K=2 |
7 (43.8%) |
|
K=3 |
2 (12.5%) |
||
|
K=4 |
0 |
||
|
Detail Signal (D1) |
K=2 |
0 |
|
|
K=3 |
0 |
||
|
K=4 |
0 |
||
|
Detail Signal (D2) |
K=2 |
0 |
|
|
K=3 |
0 |
||
|
K=4 |
0 |
||
|
Detail Signal (D3) |
K=2 |
1 (6.3%) |
|
|
K=3 |
0 |
||
|
K=4 |
0 |
||
|
Detail Signal (D4) |
|
3 (18.8%) |
|
|
|
0 |
||
|
|
0 |
||
|
QFA |
Q=60 |
FI=0 |
13 (86.7%) |
|
Median |
- |
- |
16 (100%) |
|
Historic Average |
- |
- |
16 (100%) |
Other than the Median and the Historic Average algorithms, only the QFA converged to a sufficient degree. Even so, to demonstrate the use of the second performance measure (surrogate accuracy), this measure for the QFA is presented in the following table 6-2. Both the number of incidents/events identified, and the total are presented in the table.
TABLE 6-2 Surrogate Accuracy of QFA
|
|
Number of incidents/ events identified |
Total incidents/ events |
|
Tractor-trailer incidents |
3 |
4 |
|
Multi-vehicle incidents |
10 |
36 |
|
Weather events |
5 |
25 |
|
Holiday events |
10 |
76 |
One major reason for the apparently low surrogate accuracy of QFA has already been discussed in chapter 4: Known abnormalities may not necessarily result in abnormal traffic. Another potential reason for the low identification of holiday events is that the definition of “identification” for QFA in this study may have been too strict. The direct accuracy of QFA is determined in the detailed evaluation in the next chapter.
6.2.1 Discussion of Evaluation Results
The evaluation results for K-means clustering and QFA algorithms are separately discussed first, followed by a comparative discussion.
All the K-means clustering algorithms, for different parameter combinations, were similar in the following aspects:
· For some weekday datasets, the Fridays were separated from the rest of the weekdays.
· For several weekend datasets, the Saturdays and Sundays were separated.
· For several datasets, the holidays and the non-holidays were successfully separated.
The above findings are in line with another recent study on traffic patterns (Weijermars, 2007). However, these findings were not universal across the parameters or the datasets. The variance and autocovariance features introduced anew in this study performed better than the traditional Euclidean distance metric in their convergence, clustering structure, as well as the surrogate accuracy.
The new wavelet-based clustering demonstrated some potential by identifying strong structures in the underlying data, in comparison with the Euclidean distance algorithm. However, these wavelet algorithms require further development and evaluations, to bring them to maturity.
In the preliminary evaluation, QFA could not successfully analyze two types of datasets: (a) Small datasets, with less than 10 days of data; (b) Datasets with multiple major patterns, such as monthly weekend datasets, which contains Saturdays and Sundays. In addition, the datasets with high inherent variance benefit most from an interactive analysis and optimization of the parameters. For example, dataset 1 is highly variant compared to the other monthly datasets.
For a comparative analysis, Dataset 1 is considered in the following illustrations. Four k-means clustering algorithms (Euclidean distance, Variance, Autocovariance, and Wavelet approximation) identified the same 3 days (01.01.05, 01.26.05, and 01.27.05) as different from the rest. And the average silhouette widths were high in all the four cases, indicating strong clustering structures. Figure 6-1 shows that these three days are indeed different from the rest of the days in the dataset. However, several other days are also different from the rest, especially in the AM peak period.

FIGURE 6-1 Traffic Dataset 1
Figures 6-2 and 6-3 depict the “normal clusters” identified by the QFA and K-Means algorithms. Figure 6-4 presents the degree of normality for the two different “normal” cluster results. The K-means algorithm has an overall high and constant “degree of normality” across the day. However, its normal cluster is also wide in places. On the other hand, the “degree of normality” for QFA varies across the day. For some times of the day, the degree of normality for QFA encompasses the entire dataset, and in two instances, only 9 out of the 21 days are clustered as “normal.”

FIGURE 6-2 Normal Cluster from QFA (60/0)

FIGURE 6-3 Normal Cluster from K-Means Clustering

FIGURE 6-4 Degree of Normality
Figure 6-5 compares the “normal” patterns from K-Means clustering and QFA, along with the median and the simple historic average. For the AM peak, Figures 6-1 and 6-5 demonstrate that the QFA and Median more closely follow the dense region of the parent dataset, compared to the K-means cluster results. The highest difference between the “normal patterns” of K-Means algorithm and QFA is around 50 vehicles/15-minutes/lane, i.e. 200 vehicles/hour/lane. The simple historic average (SHA) performed much worse than even the K-Means clustering algorithm.

FIGURE 6-5 Comparison of Normal Patterns
Based on the preliminary evaluation, the major strengths and weaknesses of the studied algorithms are summarized as follows:
· K-means clustering declares an entire day of traffic as either normal or abnormal. Such designation poses one of these two problems: First, when only a part of a day is affected, the remaining parts may also be rejected as abnormal. Second, if such a day is designated as normal, the final normal cluster would be affected by the abnormal parts. In spite of this drawback, the K-means clustering may still be favorable to the simple historic average, which is always biased by every abnormality in the available data.
· In contrast, QFA neither declares an entire day of data as abnormal, nor does it ignore the bias of every abnormality. Instead, QFA uses every data point that falls within a dense cluster, and ignores the rest.
· The median does not use the actual values of most of the data. And this aspect is its strength with most datasets, as the final pattern is unaffected by the presence of a few outliers. However, if a large number of outliers are present in the dataset, the median algorithm would also yield significantly biased results.
· The first major drawback with all the adaptations of the K-means clustering algorithm in this study is their non-convergence.
· The second major drawback of all the adaptations of the K-means clustering algorithm is the poor clustering structures, even for the datasets when the algorithms converged. i.e. the biggest cluster (designated as “normal” in this study) often contained many days with significant abnormal traffic.
· The main drawbacks of the median algorithm are that neither its surrogate accuracy, nor the degree of normality associated with its output pattern is determinable.
6.3 Optimization of QFA Parameters
In this study, the QFA parameters were initialized as Q=60 and FI=0. Several other parameter combinations were evaluated, towards optimization. As expected, smaller Q and larger FI produced higher AD, and lower LNCW – both of which are desirable. However, they also result in the undesirable effects of smaller SDN and hence lower convergence. For the first 15 datasets, the numbers of datasets that converged under different parameter value combinations are presented in Table 6-3.
TABLE 6-3 Convergence for QFA Parameter Combinations
|
|
Q |
||||
|
40 |
50 |
60 |
80 |
||
|
FI |
0 |
10 |
12 |
13 |
14 |
|
1 |
7 |
7 |
10 |
12 |
|
|
2 |
2 |
5 |
5 |
7 |
|
Only 4 parameter combinations fall above the 70% minimum convergence rate established in this study i.e. (Q, FI) values of (50, 0), (60, 0), (80, 0) and (80, 1). Based on student paired t-tests for all the datasets, differences between (50, 0) and (60, 0) were insignificant for LNCW and SDN. For LNCW, both these combinations were statistically significantly different from (and better than) (80, 0) and (80, 1). For SDN, both these combinations were statistically significantly different from (and worse than) (80, 0). (50, 0) and (80, 1) are not significantly different, but (60, 0) is significantly different from (and better than) (80, 1). For AD, (50, 0) is consistently better than (60, 0), which is better than (80, 0).
Consolidating all these above findings, the optimal QFA parameters are decided to be either (50, 0) or (60, 0), depending on the objective function that is given more weight. For this study, the parameter values (60, 0) will continue to be used for monthly datasets.
Only two seasonal (3-monthly) datasets have been analyzed in this study. Therefore statistical tests to optimize their parameters are not feasible. Keeping the Q value constant (60) from the above analysis, FI values of 0, 1 and 2 are evaluated. These values for both Datasets 3 and 4 are presented in Table 6-4.
TABLE 6-4 Objective Function Values for Seasonal Datasets (3 and 4)
|
FI |
AD |
LNCW |
SDN |
|
0 |
0.59, 1.00 |
198, 171 |
49, 49 |
|
1 |
0.71, 1.20 |
148, 159 |
47, 46 |
|
2 |
0.8, 1.31 |
140, 159 |
44, 45 |
For dataset 3, FI change from 0 to 1 improves the LNCW by 25%. But the improvement is much less from FI=1 to 2. Further, for dataset 4, from FI change from 1 to 2, LNCW does not change, and the SDN slightly decreases. Further, the % improvement in AD from 0 to 1 is higher than from 1 to 2. For these reasons, the parameter combination of (60, 1) is identified as the initial values for seasonal (3-monthly) datasets.
The above optimization provides a starting point for parameter initialization. The interactive nature of the parameter selection and immediate, transparent results from QFA allows the analyst to customize the optimal parameter combination for a given dataset.
Sections 6.2 and 6.3 point out that the K-means algorithms (both directly, and with wavelet transformation) have performed quite poorly in the current study. On the other hand, the Quantum-Frequency algorithm (QFA) provides interactive ability for the expert to identify a deterministic, quantifiable normal cluster, and a normal day of traffic. Further, the median also performed favorably, in comparison to K-Means clustering. Therefore, for this study, the QFA and the median are selected for the detailed evaluation and direct comparison with the simple historic average, in the next chapter.
This Chapter presents the application of the third and final major component of the 10-step methodology developed in Chapter 3. The first three steps of this component - dataset selection, performance measures identification, and expert benchmarking - have all been explained in Chapter 3. In this chapter, these three steps are first summarized, followed by a detailed explanation of the final step – evaluation and parameter optimization.
7.1 Expert Benchmarking Details
Two of the four experts (designated as Experts 1 and 2, i.e. E1 and E2) plotted each dataset by the time of the day. Then they visually identified the abnormal traffic and eliminated them first. Finally, they selected one day from the rest as the representative, normal traffic pattern for that dataset. One of these experts (E1) did not select normal traffic patterns from the annual datasets stating that the datasets are too variant to select just one normal traffic pattern.
The third expert (E3) reviewed each dataset visually, and also with statistics such as the average, and variance. After eliminating the apparently abnormal traffic, the rest of the “normal” data is averaged for each time of the day, to produce the final normal traffic pattern.
The fourth expert (E4) uses a number of documented rules to identify data affected by malfunctioning sensors, data affected by holidays or events. The data is also visually and individually inspected by various analysts within their group, on a daily basis. The abnormal data is then flagged in the archive. Finally, all the data not flagged as abnormal are simply averaged by the time of the day, to obtain the normal traffic pattern.
The first three experts reported that identification of normal traffic pattern took around 10-20 minutes per dataset, with an average of nearly 15 minutes per dataset. The fourth expert could not easily quantify the amount of time taken for the identifying the normal traffic pattern, since the daily analysis could not be easily accounted for.
For the detailed evaluation, 27 datasets have been selected. These datasets include 17 monthly datasets, 5 seasonal (3-monthly) datasets, and 5 annual datasets.
For the monthly and seasonal datasets, the QFA parameters are initiated from the findings of preliminary evaluation as (60, 0) and (60, 1) respectively. By linear interpolation, the initial parameters for the annual datasets are selected as (60, 6). As with the optimization carried out on the monthly datasets in the preliminary evaluation, this chapter optimizes the parameters for the seasonal and annual datasets.
The QFA was executed using the SAS code. Both the simple historic average and the median were computed within Microsoft Excel spreadsheets. All the comparison of results between the algorithms and the expert benchmarks, including plotting of graphs, were also performed in Microsoft Excel spreadsheets.
7.3 Optimization of QFA Parameters
For the monthly datasets, no further optimization was performed. The parameter values recommended from the preliminary evaluation (60, 0) were used mainly. However four of the seventeen monthly datasets (datasets 21, 23, 24 and 27) did not converge for these parameter values. Therefore, the parameter value combination of (80, 0) was used for these datasets.
For both the seasonal and the annual datasets, optimal parameter value selection based on t-tests was not feasible, since only 5 datasets were evaluated. The main purpose of the particular values selected in this study is to provide a broad guideline and initial parameter selection for several different datasets.
Parameter values for the five seasonal (3-monthly) datasets were optimized in the same manner as described in the previous chapter. In addition to (60, 1), the following three parameter value combinations were evaluated in detail: (40, 1), (40, 2), and (60, 2). Dataset 14 did not converge for (40, 2). LNCW was highest for all the datasets, for (60, 1). And the SDN was lowest for all the datasets, for (40, 2). SDN and AD for all the datasets were comparable for (40, 1) and (15, 2). However, LNCW for (40, 1) was consistently smaller for (40, 1) than (60, 2). For these reasons, the parameter value combination (40, 1) is selected in this study for the seasonal datasets.
For annual datasets, comprising 1 full year of data, parameter optimization was carried out in the same manner as for the seasonal datasets. Beyond the parameter value combination of (60, 6) explained above, several other parameter value combinations were evaluated, towards optimization. The other parameter values evaluated were: Q= (40 and 60), and FI = (2, 4, 6, 8, and 10). Of these 10 combinations evaluated, (60, 8) is selected for the current study, because of its yielding of higher AD, smaller LNCW and higher SDN values.
As explained in section 3.4.2, the normal traffic pattern identified by each algorithm, for each dataset, is individually compared to the benchmark established by each expert. Then the following two error statistics are computed: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE), for three periods of the day: the entire day, the AM peak period and the PM peak period. Graphs of these error metrics are presented here in Figures 7-1 to 7-9.

FIGURE 7-1 AM Peak – Mean Errors

FIGURE 7-2 AM Peak – Mean Absolute Errors (MAE)

FIGURE 7-3 AM Peak – Mean Absolute Percentage Errors (MAPE)

FIGURE 7-4 PM Peak – Mean Errors

FIGURE 7-5 PM Peak – Mean Absolute Errors (MAE)

FIGURE 7-6 PM Peak – Mean Absolute Percentage Errors (MAPE)

FIGURE 7-7 Full Day – Mean Errors (ME)

FIGURE 7-8 Full Day – Mean Absolute Errors (MAE)

FIGURE 7-9 Full Day – Mean Absolute Percentage Errors (MAPE)
All the above graphs are discussed in detail in the next section.
7.4.1 Discussion of Evaluation Results
In this section, the evaluation results are discussed from three perspectives: (1) Based on a descriptive analysis of the results, (2) Based on the statistical student t-tests, and (3) Based on a qualitative analysis of the algorithms.
All the 9 graphs plotted in the previous section show that the Median and QFA were consistently more accurate than the historic average algorithm – across the error metrics, datasets and time periods of the day. These differences are more pronounced for the Peak periods, than the Full day. The low volume periods in the evenings and nights seem to be reducing the average error across the full day. The importance of observing the MAPE values is discernable for all the three time periods considered, especially for datasets 6-8. A simultaneous observation of both the ME and MAE values, for all the three periods of the day (but most pronounced for the peak periods), points out that the HA is significantly biased by abnormal events, whereas the errors for the other two algorithms are not systematic (their ME values are closer to zero, even when the MAE values are higher).
Further descriptive statistics were also calculated for a qualitative assessment of the results. Of the 81 cases (27 datasets * 3 time periods) investigated, the errors (MAE and MAPE) for QFA and Median were within 10% difference of each other, in 51 cases. The HA error was within 10% difference to both QFA and Median in only 13 cases. The highest difference between the Median and QFA error was about 30-40%, in a total of 6 out of 81 cases. In 2 of these cases, QFA was better than the Median, and in the other 4 cases, the Median was better than the QFA. However, the HA error was 30% or more different from either the Median or the QFA in 49 of the 81 cases. In 18 of these 49 cases, the HA was more than 100% different from either the Median or the QFA. For all but one case (Dataset 13, PM peak), the HA was outperformed by both the QFA and the Median.
Paired t-tests were performed to compare the MAE and MAPE for all the three algorithms, for all the datasets. For both the AM peak and the PM peak, both the MAE and MAPE were not statistically significantly different between QFA and Median. But both of these were different from the Historic Average (HA). For the entire day, the Median slightly outperformed the QFA results, and both these algorithms were again significantly different from the Historic average.
The simple historic average was consistently and significantly biased, mainly for the AM and PM peak periods, wherein it is often more important to track the traffic patterns more accurately. An example of this bias is presented in Figures 7-10 and 7-11. For dataset 27, in Figure 7-10, the historic average is significantly lower than the other identified patterns, mainly in the peak periods. The AM peak period, from 7:00 to 10:00, is magnified in Figure 7-11. For this period, the historic average is nearly 200 vehicles/hour/lane lower than the other patterns for the entire period of 3 hours.

FIGURE 7-10 Illustration of Bias from Historic Average

FIGURE 7-11 Illustration of Bias from Historic Average (AM Peak)
The median is computationally simple, in comparison with QFA. For most datasets, where the number of abnormalities is not large, the median closely tracked the benchmark provided by the experts. However, the QFA has three main advantages over the median:
7.4.2 Final Algorithm Selection
Based on the results presented in the above sections, both the median and the QFA are recommended, over the widely used historic average. The median is a useful tool for datasets with few abnormalities. The QFA is a more sophisticated tool, applicable to many other datasets.
In this chapter, all the details and results of the detailed evaluation were presented. Based on these results, the Median and QFA are recommended for future practical applications.
CHAPTER 8
CONCLUSIONS AND RECOMMENDATIONS
This chapter lists the significant findings and conclusions from this study, the contributions of this study, as well as recommendations for future research.
Key findings from this research are listed here.
· There is no single, accepted definition for the concept “normal.” This study found that the designation of “normal” depends on two aspects: (a) the context, and (b) the frequency of occurrence (repetitions).
· The following conclusions and findings pertain to the K-means clustering:
o For all the evaluated features, this algorithm showed significant convergence problems. Even when the algorithm converged for a dataset, the largest cluster contained several visible abnormalities.
o For all the evaluated features, the K-means clustering often, but not always, successfully separated the different days of the week and the holidays from the rest of the dataset.
o The two new K-means clustering features for traffic data (variance and autocovariance) converged more than the widely used Euclidean distance feature.
o Wavelet-based clustering converged fewer times than the basic Euclidean-distance clustering. However the wavelet-based clustering identified stronger cluster structures.
· Three algorithms – Historic Average, Quantum-Frequency Algorithm and the Median – were evaluated in detail, through application to several real-world datasets. Both the Median and the Quantum-Frequency Algorithm consistently out-performed the simple historic average for all the three time periods considered, and for both the error metrics considered.
· The historic average is significantly biased, as anticipated. This study quantified this bias, for the first time.
· The following findings about normal traffic patterns follow from the analyses in this study:
o Context-dependence of “normal”: The normal traffic pattern of a dataset may be different from that of its subset or a superset. I.e. The normal traffic pattern for the first week of January, for the entire month of January, and for the entire year may all be different.
o A given traffic dataset may not have a normal pattern.
o The normal range for a dataset is often wider than that for a small subset (for example, a year and a month).
8.2 Contributions of the Research
The major contributions resulting from this research study are listed here:
· The concept of “normal” is addressed thoroughly, bringing forth several clarifications that were not previously addressed. Thus a theoretical foundation is now established for defining normal traffic patterns directly from large datasets. Based on this foundation, a detailed definition of “normal traffic pattern” is developed in this study as the “largest group of similar days of traffic.”
· A comprehensive 10-step methodology for evaluating algorithms that identify normal traffic patterns was developed. This methodology is available for professionals and researchers to evaluate further algorithms in future.
· The Quantum-Frequency Algorithm (QFA) – developed, applied and evaluated in this study – is available for ready application to other datasets. QFA is based on the density-based clustering, and closely mimics the expert visual judgment.
· This is the first study to quantify the bias of the widely used simple historic average algorithm.
· Two new K-means clustering features were proposed and applied for traffic data (variance and autocovariance), both of which converged more than the widely used Euclidean distance feature.
· Several application details regarding wavelet-based clustering was raised and solved in this research study.
· The principles, algorithm, observations and findings from this research study extend beyond “normal” traffic patterns. For example, they can be extended to other “normal” data series such as traffic speeds, travel times etc.
8.3 Recommendations for Future Research
The current research study could be extended in several directions. Major potential ideas for future research include the following.
· Recommendations for extending the Quantum-Frequency Algorithm include the following:
o This research study employed a constant quantum size for each time of the day. Instead, variable quantum sizes may yield results further similar to the experts’ judgments.
o In this study, for each time of the day, the traffic volume data is directly compared from one day to another. In addition to this direct comparison, exploring the relationship among 2 or more successive data points may further enrich the final patterns, especially for data at aggregation intervals smaller than 15 minutes.
o In its current form, the Quantum-Frequency Algorithm identifies only one main, normal traffic pattern from the dataset. The algorithm could potentially be enhanced to identify situations when a dataset exhibits no pattern, or multiple patterns.
o The algorithm may also be adapted to multivariate traffic data, such as volumes, speeds, and occupancies together; or different classes of vehicles together.
· Further studies are required to evaluate both the wavelet-based clustering and variance-based clustering, both of which seem more promising than the traditional Euclidean distance based K-means clustering.
· Entropy theory from information sciences explains the information content in a given dataset. This theory may provide further basis for optimizing the QFA parameters or other algorithms for identifying normal traffic patterns.
· Extension of research to other traffic datasets
o This study analyzed only traffic volume data. Similar algorithms and methodologies could be applied and evaluated for the identification of “normal” patterns for other transportation data, such as speeds, travel times, or crash data (spatial/temporal clustering).
o In this study, the traffic volumes were analyzed at 15-minute aggregation intervals. Many TMCs collect data at lower aggregation intervals (such as 1-minute) and may be interested in analyzing the data at that level also. This avenue could be explored in future.
· Application of the identified normal traffic patterns: This study focused on the first step of identifying the normal traffic patterns from the traffic datasets, with the final objective of using them in applications such as work zone permitting, incident detection, event detection (such as weather) etc. Actual adaptation and use of the normal traffic patterns to these applications need to be studied in future.
· Pattern-Based Data Archive Model: Archiving is the process of storing traffic data for non-real-time uses such as performance measurement. Such archives depend on data models, which define what data to store, and how (tables, columns etc.). Currently, every single data point from every sensor is archived as a traffic data log. Instead, a different model could be developed based on the results of this study. Since many days of data are similar to each other – forming the normal traffic patterns – the data archive size could be significantly reduced by archiving simply the representative, normal traffic patterns and all the abnormal data alone, separately. These two datasets could then be fused together during querying. Potential benefits of this approach include: storage space savings, reduced database maintenance needs, and faster query speeds.
[1] Sunkari, S., The Benefits of Retiming Traffic Signals, ITE Journal, Vol. 74, No.4, Institute of Transportation Engineers, Washington, DC, April 2004
[2] Federal Highway Administration (FHWA), Intelligent Transportation Systems in Work Zones: A cross-cutting study, November 2002
[3] MitreTek Systems. User Guide: QuickZone Delay Estimation Program, Version 0.99. March, 2001.
[4] FHWA, Freeway Management and Operations Handbook, Final Report, September 2003
[5] PB Farradyne, Traffic Incident management Handbook, Prepared for Federal Highway Administration, Office of Travel Management, November 2000
[6] Transportation Research Board (TRB), Guide to Effective Freeway Performance Measurement: Final Report and Guidebook, NCHRP web document 97, August 2006
[7] USDOT, Maximizing the Benefits of HOV Facilities: Reassessing Lane Eligibility and Hours of Operation, A primer, http://hovpfs.ops.fhwa.dot.gov/cfprojects/uploaded_files/hov_primer_final.pdf
[8] Kuhn, B., Goodin, G. et al. Managed Lanes Handbook, October 2005 http://managed-lanes.tamu.edu/products/reports/0-4160-24.pdf
[9] National Transportation Operations Coalition (NTOC), Performance Measurement Initiative Final Report, July 2005
[10] FHWA, Traffic Analysis Toolbox Volume IV: Guidelines for Applying CORSIM Microsimulation Modeling Software, January 2007 http://ops.fhwa.dot.gov/trafficanalysistools/tat_vol4/vol4_guidelines.pdf
[11] Transportation Research Board. Highway Capacity Manual. TRB, National Research Council, Washington D.C., 2000
[12] Meyer, M. D., and Miller, E., Urban Transportation Planning, McGraw-Hill, 2001
[13] Luo, L., Identification of traffic patterns associated with crashes on Interstate Highways, Masters Thesis, University of Virginia, January 2006
[14] Sun, H., Liu, H. X., Xiao, H., He, R. R., and Ran, B., Short-Term Traffic Forecasting Using the Local Linear Regression Model, 82nd Annual Meeting of Transportation Research Board, National Research Council, Washington D.C., January 2003
[15] Smith, B.L., Forecasting Freeway Traffic Flow for Intelligent Transportation Systems Application, PhD Dissertation, University of Virginia, May 1995
[16] Ishak, S., Quantifying the Uncertainties of Freeway Detector Observations using Fuzzy-Clustering Approach, 82nd Transportation Research Board Annual Meeting, National Research Council, January 2003
[17] Park, E. S., Turner, S., and Spiegelman, C.H., Empirical Approaches to Outlier Detection in ITS Data, 82nd Transportation Research Board Annual Meeting CD ROM, January 2003
[18] Conklin, J. H., Data imputation strategies for transportation management systems, Masters Engineering Thesis, University of Virginia, 2003
[19] Gajewski, B., Turner, S., Eisele, W., and Spiegelman, C. H., Intelligent Transportation Systems Data Archiving: Statistical Techniques for Determining Optimal Aggregation Widths for Inductive Loop Detector Speed Data, Transportation Research Record 1719, TRB, National Research Council, Washington D.C., 2000
[20] Qiao, F., Wang, X., and Yu, L., Optimizing Aggregation Level for ITS data based on Wavelet Decomposition, 82nd Transportation Research Board Annual Meeting, National Research Council, January 2003
[21] McShane, W.R., and K.W. Crowley, Regularity of Some Detector-Observed Arterial Traffic Volume Characteristics, Flow Theory, Transportation Research Record 596, Transportation Research Board, National Research Council, Washington D.C., 1976
[22] Bremmer, D., Cotton, K.C., Cotey, D., Prestrud, C.E., and Westby, G., Measuring Congestion: Learning from Operational Data, Transportation Research Record 1895, TRB, National Research Council, Washington D.C., 2004
[23] British Columbia, Improving Roads and Bridges for people, goods and transit throughout Greater Vancouver, Program Definition Report Summary, January 2006
[24] Giuliano, G., O’Brien, T.J., Zhou, J., and Tan, L., Impacts of Port Gate Operations on the Highway System: Case Study, 87th Transportation Research Board Annual Meeting, National Research Council, January 2008
[25] TTI, Freeway Traffic Conditions and Trends in the Phoenix Region, 2004 – Final Report, June 2005
[26] Garber, N. J. and L. A. Hoel. Traffic and Highway Engineering. Revised Second Edition, PWS Publishing, USA, 2003
[27] USDOT, Traffic Monitoring Guide, May 2001
[28] Turochy, R.E., and B.L. Smith. Alternative Approaches to Condition Monitoring in Freeway Management Systems. Final Report VTRC 02-R8. Virginia Transportation Research Council. Charlottesville, Virginia. January, 2002.
[29] Caltrans, 2002 HICOMP report, November 2003
[30] Weijermars, W., and E. van Berkum. Analyzing highway flow patterns using cluster analysis. ITSC, 2006.
[31] Hallenbeck, M., Ishimaru, J.M., and Nee, J., Measurement of Recurring Versus Non-Recurring Congestion: Technical Report, Report No.568.1, October 2003
[32] Washington Department of Transportation (WSDOT), Measures, Markers and Mileposts: The Gray Notebook for the quarter ending September 30, 2004
[33] Visser, J., and L. Molenkamp. Vulnerability quick scan of a national road network. The Second International Symposium on Transportation Network Reliability (INSTR), Christchurch, New Zealand, August, 2004.
[34] Margiotta, R., ITS as a Data Resource: Preliminary Requirements for a User Service, SAIC, April 1998
[35] Lomax, T., Turner, S., Margiotta, R., Monitoring Urban Roadways: Using Archived Operations Data for Reliability and Mobility Measurement, submitted to Federal Highway Administration, May 31, 2001
[36] Turner, S. M., Guidelines for Developing ITS Data Archiving Systems, Report 2127-3, Texas Transportation Institute, September 2001
[37] Conklin, J. H., Data imputation strategies for transportation management systems, Masters Engineering Thesis, University of Virginia, 2003
[38] Turner, S., Albert, L., Gajewski, B., and Eisele, W., Archived Intelligent Transportation System Data Quality, Transportation Research Record 1719, TRB, National Research Council, Washington, D. C. 2000
[39] Williams, B., Modeling and Forecasting Vehicular Traffic Flow as a Seasonal Stochastic Time Series Process, PhD dissertation, University of Virginia, 1999
[40] Rakha, H., and Van Aerde, M., Statistical Analysis of Day-to-Day Variations in Real-Time Traffic Flow Data, Transportation Research Record 1510, Transportation Research Board, National Research Council, Washington D.C., 1995
[41] Grenander, U. Elements of Pattern Theory. Johns Hopkins University Press, Baltimore, 1996
[42] Davis, P.V., and J.G. Bradley. The Meaning of Normal. In What’s Normal. Narratives of Mental & Emotional Disorders. Edited by Carol Donley & Sheryl Buckley. The Kent State University Press, Kent, Ohio, 2000.
[43] Hooton, E.A. “Young Man, You are Normal,” Findings from a Study of Students. G.P. Putnam’s Sons, New York, 1945.
[44] Moynihan, D.P., "Defining deviancy down: How we've become accustomed to alarming levels of crime and destructive behavior," The American Scholar, 1993 (Winter)
[45] Karmen, A., "Defining Deviancy Down": How Senator Moynihan's Misleading Phrase About Criminal Justice Is Rapidly Being Incorporated Into Popular Culture, Journal of Criminal Justice and Popular Culture, 2(5) (1994) 99-112
[46] Christens, P. Statistical Modeling of Traffic Safety Development. PhD Dissertation, Danish Transport Research Center and Informatics and Mathematical Modeling, March 31, 2003
[47] Beckman, R.J., and R.D. Cook, Outlier………..s, Technometrics, Vol.25, No. 2, May 1983
[48] Hawkins, D.M., Identification of Outliers, Chapman and Hall, New York, 1980
[49] Ramaswamy, S., R. Rastogi, and K. Shim, Efficient Algorithms for Mining Outliers from Large Data Sets, MOD 2000, ACM, 2000
[50] National Institute of Standards and Technology (NIST)/SEMATECH, e-Handbook of Statistical Methods, 2003, http://www.itl.nist.gov/div898/handbook/
[51] Plyshyn, Z., Seeing and Visualizing: It’s not what you think, The MIT Press, Cambridge, MA, 2003
[52] Card, S.K., J.D. Mackinlay, and B. Shneiderman, Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann Publishers, San Francisco, 1999
[53] Palmer, S.E., Vision Science: Photons to Phenomenology, The MIT Press, Cambridge, MA, 1999
[54] Bovik, A., Handbook of Image and Video Processing. Academic Press, 2000
[55] Kosslyn, S.M., Image and Brain: The Resolution of the Imagery Debate, MIT Press, Cambridge, MA, 1994
[56] Resnikoff, H.L., The Illusion of Reality, Springer Verlag, New York, 1987
[57] Gersho, A., and R.M. Gray. Vector Quantization and Signal Compression. Kluwer Academic Publishers, Boston, 1992
[58] Kohonen, T. Self-Organizing Maps. 3rd Edition. Springer-Verlag, Berlin, 2001
[59] Duda, R.O., P.E. Hart, and D.G. Stork, Pattern Classification, 2nd Edition, Wiley-Interscience, 2000
[60] Brockwell, P. J. and R. A. Davis. Introduction to Time Series and Forecasting. 2nd Ed, Springer-Verlag New York, Inc., 2002
[61] Hoaglin, D.C., F. Mosteller, and J.W. Tukey, Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, Inc., New York, 1983
[62] Povinelli, R.J., Time Series Data Mining: Identifying Temporal Patterns for Characterization and Prediction of Time Series Events, PhD Dissertation, University of Wisconsin, Milwaukee, 1999
[63] Park, D. and L.R. Rilett, Forecasting multiple-period freeway link travel times using modular neural networks. Transportation Research Record 1617, Transportation Research Board, National Research Council, Washington D.C.,1998
[64] Ishak, S., Quantifying the Uncertainties of Freeway Detector Observations using Fuzzy-Clustering Approach, 82nd Transportation Research Board Annual Meeting, National Research Council, January 2003
[65] Martin, P. T., Perrin, J., and Hansen, B., Incident Detection Algorithm Evaluation, University of Utah, March 2001, Last accessed: 01-29-04 http://www.ndsu.nodak.edu/ndsu/ugpti/MPC_Pubs/html/MPC01-122.html
[66] Dave, R. N. and R. Krishnapuram. Robust Clustering Methods: A Unified View. IEEE Transactions on Fuzzy Systems. Vol. 5, No.2, May 1997
[67] Knorr, E., and R. Ng, Algorithms for mining distance-based outliers in large datasets. In Proceedings of the VLDB Conference, pages 392-403, New York, September 1998
[68] Weil, R., J. Wootton, and A. Garcia-Ortiz. Traffic Incident Detection: Sensors and Algorithms. Mathematical and Computer Modelling. Vol. 27, No. 9-11, pp. 257-291, 1998
[69] Kosinski, A. S. A procedure for the detection of multivariate outliers. Computational Statistics & Data Analysis 29, 1999, pp 145-161
[70] Everitt, B S. Cluster Analysis. 3rd Ed, Edward Arnold. 1993
[71] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. 2001
[72] UNESCO, Statistical Guide for Partitioning Around Medoids, Section 7.1.1, http://www.unesco.org/webworld/idams/advguide/Chapt7_1_1.htm, Last Accessed: 05/01/2008
[73] Jain, A.K., and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, New Jersey, 1988
[74] Karim, A., and H. Adeli. Incident Detection Algorithm using Wavelet Energy Representation of Traffic Patterns. Journal of Transportation Engineering, ASCE, May/June 2002
[75] Adeli, H., and A. Karim. Fuzzy-Wavelet RBFNN Model for Freeway Incident Detection. Journal of Transportation Engineering, ASCE, November/December 2000