Analyze Data
This tool helps to analyze datasets by their features.
General usage
$ hwrt analyze_data --help
usage: hwrt analyze_data [-h] [-d FILE] [-f]

optional arguments:
  -h, --help            show this help message and exit
  -d FILE, --handwriting_datasets FILE
                        where are the pickled handwriting_datasets?
  -f, --features        analyze features
Plug-in System
The tool can be extended with plug-ins. To do so, edit the configuration file ~/.hwrtrc. The following two entries are important:
data_analyzation_plugins: /home/moose/Desktop/da.py
data_analyzation_queue:
  - TrainingCount:
    - filename: trainingcount.csv
  - Creator: null
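Each queue entry names a class and, optionally, a list of keyword arguments for it; null means the class is used with its defaults. The following is a minimal sketch of how such a queue could be turned into objects. It is not hwrt's actual loader: build_queue and the two stand-in classes are hypothetical, and the queue is written as the Python structure a YAML parser would be expected to produce.

```python
# Hypothetical sketch: NOT hwrt's loader, only an illustration of the idea.


class TrainingCount:
    """Stand-in for a plug-in class that accepts a filename."""

    def __init__(self, filename="trainingcount.csv"):
        self.filename = filename


class Creator:
    """Stand-in for a metric that is configured with null (defaults only)."""

    def __init__(self, filename="creator.csv"):
        self.filename = filename


# The data_analyzation_queue entry as it would look after YAML parsing.
queue = [
    {"TrainingCount": [{"filename": "trainingcount.csv"}]},
    {"Creator": None},
]


def build_queue(queue, available):
    """Instantiate one object per queue entry, preserving the order."""
    objects = []
    for entry in queue:
        for class_name, params in entry.items():
            kwargs = {}
            for param in params or []:  # None means "use the defaults"
                kwargs.update(param)
            objects.append(available[class_name](**kwargs))
    return objects


metrics = build_queue(
    queue, {"TrainingCount": TrainingCount, "Creator": Creator}
)
for metric in metrics:
    print(type(metric).__name__, metric.filename)
```

Looking classes up in an explicit dictionary keeps the sketch self-contained; a real loader would additionally have to import the file named by data_analyzation_plugins.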
The value of data_analyzation_plugins indicates where the file with self-written data analysis classes is located. It could look like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import time
from collections import defaultdict

# hwrt modules
from hwrt import HandwrittenData
from hwrt import utils
from hwrt import data_analyzation_metrics
from hwrt import geometry


class TrainingCount(object):
    """Analyze how many training examples exist for each symbol."""

    def __init__(self, filename="creator.csv"):
        self.filename = data_analyzation_metrics.prepare_file(filename)

    def __repr__(self):
        return "TrainingCount(%s)" % self.filename

    def __str__(self):
        return "TrainingCount(%s)" % self.filename

    def __call__(self, raw_datasets):
        # Count the recordings per symbol
        print_data = defaultdict(int)
        start_time = time.time()
        for i, raw_dataset in enumerate(raw_datasets):
            if i % 100 == 0 and i > 0:
                utils.print_status(len(raw_datasets), i, start_time)
            print_data[raw_dataset['handwriting'].formula_in_latex] += 1
        print("\r100%"+"\033[K\n")
        # Sort the data by highest value, descending
        print_data = sorted(print_data.items(),
                            key=lambda n: n[1],
                            reverse=True)
        # Write data to file
        with open(self.filename, "a") as write_file:
            write_file.write("symbol,trainingcount\n")  # heading
            write_file.write("total,%i\n" %
                             sum([value for _, value in print_data]))
            for symbol, value in print_data:
                write_file.write("%s,%i\n" % (symbol, value))
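A plug-in of this shape can be tried out without hwrt installed. The following is a minimal sketch under that assumption: FakeRecording and SymbolCount are hypothetical stand-ins, and the only thing taken from the example above is that each dataset entry exposes raw_dataset['handwriting'].formula_in_latex. It swaps in collections.Counter and the csv module instead of the manual counting and string formatting.

```python
import csv
import os
import tempfile
from collections import Counter


class FakeRecording:
    """Hypothetical stand-in for a hwrt handwriting object."""

    def __init__(self, formula_in_latex):
        self.formula_in_latex = formula_in_latex


class SymbolCount:
    """Count recordings per symbol and write the counts to a CSV file."""

    def __init__(self, filename):
        self.filename = filename

    def __call__(self, raw_datasets):
        counts = Counter(
            d["handwriting"].formula_in_latex for d in raw_datasets
        )
        with open(self.filename, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["symbol", "trainingcount"])  # heading
            writer.writerow(["total", sum(counts.values())])
            # most_common() yields (symbol, count) pairs, highest count first
            writer.writerows(counts.most_common())


# Six fake recordings of three different symbols
raw_datasets = [
    {"handwriting": FakeRecording(s)} for s in ["A", "B", "A", "A", "B", "C"]
]
out_path = os.path.join(tempfile.mkdtemp(), "trainingcount.csv")
SymbolCount(out_path)(raw_datasets)

with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Writing through csv.writer means a symbol that itself contains a comma is quoted instead of silently producing an extra column.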
Default metrics
There are also many ready-to-use metrics: