Merge Tags with django-taggit

Jan 01 2015

Today I cleaned up the database for Fiddle Salad and Python Fiddle. Both use the same Django back-end for code storage. While browsing tags, I noticed that many tags existed in both CamelCase and lowercase spellings. Since I was working on a tag suggest feature earlier this week, I decided to convert all tags to lowercase so that tag suggestions would not be redundant. An additional benefit is further normalization of the data.

Fortunately, I found a fork of django-taggit, the Django app I use for tagging, that supports enforcing lowercase tags everywhere. It already ships two management commands for normalizing data, mergetags and lowercasetags. django-taggit stores two fields for each tag, a name and a slug. lowercasetags converts all tag names to their lowercase form. mergetags takes at least two tag slugs and merges all of those tags into a single destination tag, so that all associations are moved to that one tag. While mergetags is suitable for manually resolving redundant data, the number of tags on Fiddle Salad is too large for that. I wrote a command, mergealltags, to automate this process:

from django.core.management.base import BaseCommand, CommandError
from taggit.models import Tag, TaggedItem
from django.core.exceptions import ObjectDoesNotExist


class Command(BaseCommand):
    help = 'merges all tags automatically'

    def merge(self, extra_slugs, dest_slug):
        """Move every item tagged with one of extra_slugs onto dest_slug."""
        try:
            dest_tag = Tag.objects.get(slug=dest_slug)
        except ObjectDoesNotExist:
            raise CommandError('Destination Tag "%s" does not exist' % dest_slug)

        for slug in extra_slugs:
            try:
                tag = Tag.objects.get(slug=slug)
            except ObjectDoesNotExist:
                raise CommandError('Tag "%s" does not exist' % slug)

            items = TaggedItem.objects.filter(tag=tag)
            count = items.count()
            for i, item in enumerate(items):
                # Report progress every 20 items.
                if i % 20 == 0:
                    self.stdout.write('Merging %s %d/%d\n' % (slug, i+1, count))
                obj = item.content_object
                if not obj:
                    # The tagged object no longer exists. Skip it rather than
                    # return, which would abort the merge and leave the
                    # redundant tag undeleted.
                    continue
                # Assumes the tag manager on the tagged model is named "tags".
                obj.tags.remove(tag)
                obj.tags.add(dest_tag)
            # Deleting the tag also removes any leftover associations.
            tag.delete()

            self.stdout.write('Successfully merged tags into "%s"\n' % dest_slug)

    def handle(self, *args, **options):
        for tag in Tag.objects.all():
            # After lowercasetags, redundant tags share a name but not a slug.
            tags = list(Tag.objects.filter(name=tag.name).order_by('id'))
            if len(tags) > 1:
                # Keep the oldest tag and merge the others into it.
                dest = tags[0].slug
                extras = [extra.slug for extra in tags[1:]]
                self.merge(extras, dest)

Because performance is not a concern for a one-off data-processing script, I did not bother to optimize the queries or the running time. This script would be useful for anyone who wants to normalize tags in the same manner, so it is in a git repository. Finally, I tested the new command on a clone of the production database.
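
The grouping step in handle can also be illustrated without Django. The following is only a minimal sketch, assuming tags are plain (id, name, slug) tuples rather than Tag model instances; plan_merges is a hypothetical helper and not part of django-taggit or the fork. It groups the tags in a single pass instead of running one query per tag, and returns the destination slug together with the extra slugs for every duplicated name.

```python
from collections import defaultdict


def plan_merges(tags):
    """Group (id, name, slug) tuples by name and plan one merge per duplicated name.

    Returns a list of (dest_slug, extra_slugs) pairs. The tag with the lowest
    id is kept as the destination, mirroring order_by('id') in the command.
    """
    by_name = defaultdict(list)
    for tag_id, name, slug in tags:
        by_name[name].append((tag_id, slug))

    plan = []
    for entries in by_name.values():
        if len(entries) < 2:
            continue  # unique name, nothing to merge
        entries.sort()  # lowest id first
        dest_slug = entries[0][1]
        extra_slugs = [slug for _, slug in entries[1:]]
        plan.append((dest_slug, extra_slugs))
    return plan


if __name__ == "__main__":
    # Hypothetical rows as they might look after lowercasetags has run.
    rows = [
        (1, "jquery", "jquery"),
        (7, "jquery", "jquery_1"),
        (3, "stylus", "stylus"),
        (9, "stylus", "stylus_1"),
        (4, "css", "css"),
    ]
    for dest, extras in plan_merges(rows):
        print(dest, extras)
```

Each pair in the result corresponds to one call of merge(extras, dest) in the command above.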

bash-4.1$ python manage.py lowercasetags
Lowercasing 1/1621
Lowercasing 21/1621
.
.
.
Lowercasing 1621/1621
bash-4.1$ python manage.py mergealltags
Merging jquery_1 1/46
Merging jquery_1 21/46
Merging jquery_1 41/46
Successfully merged tags into "jquery"
Successfully merged tags into "jquery"
Merging stylus_1 1/7
Successfully merged tags into "stylus"
Merging hello_1 1/10
Successfully merged tags into "hello"
Merging test_1 1/147
Merging test_1 21/147
Merging test_1 41/147
Merging test_1 61/147
Merging test_1 81/147
Merging test_1 101/147
Merging test_1 121/147
Merging test_1 141/147
Successfully merged tags into "test"
Merging me_1 1/2
Successfully merged tags into "me"
Merging no_1 1/5
Successfully merged tags into "no"
Merging one_1 1/16
Successfully merged tags into "one"
Merging things_1 1/3
Successfully merged tags into "things"
Merging learning_1 1/4
Successfully merged tags into "learning"
Successfully merged tags into "body"
Merging week-one_1 1/1
Successfully merged tags into "week-one"
Merging studio_1 1/33
Merging studio_1 21/33
Successfully merged tags into "studio"
Merging internet_1 1/36
Merging internet_1 21/36
Successfully merged tags into "internet"
Merging assignment_1 1/6
Successfully merged tags into "assignment"
Merging homework_1 1/6
Successfully merged tags into "homework"
Merging lessons_1 1/1
Successfully merged tags into "lessons"
Merging code_1 1/12
Merging tags_1 1/7
Successfully merged tags into "tags"
Merging two_1 1/6
Successfully merged tags into "two"
Merging salcedo_1 1/3
Successfully merged tags into "salcedo"
Merging page_1 1/15
Successfully merged tags into "page"
Merging music_1 1/4
Successfully merged tags into "music"
Merging table_1 1/7
Successfully merged tags into "table"
Merging band_1 1/9
Merging texas_1 1/1
Successfully merged tags into "texas"
Merging biography_1 1/2
Merging assignment-two_1 1/2
Successfully merged tags into "assignment-two"
Merging website_1 1/9
Merging a_1 1/2
Successfully merged tags into "a"
Merging words_1 1/1
Successfully merged tags into "words"
Merging section_1 1/2
Successfully merged tags into "section"
Merging header_1 1/1
Successfully merged tags into "header"
Merging ui_1 1/4
Successfully merged tags into "ui"
Merging first_1 1/8
Successfully merged tags into "first"
Merging random_1 1/1
Successfully merged tags into "random"
Merging internet-studio_1 1/5
Successfully merged tags into "internet-studio"
Merging angularjs_1 1/11
Successfully merged tags into "angularjs"
Merging i_1 1/2
Successfully merged tags into "i"
Merging lines_1 1/1
Successfully merged tags into "lines"
Merging row_1 1/1
Successfully merged tags into "row"
Merging alex-alpha_1 1/1
Successfully merged tags into "alex-alpha"
Merging assignment-one_1 1/2
Successfully merged tags into "assignment-one"
Merging google_1 1/1
Successfully merged tags into "google"
Merging man_1 1/4
Successfully merged tags into "man"
Merging nick_1 1/1
Successfully merged tags into "nick"
Merging cartoon_1 1/1
Successfully merged tags into "cartoon"
Merging batman_1 1/2
Successfully merged tags into "batman"
Merging code_1 1/7
Merging the_1 1/1
Successfully merged tags into "the"
Merging animation_1 1/2
Successfully merged tags into "animation"
Merging band_1 1/4
Merging assignment-one-of-three_1 1/2
Successfully merged tags into "assignment-one-of-three"
Merging status_1 1/1
Successfully merged tags into "status"
Merging python_1 1/2
Successfully merged tags into "python"
Merging cat_1 1/1
Successfully merged tags into "cat"
Merging none_1 1/7
Successfully merged tags into "none"
Merging adam_1 1/2
Successfully merged tags into "adam"
Merging school_1 1/3
Successfully merged tags into "school"
Merging website_1 1/9
Merging biography_1 1/2
Merging bootstrap_1 1/8
Successfully merged tags into "bootstrap"
Merging datamill_1 1/5
Successfully merged tags into "datamill"
Merging gentoo_1 1/2
Successfully merged tags into "gentoo"
Merging dobschal_1 1/1
Successfully merged tags into "dobschal"
Merging weimar_1 1/1
Successfully merged tags into "weimar"

Once everything went fine, I ran lowercasetags and mergealltags on both Fiddle Salad and Python Fiddle. I was really impressed with the results as I clicked through the tags on both sites. The tags on Fiddle Salad were much better organized because they were now ordered by popularity. While looking through the tags, I noticed that "test" was among the most popular. I decided to add "test" to the list of stopwords for django-taggit. These stopwords are removed during save so that they are not associated with new snippets.
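
The effect of lowercasing plus stopword removal on a list of submitted tags can be sketched in plain Python. This is an illustration only, not the fork's actual implementation: strip_stopwords and its parameters are hypothetical names chosen for this example.

```python
def strip_stopwords(tag_names, stopwords):
    """Return tag_names lowercased, without stopwords or duplicates, order preserved."""
    blocked = {word.lower() for word in stopwords}
    seen = set()
    kept = []
    for name in tag_names:
        normalized = name.strip().lower()
        # Drop empty names, stopwords, and names already kept.
        if not normalized or normalized in blocked or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept


if __name__ == "__main__":
    # "test" is configured as a stopword, as described in the post.
    print(strip_stopwords(["Test", "jQuery", "jquery", " CSS "], ["test"]))
```

With "test" as the only stopword, a snippet submitted with the tags above would end up tagged with just the lowercase names that remain.
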
Now that the tags are normalized, I am ready to move on and deploy tag suggestions.

One response so far

  • http://www.marcanuy.com/ Marcelo Canina

    After normalizing the data you can add the following case insensitive setting: “TAGGIT_CASE_INSENSITIVE = True”.