A Hugging Face model a day

2025-06-21

Let me begin by saying my interaction with Hugging Face has been almost nonexistent since trying out PCA models for a Deep Learning course.

The motivation behind this post comes from learning that HF has ways to automate most of the things I do day to day, plus small nice-to-have activities that would make my life marginally better.

Why waste time building stuff out when people have spent months or years maintaining models in (near) good health?

The idea is to have a one-stop place where I can record:

Today, two models: audio transcription, and image generation trained on specific features.

Whisper.cpp: Transcribing

The model that sparked this idea. Whisper.cpp supports local transcription, has no upload limit, and is fully free to use.

My main need here was transcribing a 1.45-hour interview with two people and no background noise. The tricky thing is the interview was done in Argentinian Spanish and the main speaker used a lot of gestures, meaning transcribing the audio also requires some flexibility around understanding that phrases accompanying gestures are not necessarily easy to transcribe.

I manually downloaded pre-converted models from here.

# Pre-converted model used
curl -L -o models/ggml-base.bin https://ggml.ggerganov.com/ggml-model-whisper-base.bin

# set up (run inside a clone of the whisper.cpp repo:
# git clone https://github.com/ggerganov/whisper.cpp)
brew install cmake
cmake --version

# build
cmake -B build
cmake --build build --config Release
# if it worked: [100%] Built target bench

One small setup step: Whisper does not support .m4a files. It only supports .wav, .mp3, .flac, and raw PCM formats, so you will need to convert the audio to .wav first.

# is your file not in .wav? convert it first
# you might need: brew install ffmpeg
# the Whisper models expect 16 kHz mono audio, so resample while converting
ffmpeg -i ~/Desktop/name.m4a -ar 16000 -ac 1 ~/Desktop/name.wav


# run the model: pass your audio file to -f, e.g. ~/Desktop/name.wav
# --output-txt saves the transcript to a .txt file next to the audio
./build/bin/whisper-cli -m models/ggml-base.bin -l es -f your-spanish-audio.wav --output-txt

In case you need to transcribe and translate the result into English, you can also do so by adding --translate before -l. Next up you will see a stream of the transcription in your terminal, which is then saved as a .txt file.

If you can't be bothered to read through the whole transcription, you can just throw it into ChatGPT, ask it to explain the main ideas that were talked about, and gather your main takeaways.
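
If you'd rather script that last step instead of pasting into the web UI, here is a minimal sketch using the OpenAI Python client. The model name, prompt, and transcript filename are my assumptions, not part of the original workflow.

from openai import OpenAI

# sketch: summarize the Whisper transcript via the API
# (model name, prompt, and file path below are assumptions)
client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("your-spanish-audio.wav.txt") as f:
    transcript = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Summarize the main ideas and takeaways of this interview transcript."},
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)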

CryptoGAN: making punk pfps

CryptoGAN is a lightweight GAN interface that lets you generate new pixel-art-style punk profile pictures based on the original CryptoPunks dataset.

It builds on SNGAN, which comes from "Spectral Normalization for Generative Adversarial Networks" by Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization "stabilizes training"; the high-level idea is that it keeps the discriminator from overpowering the generator, which leads to better image quality.
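
To make that concrete, here is a toy sketch of the core trick, written by me rather than taken from the cryptogans repo: estimate the largest singular value of a weight matrix with one step of power iteration and divide the weights by it, so the discriminator's layers stay roughly 1-Lipschitz.

import numpy as np

# toy spectral normalization: one power-iteration step to estimate the
# largest singular value of W, then scale W down by it
def spectral_normalize(W, u, eps=1e-12):
    v = W.T @ u
    v = v / (np.linalg.norm(v) + eps)
    u_new = W @ v
    u_new = u_new / (np.linalg.norm(u_new) + eps)
    sigma = float(u_new @ W @ v)   # approximate largest singular value
    return W / sigma, u_new

W = np.random.randn(64, 128)       # a made-up discriminator weight matrix
u = np.random.randn(64)            # persistent power-iteration vector
W_sn, u = spectral_normalize(W, u)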

This was not easy at all, and I wouldn't say I've achieved what I set out to do either. My idea was to get a completely clear, good-quality punk out of the generator, or a grid filled with a set of entirely new punks. Like the originals, but iterating on their design.

Step 1 — Download fails

I started by trying to download the full CryptoPunks image dataset from cryptopunks.app, but quickly hit rate limits (HTTP 429). Turns out, scraping 10,000 images gets you cut off pretty fast. Introducing sleeps and delays between requests did not work either.

Instead I pivoted to the huggingnft/cryptopunks dataset on Hugging Face. This gave me clean, ready-to-use PNGs of all punks. I wrote a script to extract them into a local data/images/ folder. Success.
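
A minimal sketch of what that extraction can look like, assuming the dataset exposes an "image" column of PIL images (an assumption on my part, not a guaranteed schema):

from pathlib import Path
from datasets import load_dataset

# sketch: dump the huggingnft/cryptopunks images to data/images/
# (assumes the dataset has an "image" column of PIL images)
out_dir = Path("data/images")
out_dir.mkdir(parents=True, exist_ok=True)

ds = load_dataset("huggingnft/cryptopunks", split="train")
for i, row in enumerate(ds):
    row["image"].save(out_dir / f"punk_{i:05d}.png")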

Step 2 — Training from scratch: fails, the sequel.

The original cryptogans repo claims to support training (train.py), but no such script exists. Now, building a training script myself was not the idea. So I gave up on retraining the model and decided to just use the pretrained generator instead – not the coolest process.

Step 3 — Loading the pretrained model.

The repository uses from_pretrained_keras(), which broke immediately because of a compatibility issue. Some painful debugging later, I learned that I needed to load the model using TFSMLayer(), the only way to load old SavedModels in Keras 3.
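
For reference, the load looks roughly like this in Keras 3. The SavedModel path, call endpoint, and latent size are assumptions on my part, not values from the repo.

import tensorflow as tf
import keras

# sketch: wrap an old TF SavedModel as an inference-only Keras 3 layer
# (path, endpoint, and latent size are assumptions)
generator = keras.layers.TFSMLayer("cryptopunks_generator", call_endpoint="serving_default")

z = tf.random.normal((1, 128))   # one latent vector
outputs = generator(z)           # returns the SavedModel's output dict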

Step 4 — Gradio: deprecated code everywhere.

Once the model loaded, the Gradio app didn't work. All the examples used functions that were removed in Gradio 4.x. I had to rewrite the interface using gr.Slider() and gr.Image() directly.
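
The rewritten interface ended up in the spirit of the sketch below; the placeholder generate function stands in for the actual TFSMLayer generator and its post-processing.

import gradio as gr
import numpy as np

# placeholder standing in for the real generator + post-processing
def generate_punks(num_punks):
    return np.random.randint(0, 256, (24, 24 * int(num_punks), 3), dtype=np.uint8)

demo = gr.Interface(
    fn=generate_punks,
    inputs=gr.Slider(minimum=1, maximum=16, step=1, value=4, label="Number of punks"),
    outputs=gr.Image(label="Generated punks"),
)

demo.launch()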

When the app finally ran, the output was washed out and blurry.

Not punk at all, sad ghosts at best.

Grey washed out punks

I had two bugs here: the colors were washed out, and the rendering was blurry.

Step 5 — Fixing image clarity: Interpolation.

This was faster. The model returned a dictionary, with the output tensor under a "postprocess" key. Once I extracted that tensor, clipped the values to [0, 1], converted them to uint8, and rendered them using interpolation="nearest", the punks looked punkier.
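
In code, the fix boils down to something like this, assuming outputs is the dictionary returned by the generator above; the exact tensor shape is an assumption.

import numpy as np
import matplotlib.pyplot as plt

# sketch of the fix: pull the tensor out of the output dict, clamp, convert, render
img = np.asarray(outputs["postprocess"])[0]   # drop the batch dimension (shape is an assumption)
img = np.clip(img, 0.0, 1.0)                  # clamp values to [0, 1]
img = (img * 255).astype(np.uint8)            # convert to 8-bit pixels

plt.imshow(img, interpolation="nearest")      # nearest-neighbour keeps the pixel art crisp
plt.axis("off")
plt.show()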

Clear quality punks

Also notice how the generated cigar introduced too much noise into the image. Some punks had half a cigar or smoke around them. No matter how many iterations I ran, the model seemed to skew against special features like the pipes, and long-haired punks never showed up.