I have a program that requires all keywords to be in a single paragraph, usually separated by commas.

For example:

I have these terms:

1-Term
1.1-Term
2-Term
3-Term
4-Term

that I collected and organized into groups and subgroups, with titles and subtitles:

Title

  • 1-Term

  • 1.1-Term

  • 2-Term

    • Sub-Title
      • 3-Term
      • 4-Term

But then I want to turn them into:

1-Term, 1.1-Term, 2-Term, 3-Term, 4-Term 
 

Removing certain marked words (titles and sub-titles), any empty/blank space, and line breaks, while adding commas between the terms. I want to keep certain dashes “-” (like the ones inside words):

1-Term,1.1-Term,2-Term,3-Term,4-Term
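The transformation described above can be sketched in Python. This is a minimal sketch under one assumption (since no marker is defined yet at this point): title and sub-title lines are the ones ending with a colon.

```python
def flatten(text):
    """Join all term lines with commas, dropping blanks and title lines."""
    terms = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.endswith(":"):
            continue                      # skip blank lines and colon-marked titles
        # Strip only the leading bullet; dashes inside terms are preserved.
        terms.append(stripped.lstrip("- ").strip())
    return ", ".join(terms)

sample = """Title:
- 1-Term
- 1.1-Term
- 2-Term
  Sub-Title:
  - 3-Term
  - 4-Term
"""
print(flatten(sample))
# -> 1-Term, 1.1-Term, 2-Term, 3-Term, 4-Term
```

The lstrip only touches the start of each line, which is what keeps the dashes inside terms like 1.1-Term intact.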

  • bus_factor@lemmy.world

    Your description is too vague to really get a good answer. In general, if you’re doing complex string manipulation, you’ll use a full-fledged programming language with regex support, like Python, Perl or Awk, possibly piped into each other and/or other tools like Sed or Cut. I can’t be more specific than that without a more specific description where you describe the actual data and criteria.

    Are you starting with the first or second example? Why do the prefix numbers change between examples? How do you tell text and title/subtitle apart?

    • Cactus_Head@programming.devOP

      “Why do the prefix numbers change between examples?”

      My bad, I fixed it.

      I want to show that two terms are related, e.g. Star Wars and Jedi, by grouping them together:

      Franchises

      Star Wars
      Jedi

      Transformers


      Also, I am not able to add line breaks between bullet points in markdown, so instead I get this:

      Franchises

      • Star Wars

      • Jedi

      • Transformers

      So I can’t show the grouping here on Lemmy. I would have also liked the list I make to be markdown-compatible, but I guess that’s a separate issue.

    • Cactus_Head@programming.devOP

      Basically I collect keywords (e.g. Transformers, A Deep Dive, Harry Potter, The Worst, Xbox, Star Wars, Jedi) from videos on my YouTube home page and organize them into lists:

      • YouTuber terms:

        • A Deep Dive
        • The Worst

      • Franchises:
        • Star wars
        • Jedi
        • Harry Potter
        • Transformers

      • Companies:

        • Xbox

      And turn it into:

      A Deep Dive, The Worst, Star wars, Jedi, Harry Potter, Transformers, Xbox
      
      

      Removing the titles and subtitles.

      “How do you tell text and title/subtitle apart?”

      I was thinking of putting a symbol, like “#” for example, in front of the title:

      # - YouTuber terms:  
      

      so the script knows to ignore that whole line, like a comment in most programming languages.
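      That marker idea can be sketched in Python. The format here is hypothetical, taken from this thread: lines starting with “#” are titles to skip, lines starting with “-” are terms.

```python
def extract_terms(lines):
    """Collect terms, ignoring '#'-marked title lines and blanks."""
    terms = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue                                  # blank or title line: skip entirely
        terms.append(stripped.lstrip("- ").strip())   # drop a leading bullet, keep the term
    return terms

sample = [
    "# - YouTuber terms:",
    "- A Deep Dive",
    "- The Worst",
    "",
]
print(", ".join(extract_terms(sample)))
# -> A Deep Dive, The Worst
```

      The same loop works whether the input comes from a list, a file object, or sys.stdin, since it only iterates over lines.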

      • a14o@feddit.org

        This is not difficult to achieve at all with tools like sed or awk. But unless you provide a concrete example input file or files, all we can do is point to those tools.

        • Cactus_Head@programming.devOP

          Something like this?

          - Franchise(Title): 
          
            - Harry potter
          
            - Perfect Blue
          
            - Jurassic world
            - Jurassic Park
          
            - Jedi
            - Star wars
            - The clone wars
          
            - MCU
          
            - Cartoons(Sub-Title):
          
              - Gumball 
          
              - Flapjack
          
              - Steven Universe
          
              - Stars vs. the forces of Evil
          
              - Wordgril
          
              - Flapjack
          
          

          Turned into

          Harry potter,Perfect Blue,Jurassic world,Jurassic Park,Jedi,Star wars,The clone wars,MCU,Gumball,Flapjack,Steven Universe,Stars vs. the forces of Evil,Wordgril,Flapjack
          

          Both “Franchise” and “Cartoons” were removed / not included with the other words.

          • bus_factor@lemmy.world

            If you can’t install a dedicated tool like yq but don’t mind creating a standalone script, Python with the PyYAML package (preinstalled on many systems) can do this on pretty much any computer, calculator or toaster you can get your hands on in 2026:

            #! /usr/bin/env python3
            
            import yaml  # third-party: PyYAML
            import sys
            
            def parse_yaml(filename):
                # Parse the whole input file as YAML.
                with open(filename) as fd:
                    return yaml.safe_load(fd)
            
            def get_leaf_nodes(data_iterable):
                # Recursively walk dicts and lists, collecting every
                # scalar value (the terms themselves).
                output = []
                for v in data_iterable:
                    if isinstance(v, dict):
                        output += get_leaf_nodes(v.values())
                    elif isinstance(v, list):
                        output += get_leaf_nodes(v)
                    else:
                        output.append(v)
                return output
            
            print(",".join(get_leaf_nodes(parse_yaml(sys.argv[1]))))
            
            $ /tmp/foo.py /tmp/foo.txt
            Harry potter,Perfect Blue,Jurassic world,Jurassic Park,Jedi,Star wars,The clone wars,MCU,Gumball,Flapjack,Steven Universe,Stars vs. the forces of Evil,Wordgril,Flapjack
            

            This takes the first argument on the command line, parses it as yaml, finds all leaf nodes recursively, and prints a comma-separated list of the results.
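            Since PyYAML is a third-party package, a stdlib-only Python variant of the same crude line filter might look like the following sketch. It borrows the regex idea used elsewhere in this thread: keep “- term” lines, drop any whose content ends in a colon.

```python
import re
import sys

def crude_terms(lines):
    """Return the '- term' entries, skipping title lines that end with a colon."""
    terms = []
    for line in lines:
        m = re.match(r"^\s*-\s*(.*[^:\s])\s*$", line)
        if m:                           # a dash line whose content doesn't end in ':'
            terms.append(m.group(1))
    return terms

if __name__ == "__main__":
    with open(sys.argv[1]) as fd:
        print(",".join(crude_terms(fd)))
```

            Unlike the YAML approach, this never builds a tree, so it tolerates indentation that isn’t strictly valid YAML.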

          • bus_factor@lemmy.world

            If you wanted a somewhat cruder approach using basically ubiquitous tools, you could do something like this:

            $ grep '^ *-' /tmp/foo.txt | grep -v ': *$' | sed 's/ *- //' | tr '\n' ',' | sed s'/,$/\n/'
            Harry potter,Perfect Blue,Jurassic world,Jurassic Park,Jedi,Star wars,The clone wars,MCU,Gumball ,Flapjack,Steven Universe,Stars vs. the forces of Evil,Wordgril,Flapjack 
            

            Here I’m first using grep '^ *-' to get all lines starting with any amount of whitespace and a leading dash, then piping that into grep -v ': *$' to remove anything ending with a colon (including lines with trailing whitespace after the colon), then sed 's/ *- //' to strip the leading whitespace and dash, then tr '\n' ',' to replace all newlines with commas, and finally sed s'/,$/\n/' to replace the trailing comma with a newline again (although sed is finicky across platforms with respect to newlines, so you may want to just replace it with an empty string instead).

            The above is hardly an efficient approach, but it does the job.

            • bus_factor@lemmy.world

              If you’re feeling a little old school (and some might say masochistic), you could do a similar crude parser with a Perl one-liner. This would be more efficient compute-wise, but it’s a bit of an acquired taste readability-wise:

              $ perl -ne 'chomp; push @a, $1 if /^\s*-\s*(.*[^:\s])\s*$/; END{print join(",", @a), "\n"}' /tmp/foo.txt
              Harry potter,Perfect Blue,Jurassic world,Jurassic Park,Jedi,Star wars,The clone wars,MCU,Gumball,Flapjack,Steven Universe,Stars vs. the forces of Evil,Wordgril,Flapjack
              

              Here perl -n makes perl look at each line individually, chomp strips off the trailing newline, we match /^\s*-\s*(.*[^:\s])\s*$/ (a line starting with a dash whose content does not end in a colon) and append the capture group to an implicitly declared array @a. Then we add an END{} block, executed after all lines are parsed, where we print the array joined on commas.

          • bus_factor@lemmy.world

            If you can stick to valid YAML like your example is, you can use a reasonably short yq command to get a comma-separated string of all scalar values:

            $ yq -r '[.. | scalars] | join(",")' /tmp/foo.txt                
            Harry potter,Perfect Blue,Jurassic world,Jurassic Park,Jedi,Star wars,The clone wars,MCU,Gumball,Flapjack,Steven Universe,Stars vs. the forces of Evil,Wordgril,Flapjack
            

            .. goes down the tree recursively, scalars filters out only scalar values, [] around those two makes them an array, and piping it all to join(",") makes it into a comma-separated string.

          • moonpiedumplings@programming.dev

            This is technically YAML, I think: a list (with one entry) of lists that contain mostly single items, but also one other list. You should be able to parse this with a YAML parser, like Python’s PyYAML library.

            Note that YAML is picky about the syntax, though, so it wouldn’t be able to handle deviations.